3.6 Exercises
3.1 Growth function of intervals in . Let be the set of intervals in . The VC-dimension of is 2. Compute its shattering coefficient . Compare your result with the general bound for growth functions.
3.2 Growth function and Rademacher complexity of thresholds in . Let be the family of threshold functions over the real line: . Give an upper bound on the growth function . Use that to derive an upper bound on .
3.3 Growth function of linear combinations. A linearly separable labeling of a set of vectors in is a classification of into two sets and with and for some .
Let be a subset of .
(a) Let be a dichotomy of and let . Show that and are linearly separable by a hyperplane going through the origin if and only if is linearly separable by a hyperplane going through the origin and .
(b) Let be a subset of such that any -element subset of with is linearly independent. Then, show that the number of linearly separable labelings of is . (Hint: prove by induction that .
(c) Let be functions mapping to . Define as the family of classifiers based on linear combinations of these functions:
Define by . Assume that there exists such that every -subset of is linearly independent. Then, show that
3.4 Lower bound on growth function. Prove that Sauer’s lemma (theorem 3.17) is tight, i.e., for any set of elements, show that there exists a hypothesis class of VC-dimension such that .
3.5 Finer Rademacher upper bound. Show that a finer upper bound on the Rademacher complexity of the family can be given in terms of , where is the number of ways to label the points in sample .
3.6 Singleton hypothesis class. Consider the trivial hypothesis set .
(a) Show that for any .
(b) Use a similar construction to show that Massart’s lemma (theorem 3.7) is tight.
3.7 Two function hypothesis class. Let be a hypothesis set reduced to two functions: and let be a sample of size .
(a) Assume that is the constant function taking value -1 and the constant function taking the value +1 . What is the VC-dimension of ? Upper bound the empirical Rademacher complexity (Hint: express in terms of the absolute value of a sum of Rademacher variables and apply Jensen’s inequality) and compare your bound with .
(b) Assume that is the constant function taking value -1 and the function taking value -1 everywhere except at where it takes the value +1. What is the VC-dimension of ? Compute the empirical Rademacher complexity .
3.8 Rademacher identities. Fix . Prove the following identities for any and any two hypothesis sets and of functions mapping from to :
(a) .
(b) .
(c) ,
where denotes the function (Hint: you could use the identity valid for all and Talagrand’s contraction lemma (see lemma 5.7)).
3.9 Rademacher complexity of intersection of concepts. Let and be two families of functions mapping to and let . Show that the empirical Rademacher complexity of for any sample of size can be bounded as follows:
Hint: use the Lipschitz function and Talagrand’s contraction lemma.
Use that to bound the Rademacher complexity of the family of intersections of two concepts and with and in terms of the Rademacher complexities of and .
3.10 Rademacher complexity of prediction vector. Let be a sample of size and fix .
(a) Denote by the vector of predictions of for . Give an upper bound on the empirical Rademacher complexity of in terms of (Hint: express in terms of the expectation of an absolute value and apply Jensen’s inequality). Suppose that for all . Express the bound on the Rademacher complexity in terms of the sparsity measure . What is that upper bound for the extreme values of the sparsity measure?
(b) Let be a family of functions mapping to . Give an upper bound on the empirical Rademacher complexity of and that of in terms of and .
3.11 Rademacher complexity of regularized neural networks. Let the input space be . In this problem, we consider the family of regularized neural networks defined by the following set of functions mapping to :
where is an -Lipschitz function. As an example, could be the sigmoid function which is 1-Lipschitz.
(a) Show that .
(b) Use the following form of Talagrand’s lemma valid for all hypothesis sets and -Lipschitz function :
to upper bound in terms of the empirical Rademacher complexity of , where is defined by
(c) Use the Cauchy-Schwarz inequality to show that
(d) Use the inequality , which holds by Jensen’s inequality to upper bound .
(e) Assume that for all for some . Use the previous questions to derive an upper bound on the Rademacher complexity of in terms of .
3.12 Rademacher complexity. Professor Jesetoo claims to have found a better bound on the Rademacher complexity of any hypothesis set of functions taking values in , in terms of its VC-dimension VCdim . His bound is of the form . Can you show that Professor Jesetoo’s claim cannot be correct? (Hint: consider a hypothesis set reduced to just two simple functions.)
3.13 VC-dimension of union of intervals. What is the VC-dimension of subsets of the real line formed by the union of intervals?
3.14 VC-dimension of finite hypothesis sets. Show that the VC-dimension of a finite hypothesis set is at most .
3.15 VC-dimension of subsets. What is the VC-dimension of the set of subsets of the real line parameterized by a single parameter ?
3.16 VC-dimension of axis-aligned squares and triangles.
(a) What is the VC-dimension of axis-aligned squares in the plane?
(b) Consider right triangles in the plane with the sides adjacent to the right angle both parallel to the axes and with the right angle in the lower left corner. What is the VC-dimension of this family?
3.17 VC-dimension of closed balls in . Show that the VC-dimension of the set of all closed balls in , i.e., sets of the form for some and , is less than or equal to .
3.18 VC-dimension of ellipsoids. What is the VC-dimension of the set of all ellipsoids in ?
3.19 VC-dimension of a vector space of real functions. Let be a finite-dimensional vector space of real functions on . Let be the set of hypotheses:
Show that , the VC-dimension of , is finite and that . (Hint: select an arbitrary set of points and consider linear mapping defined by: .)
3.20 VC-dimension of sine functions. Consider the hypothesis family of sine functions (Example 3.16): .
(a) Show that for any the points and cannot be shattered by this family of sine functions.
(b) Show that the VC-dimension of the family of sine functions is infinite. (Hint: show that can be shattered for any .)
3.21 VC-dimension of union of halfspaces. Provide an upper bound on the VC-dimension of the class of hypotheses described by the unions of halfspaces.
3.22 VC-dimension of intersection of halfspaces. Consider the class of convex intersections of halfspaces. Give lower and upper bound estimates for .
3.23 VC-dimension of intersection concepts.
(a) Let and be two concept classes. Show that for any concept class
{\Pi }_{\mathcal{C}}\left( m\right) \leq {\Pi }_{{\mathcal{C}}_{1}}\left( m\right) {\Pi }_{{\mathcal{C}}_{2}}\left( m\right) \tag{3.53}
(b) Let be a concept class with VC-dimension and let be the concept class formed by all intersections of concepts from . Show that the VC-dimension of is bounded by . (Hint: show that for any .)
3.24 VC-dimension of union of concepts. Let and be two sets of functions mapping from into , and assume that both and have finite VC-dimension, with and . Let be the union of and .
(a) Prove that for all .
(b) Use Sauer’s lemma to show that for , and give a bound on the VC-dimension of .
3.25 VC-dimension of symmetric difference of concepts. For two sets and , let denote the symmetric difference of and , i.e., . Let be a non-empty family of subsets of with finite VC-dimension. Let be an element of and define . Show that
3.26 Symmetric functions. A function is symmetric if its value is uniquely determined by the number of 1’s in the input. Let denote the set of all symmetric functions.
(a) Determine the VC-dimension of .
(b) Give lower and upper bounds on the sample complexity of any consistent PAC learning algorithm for .
(c) Note that any hypothesis can be represented by a vector , where is the value of on examples having precisely ‘s. Devise a consistent learning algorithm for based on this representation.
3.27 VC-dimension of neural networks. {#vc-dimension-of-neural-networks. .unnumbered}
Let be a concept class over with VC-dimension . A -neural network with one intermediate layer is a concept defined over that can be represented by a directed acyclic graph such as that of Figure 3.7, in which the input nodes are those at the bottom and in which each other node is labeled with a concept .
The output of the neural network for a given input vector is obtained as follows. First, each of the input nodes is labeled with the corresponding value . Next, the value at a node in the higher layer and labeled with is obtained by applying to the values of the input nodes admitting an
::: center {width=“30%”} :::
Figure 3.7
A neural network with one intermediate layer.
edge ending in . Note that since takes values in , the value at is in . The value at the top or output node is obtained similarly by applying the corresponding concept to the values of the nodes admitting an edge to the output node.
(a) Let denote the set of all neural networks defined as above with internal nodes. Show that the growth function can be upper bounded in terms of the product of the growth functions of the hypothesis sets defined at each intermediate layer.
(b) Use that to upper bound the VC-dimension of the C-neural networks (Hint: you can use the implication valid for , and with ).
(c) Let be the family of concept classes defined by threshold functions . Give an upper bound on the VC-dimension of in terms of and .
3.28 VC-dimension of convex combinations. Let be a family of functions mapping from an input space to and let be a positive integer. Give an upper bound on the VC-dimension of the family of functions defined by
(Hint: you can use exercise 3.27 and its solution).
3.29 Infinite VC-dimension.
(a) Show that if a concept class has infinite VC-dimension, then it is not PAC-learnable.
(b) In the standard PAC-learning scenario, the learning algorithm receives all examples first and then computes its hypothesis. Within that setting, PAC-learning of concept classes with infinite VC-dimension is not possible as seen in the previous question.
Imagine now a different scenario where the learning algorithm can alternate between drawing more examples and computation. The objective of this problem is to prove that PAC-learning can then be possible for some concept classes with infinite VC-dimension.
Consider for example the special case of the concept class of all subsets of natural numbers. Professor Vitres has an idea for the first stage of a learning algorithm PAC-learning . In the first stage, draws a sufficient number of points such that the probability of drawing a point beyond the maximum value observed be small with high confidence. Can you complete Professor Vitres’ idea by describing the second stage of the algorithm so that it PAC-learns ? The description should be augmented with the proof that can PAC-learn C.
3.30 VC-dimension generalization bound - realizable case. In this exercise we show that the bound given in corollary 3.19 can be improved to in the realizable setting. Assume we are in the realizable scenario, i.e. the target concept is included in our hypothesis class . We will show that if a hypothesis is consistent with a sample then for any such that
\mathbb{P}\left\lbrack {R\left( h\right) > \epsilon }\right\rbrack \leq 2{\left\lbrack \frac{2em}{d}\right\rbrack }^{d}{2}^{-{m\epsilon }/2}. \tag{3.54}
(a) Let be the subset of hypotheses consistent with the sample , let denote the empirical error with respect to the sample and define as another independent sample drawn from . Show that the following inequality holds for any :
where is a binomial random variable with parameters . (Hint: prove and use the fact that .)
(b) Prove that . Use this inequality along with the result from (a) to show that for any
(c) Instead of drawing two samples, we can draw one sample of size then uniformly at random split it into and . The right hand side of part (b) can then be rewritten as:
Let be a hypothesis such that and let be the total number of errors makes on . Show that the probability of all errors falling into is upper bounded by .
(d) Part (b) implies that for any
Use this bound to show that for any
(e) Complete the proof of inequality (3.54) by using the union bound to upper bound . Show that we can achieve a high probability generalization bound that is of the order .
3.31 Generalization bound based on covering numbers. Let be a family of functions mapping to a subset of real numbers . For any , the covering number of for the norm is the minimal such that can be covered with balls of radius , that is, there exists such that, for all , there exists with . In particular, when is a compact set, a finite covering can be extracted from a covering of with balls of radius and thus is finite.
Covering numbers provide a measure of the complexity of a class of functions: the larger the covering number, the richer is the family of functions. The objective of this problem is to illustrate this by proving a learning bound in the case of the squared loss. Let denote a distribution over according to which
labeled examples are drawn. Then, the generalization error of for the squared loss is defined by and its empirical error for a labeled sample by . We will assume that is bounded, that is there exists such that for all . The following is the generalization bound proven in this problem:
\underset{S \sim {\mathcal{D}}^{m}}{\mathbb{P}}\left\lbrack {\mathop{\sup }\limits_{{h \in \mathcal{H}}}\left| {R\left( h\right) - {\widehat{R}}_{S}\left( h\right) }\right| \geq \epsilon }\right\rbrack \leq \mathcal{N}\left( {\mathcal{H},\frac{\epsilon }{8M}}\right) 2\exp \left( \frac{-m{\epsilon }^{2}}{2{M}^{4}}\right) . \tag{3.55}
The proof is based on the following steps.
(a) Let , then show that for all and any labeled sample , the following inequality holds:
(b) Assume that can be covered by subsets , that is . Then, show that, for any , the following upper bound holds:
(c) Finally, let and let be balls of radius centered at covering . Use part (a) to show that for all ,
and apply Hoeffding’s inequality (theorem D.2) to prove (3.55).