Vapnik–Chervonenkis theory

Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.

Introduction

VC theory covers at least four parts (as explained in The Nature of Statistical Learning Theory

= \sup_{f \in \mathcal{F \dfrac{1}{n} \left|\sum_{i = 1}^n f(X_i) - \mathbb{E} f(Y_i) \right| \leq \mathbb{E}_{Y} \sup_{f \in \mathcal{F \dfrac{1}{n} \left|\sum_{i = 1}^n f(X_i) - f(Y_i) \right|</math>

Therefore, by Jensen's inequality:

:<math>\Phi(\|\mathbb{P}_n - P\|_{\mathcal{F) \leq \mathbb{E}_{Y} \Phi \left(\left\| \dfrac{1}{n}\sum_{i = 1}^n f(X_i) - f(Y_i) \right\|_{\mathcal{F \right)</math>

Taking expectation with respect to <math>X</math> gives:

:<math>\mathbb{E}\Phi(\|\mathbb{P}_n - P\|_{\mathcal{F) \leq \mathbb{E}_{X} \mathbb{E}_{Y} \Phi \left(\left\| \dfrac{1}{n}\sum_{i = 1}^n f(X_i) - f(Y_i) \right\|_{\mathcal{F\right)</math>

Note that adding a minus sign in front of a term <math>f(X_i) - f(Y_i)</math> doesn't change the RHS, because it's a symmetric function of <math>X</math> and <math>Y</math>. Therefore, the RHS remains the same under "sign perturbation":

:<math>\mathbb{E} \Phi \left( \left\| \dfrac{1}{n}\sum_{i = 1}^n e_i \left(f(X_i) - f(Y_i)\right) \right\|_{\mathcal{F \right) </math>

for any <math>(e_1,e_2,\ldots,e_n) \in \{-1,1\}^n</math>. Therefore:

:<math>\mathbb{E}\Phi(\|\mathbb{P}_n - P\|_{\mathcal{F) \leq \mathbb{E}_{\varepsilon} \mathbb{E} \Phi \left(\left\| \dfrac{1}{n}\sum_{i=1}^n \varepsilon_i \left(f(X_i) - f(Y_i)\right) \right\|_{\mathcal{F \right)</math>

Finally using first triangle inequality and then convexity of <math>\Phi</math> gives:

:<math>\mathbb{E}\Phi(\|\mathbb{P}_n - P\|_{\mathcal{F) \leq \dfrac{1}{2}\mathbb{E}_{\varepsilon} \mathbb{E} \Phi \left( 2 \left \| \dfrac{1}{n}\sum_{i = 1}^n \varepsilon_i f(X_i)\right\|_{\mathcal{F \right) + \dfrac{1}{2}\mathbb{E}_{\varepsilon} \mathbb{E} \Phi \left( 2 \left\| \dfrac{1}{n}\sum_{i = 1}^n \varepsilon_i f(Y_i)\right\|_{\mathcal{F \right)</math>

Where the last two expressions on the RHS are the same, which concludes the proof.

A typical way of proving empirical CLTs, first uses symmetrization to pass the empirical process to <math>\mathbb{P}_n^0</math> and then argue conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties.

VC Connection

It turns out that there is a fascinating connection between certain combinatorial properties of the set <math>\mathcal{F}</math> and the entropy numbers. Uniform covering numbers can be controlled by the notion of Vapnik–Chervonenkis classes of sets – or shortly VC sets.

Consider a collection <math>\mathcal{C}</math> of subsets of the sample space <math>\mathcal{X}</math>. <math>\mathcal{C}</math> is said to pick out a certain subset <math>W</math> of the finite set <math>S = \{x_1,\ldots, x_n\} \subset \mathcal{X}</math> if <math>W = S \cap C</math> for some <math>C \in \mathcal{C}</math>. <math>\mathcal{C}</math> is said to shatter if it picks out each of its subsets. The VC-index (similar to VC dimension + 1 for an appropriately chosen classifier set) <math>V(\mathcal{C})</math> of <math>\mathcal{C}</math> is the smallest for which no set of size is shattered by <math>\mathcal{C}</math>.

Sauer's lemma then states that the number <math>\Delta_n(\mathcal{C}, x_1, \ldots, x_n)</math> of subsets picked out by a VC-class <math>\mathcal{C}</math> satisfies:

:<math>\max_{x_1,\ldots, x_n} \Delta_n(\mathcal{C}, x_1, \ldots, x_n) \leq \sum_{j = 0}^{V(\mathcal{C}) - 1} {n \choose j} \leq \left( \frac{n e}{V(\mathcal{C}) - 1}\right)^{V(\mathcal{C}) - 1}</math>

Which is a polynomial number <math>O(n^{V(\mathcal{C}) - 1})</math> of subsets rather than an exponential number. Intuitively this means that a finite VC-index implies that <math>\mathcal{C}</math> has an apparent simplistic structure.

A similar bound can be shown (with a different constant, same rate) for the so-called VC subgraph classes. For a function <math>f : \mathcal{X} \to \mathbf{R}</math> the subgraph is a subset of <math>\mathcal{X} \times \mathbf{R}</math> such that: <math>\{(x,t): t < f(x)\}</math>. A collection of <math>\mathcal{F}</math> is called a VC subgraph class if all subgraphs form a VC-class.

Consider a set of indicator functions <math>\mathcal{I}_{\mathcal{C = \{1_C: C \in \mathcal{C} \}</math> in <math>L_1(Q)</math> for discrete empirical type of measure (or equivalently for any probability measure ). It can then be shown that quite remarkably, for <math>r \geq 1</math>:

:<math>N(\varepsilon, \mathcal{I}_{\mathcal{C, L_r(Q)) \leq KV(\mathcal{C}) (4e)^{V(\mathcal{C})} \varepsilon^{-r (V(\mathcal{C}) - 1)}</math>

Further consider the symmetric convex hull of a set <math>\mathcal{F}</math>: <math>\operatorname{sconv}\mathcal{F}</math> being the collection of functions of the form <math>\sum_{i =1}^m \alpha_i f_i</math> with <math>\sum_{i =1}^m |\alpha_i| \leq 1</math>. Then if

:<math>N \left (\varepsilon\|F\|_{Q,2}, \mathcal{F}, L_2(Q) \right) \leq C \varepsilon^{-V}</math>

the following is valid for the convex hull of <math>\mathcal{F}</math>:

:<math>\log N \left (\varepsilon\|F\|_{Q,2}, \operatorname{sconv}\mathcal{F}, L_2(Q) \right ) \leq K \varepsilon^{-\frac{2V}{V + 2</math>

The important consequence of this fact is that

: <math>\frac{2V}{V + 2} < 2, </math>

which is just enough so that the entropy integral is going to converge, and therefore the class <math>\operatorname{sconv}\mathcal{F}</math> is going to be -Donsker.

Finally an example of a VC-subgraph class is considered. Any finite-dimensional vector space <math>\mathcal{F}</math> of measurable functions <math>f:\mathcal{X} \to \mathbf{R}</math> is VC-subgraph of index smaller than or equal to <math>\dim(\mathcal{F}) + 2</math>.

Proof: Take <math>n = \dim(\mathcal{F}) + 2 </math> points <math>(x_1, t_1), \ldots, (x_n, t_n)</math>. The vectors:

:<math>(f(x_1), \ldots, f(x_n)) - (t_1, \ldots, t_n)</math>

are in a dimensional subspace of . Take , a vector that is orthogonal to this subspace. Therefore:

:<math>\sum_{a_i > 0} a_i (f(x_i) - t_i) = \sum_{a_i < 0} (-a_i) (f(x_i) - t_i), \quad \forall f \in \mathcal{F}</math>

Consider the set <math>S = \{(x_i, t_i): a_i > 0\}</math>. This set cannot be picked out since if there is some <math>f</math> such that <math>S = \{(x_i,t_i): f(x_i) > t_i\}</math> that would imply that the LHS is strictly positive but the RHS is non-positive.

There are generalizations of the notion VC subgraph class, e.g. there is the notion of pseudo-dimension.

</references>

See references in articles: Richard M. Dudley, empirical processes, Shattered set.
This is a translation by B. Seckler, of the 1968 note.
Reprinted in
They obtained results in a draft form in July 1966 and announced in 1968 in their note
The paper was first published properly in Russian as