Moving Beyond Finite Hypothesis Sets
The foundational problem in learning theory is balancing two questions:
- Can we ensure that $E_{out}(g)$ is close to $E_{in}(g)$?
- Can we make $E_{in}(g)$ small enough?
Previously, we used Hoeffding's Inequality, $P[|E_{in}(g) - E_{out}(g)| > \epsilon] \leq 2M e^{-2\epsilon^2 N}$, which relied on $M$ (the number of hypotheses). However, for most real-world hypothesis sets (like lines in a plane), $M = \infty$, which makes the standard bound useless.
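To see the problem numerically, here is a small sketch (the function name `hoeffding_bound` is my own) evaluating the right-hand side of the inequality: for a modest finite $M$ the bound is tiny, but for a huge $M$ it exceeds 1 and tells us nothing.

```python
import math

def hoeffding_bound(M, N, epsilon):
    """Upper bound on P[|E_in - E_out| > epsilon]: 2 * M * exp(-2 * eps^2 * N)."""
    return 2 * M * math.exp(-2 * epsilon**2 * N)

# A finite hypothesis set gives a meaningful guarantee:
print(hoeffding_bound(M=100, N=1000, epsilon=0.1))    # well below 1

# But the bound blows past 1 as M grows, becoming vacuous:
print(hoeffding_bound(M=10**9, N=1000, epsilon=0.1))  # greater than 1
```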
The Union Bound Problem
The factor $M$ comes from the Union Bound, which assumes the "bad events" (where $E_{in}$ and $E_{out}$ diverge) for different hypotheses are non-overlapping. In reality, similar hypotheses overlap significantly, meaning we overcount the "badness". To fix this, we need a way to group similar hypotheses and count the effective number of choices.
Dichotomies and the Growth Function
Instead of counting every possible line in a space, we count how many different ways those lines can classify a specific set of points.
- Dichotomy: A "mini-hypothesis" that only considers the labels assigned to the $N$ training data points.
- While there are infinitely many hypotheses, there are at most $2^N$ possible dichotomies for $N$ points.
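We can observe this collapse from infinitely many lines to finitely many dichotomies directly. The sketch below (the helper `count_dichotomies` is hypothetical) samples random 2D perceptrons $\mathrm{sign}(w_0 + w_1 x + w_2 y)$ and records the distinct label patterns they produce on a fixed point set.

```python
import random

def count_dichotomies(points, trials=100_000, seed=0):
    """Estimate how many distinct dichotomies 2D perceptrons realise on
    the given points, by sampling random weight vectors (w0, w1, w2)."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(trials):
        w0, w1, w2 = (rng.uniform(-1, 1) for _ in range(3))
        pattern = tuple(1 if w0 + w1 * x + w2 * y > 0 else -1
                        for x, y in points)
        seen.add(pattern)
    return len(seen)

three = [(0, 0), (1, 0), (0, 1)]
four  = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(count_dichotomies(three))  # 8 = 2^3: all labelings of 3 points appear
print(count_dichotomies(four))   # 14 < 16: the two XOR labelings never appear
```

Infinitely many weight vectors, yet only 8 (respectively 14) effective behaviours on the data.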
The Growth Function
To remove the dependence on specific point locations, we define the growth function as the maximum number of dichotomies the hypothesis set can create for any $N$ points:

$$m_{\mathcal{H}}(N) = \max_{x_1, \dots, x_N} |\mathcal{H}(x_1, \dots, x_N)|$$
Tip
If a hypothesis set can produce all $2^N$ possible labelings for a set of $N$ points (i.e. $m_{\mathcal{H}}(N) = 2^N$), we say it shatters those points.
Defining VC Dimension
The Vapnik-Chervonenkis (VC) Dimension is the mathematical bridge that allows us to handle infinite models.
Definition: $d_{VC}(\mathcal{H})$ is the largest $N$ for which $m_{\mathcal{H}}(N) = 2^N$.
- It is the most points $\mathcal{H}$ can shatter.
- If the break point is $k$, then $d_{VC} = k - 1$.
Examples of $d_{VC}$
- 2D Perceptron: $d_{VC} = 3$ (can shatter 3 points, but the break point is 4).
- 2D Rectangle: $d_{VC} = 4$.
- General Perceptron: For a $d$-dimensional space, $d_{VC} = d + 1$ (the $+1$ accounts for the bias term).
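A finite $d_{VC}$ is what tames the growth function: by Sauer's lemma, $m_{\mathcal{H}}(N) \leq \sum_{i=0}^{d_{VC}} \binom{N}{i}$, which is polynomial in $N$ rather than exponential. A quick sketch (the function `growth_bound` is my own name) compares this bound against $2^N$ for the 2D perceptron:

```python
from math import comb

def growth_bound(N, d_vc):
    """Sauer's lemma: m_H(N) <= sum_{i=0}^{d_vc} C(N, i), polynomial in N."""
    return sum(comb(N, i) for i in range(d_vc + 1))

# For the 2D perceptron (d_vc = 3) the bound grows like N^3, not 2^N:
for N in (3, 4, 10):
    print(N, growth_bound(N, 3), 2**N)
# N=3:  bound 8   vs 2^N = 8    (3 points can be shattered)
# N=4:  bound 15  vs 2^N = 16   (4 points cannot)
# N=10: bound 176 vs 2^N = 1024 (the gap keeps widening)
```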
The VC Generalisation Bound
The VC dimension replaces $M$ in the generalisation inequalities. With probability at least $1 - \delta$, this is the VC Inequality:

$$E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$$

The square root term is the **Penalty for Model Complexity**, $\Omega(N, \mathcal{H}, \delta)$:
- High $d_{VC}$: $E_{in}$ will be low (the model is powerful), but $\Omega$ will be high (high risk of overfitting).
- Low $d_{VC}$: $\Omega$ will be low (good generalisation), but $E_{in}$ might be high (the model is too simple).
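This trade-off can be evaluated numerically. The sketch below (the helper `vc_penalty` is hypothetical) computes the penalty term, using the Sauer bound as a stand-in for $m_{\mathcal{H}}(2N)$: more data shrinks it, more model power inflates it.

```python
import math
from math import comb

def vc_penalty(N, d_vc, delta=0.05):
    """Complexity penalty Omega = sqrt((8/N) * ln(4 * m_H(2N) / delta)),
    bounding m_H(2N) by Sauer's lemma: sum_{i<=d_vc} C(2N, i)."""
    m = sum(comb(2 * N, i) for i in range(d_vc + 1))
    return math.sqrt((8 / N) * math.log(4 * m / delta))

# More data shrinks the penalty; a higher d_vc inflates it:
print(vc_penalty(N=1000, d_vc=3))    # moderate
print(vc_penalty(N=10000, d_vc=3))   # smaller: 10x the data
print(vc_penalty(N=1000, d_vc=10))   # larger: a more powerful model
```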
Practical Sample Complexity
How many data points do we actually need for a model to generalise?
- Theoretical Bound: For a given $\epsilon$ and $\delta$, theory often suggests $N \approx 10{,}000 \cdot d_{VC}$.
- Practical Rule of Thumb: In practice, $N \geq 10 \cdot d_{VC}$ is often sufficient to achieve good results.
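The gap between theory and practice can be made concrete: search for the smallest $N$ at which the VC penalty drops below a target $\epsilon$. Both helpers below (`vc_penalty`, `samples_needed`) are my own names, and the penalty again uses the Sauer bound for $m_{\mathcal{H}}(2N)$.

```python
import math
from math import comb

def vc_penalty(N, d_vc, delta=0.05):
    """Omega = sqrt((8/N) * ln(4 * m_H(2N) / delta)), with Sauer's bound."""
    m = sum(comb(2 * N, i) for i in range(d_vc + 1))
    return math.sqrt((8 / N) * math.log(4 * m / delta))

def samples_needed(d_vc, epsilon=0.1, delta=0.05):
    """Smallest power-of-two N at which the VC penalty falls below epsilon."""
    N = 1
    while vc_penalty(N, d_vc, delta) > epsilon:
        N *= 2
    return N

d_vc = 3  # e.g. the 2D perceptron
print(samples_needed(d_vc))  # theory demands tens of thousands of examples
print(10 * d_vc)             # the practical rule of thumb asks for only 30
```

The theoretical requirement is loose by orders of magnitude, which is exactly why the $10 \cdot d_{VC}$ rule of thumb is used in practice.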