Two Perspectives on Learning: VC vs. Bias-Variance
Learning theory provides two distinct frameworks for understanding why models succeed or fail:
VC Analysis (The “Worst Case” View)
- Goal: To provide a uniform bound on error that holds for any training set.
- Loss Function: Typically uses 0-1 loss (classification).
- Intuition: $E_{\text{out}} \le E_{\text{in}} + \Omega$, where $\Omega$ is a penalty for model complexity.
Bias-Variance Analysis (The “Average Case” View)
- Goal: To decompose the average out-of-sample error across all possible training sets.
- Loss Function: Uses squared error loss, as its differentiability allows for a cleaner mathematical decomposition.
- Applicability: Primarily used for real-valued target functions.
Mathematical Decomposition of Error
To quantify the tradeoff, we assume the existence of an average hypothesis $\bar{g}(x)$, which is what you would get if you trained on an infinite number of different datasets.
The Average Hypothesis
$\bar{g}(x) = \mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(x)\right]$, where $g^{(\mathcal{D})}$ is the hypothesis learned from a particular dataset $\mathcal{D}$.
The Three Components of $E_{\text{out}}$
The expected out-of-sample error at a point $x$ can be broken down into three distinct parts:
- Bias: $\text{bias}(x) = \big(\bar{g}(x) - f(x)\big)^2$. This is how far the “average” prediction is from the truth.
- Variance: $\text{var}(x) = \mathbb{E}_{\mathcal{D}}\big[\big(g^{(\mathcal{D})}(x) - \bar{g}(x)\big)^2\big]$. This is the fluctuation of individual models around their average.
- Noise ($\sigma^2$): the inherent randomness in the target distribution $P(y \mid x)$.
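Why squared error splits this cleanly comes down to one step: add and subtract $\bar{g}(x)$ inside the square, and the cross term vanishes. A short derivation:

```latex
\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - f(x))^2\big]
  = \underbrace{\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\big]}_{\text{var}(x)}
  + \underbrace{\big(\bar{g}(x) - f(x)\big)^2}_{\text{bias}(x)}
```

The cross term $2\,\mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x) - \bar{g}(x)\big]\big(\bar{g}(x) - f(x)\big)$ is zero by the definition of $\bar{g}$. With noisy targets $y = f(x) + \epsilon$, the independent noise contributes the further additive term $\sigma^2$.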
Intuition
Think of a dartboard. Low Bias/Low Var is a tight cluster at the bullseye. High Bias/Low Var is a tight cluster far from the bullseye. Low Bias/High Var is a loose cluster centered around the bullseye. High Bias/High Var is a loose cluster far from the bullseye.
The Complexity Tradeoff
Model complexity directly impacts the balance between bias and variance:
- Simple Models (Low Complexity):
- Have High Bias (cannot represent complex target functions).
- Have Low Variance (very stable; predictions don’t change much with different data).
- Complex Models (High Complexity):
- Have Low Bias (can fit almost any pattern).
- Have High Variance (highly sensitive to noise; “behave wildly” after seeing specific data).
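This tradeoff can be observed numerically with a Monte Carlo sketch: repeatedly draw a small noisy dataset from a target, fit polynomials of increasing degree, and estimate bias and variance from the spread of fits. The sine target, noise level, and sample sizes below are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(np.pi * x)   # illustrative target function
N, runs, noise = 15, 1000, 0.2    # hypothetical experiment settings
x_test = np.linspace(-1, 1, 200)

bias_by_deg, var_by_deg = {}, {}
for deg in (0, 1, 3, 5):
    preds = np.empty((runs, x_test.size))
    for r in range(runs):
        # Fresh training set each run; fit a polynomial of the given degree.
        x = rng.uniform(-1, 1, N)
        y = f(x) + rng.normal(0, noise, N)
        preds[r] = np.polyval(np.polyfit(x, y, deg), x_test)
    g_bar = preds.mean(axis=0)                            # average hypothesis
    bias_by_deg[deg] = np.mean((g_bar - f(x_test)) ** 2)  # E_x[(g_bar - f)^2]
    var_by_deg[deg] = np.mean(preds.var(axis=0))          # E_x[Var_D(g)]
    print(f"degree {deg}: bias={bias_by_deg[deg]:.3f}  var={var_by_deg[deg]:.3f}")
```

As complexity grows, bias should fall while variance rises, matching the bullets above.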
Example: Learning with $f(x) = \sin(\pi x)$
Consider fitting two points from a sine wave:
- Model 0 ($\mathcal{H}_0$, constant $h(x) = b$): High bias (0.50) but low variance (0.25). Resulting $E_{\text{out}} \approx 0.75$.
- Model 1 ($\mathcal{H}_1$, line $h(x) = ax + b$): Lower bias (0.21) but massive variance (1.69). Resulting $E_{\text{out}} \approx 1.90$.
- Insight: In this small-data scenario, the simpler model actually performs better because its variance is so much lower.
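The two-point experiment is easy to replicate by simulation: draw many two-point datasets from the sine curve, fit both models, and average. This is a sketch assuming the target $\sin(\pi x)$ on $[-1, 1]$ with noiseless samples; the run count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)
runs = 10_000
x_test = np.linspace(-1, 1, 200)

# Each dataset: two points sampled uniformly from the sine curve (no noise).
x1, x2 = rng.uniform(-1, 1, (2, runs))
y1, y2 = f(x1), f(x2)

# H0: constant h(x) = b, the midpoint of the two y-values.
h0 = ((y1 + y2) / 2)[:, None] * np.ones_like(x_test)

# H1: the line through the two points.
a = (y2 - y1) / (x2 - x1)
b = y1 - a * x1
h1 = a[:, None] * x_test + b[:, None]

results = {}
for name, preds in [("H0", h0), ("H1", h1)]:
    g_bar = preds.mean(axis=0)
    bias = np.mean((g_bar - f(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    results[name] = (bias, var)
    print(f"{name}: bias={bias:.2f}  var={var:.2f}  bias+var={bias + var:.2f}")
```

The estimates should land near the numbers quoted above: roughly 0.50/0.25 for the constant model and a far larger variance for the line, so the “dumber” model wins on $E_{\text{out}}$.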
Learning Curves
A learning curve tracks how $E_{\text{in}}$ and $E_{\text{out}}$ change as the number of training points $N$ increases:
- As $N \to \infty$: Both $E_{\text{in}}$ and $E_{\text{out}}$ converge toward the noise level $\sigma^2$.
- Linear Regression Case:
- Expected errors: $\mathbb{E}[E_{\text{in}}] = \sigma^2\left(1 - \frac{d+1}{N}\right)$ and $\mathbb{E}[E_{\text{out}}] = \sigma^2\left(1 + \frac{d+1}{N}\right)$, where $d$ is the input dimension.
- Generalisation Error: $\mathbb{E}[E_{\text{out}}] - \mathbb{E}[E_{\text{in}}] = \frac{2\sigma^2(d+1)}{N}$.
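The linear-regression learning curve can be checked by simulation. The sketch below assumes a linear target $y = \mathbf{x}^\top \mathbf{w}^* + \epsilon$ with Gaussian noise, and measures out-of-sample error on the same inputs with fresh noise (a convention under which $\sigma^2(1 \pm \frac{d+1}{N})$ holds exactly); the dimension, noise level, and run count are choices made here.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, runs = 5, 1.0, 2000   # hypothetical dimension, noise level, trials

for N in (10, 20, 50, 100):
    e_in, e_out = [], []
    for _ in range(runs):
        w_star = rng.normal(size=d + 1)                       # true weights
        X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
        y = X @ w_star + rng.normal(0, sigma, N)
        w = np.linalg.lstsq(X, y, rcond=None)[0]              # least-squares fit
        e_in.append(np.mean((X @ w - y) ** 2))
        # Same inputs, fresh noise, for the out-of-sample estimate.
        y_new = X @ w_star + rng.normal(0, sigma, N)
        e_out.append(np.mean((X @ w - y_new) ** 2))
    t_in = sigma**2 * (1 - (d + 1) / N)
    t_out = sigma**2 * (1 + (d + 1) / N)
    print(f"N={N:4d}  E_in={np.mean(e_in):.3f} (theory {t_in:.3f})  "
          f"E_out={np.mean(e_out):.3f} (theory {t_out:.3f})")
```

Both curves should approach $\sigma^2 = 1$ from opposite sides as $N$ grows, with their gap shrinking like $2\sigma^2(d+1)/N$.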