Two Perspectives on Learning: VC vs. Bias-Variance

Learning theory provides two distinct frameworks for understanding why models succeed or fail:

VC Analysis (The “Worst Case” View)

  • Goal: To provide a uniform bound on error that holds for any training set $\mathcal{D}$.
  • Loss Function: Typically uses 0-1 loss (classification).
  • Intuition: $E_{out} \le E_{in} + \Omega$, where $\Omega$ is a penalty for model complexity.

Bias-Variance Analysis (The “Average Case” View)

  • Goal: To decompose the average out-of-sample error across all possible training sets.
  • Loss Function: Uses squared error loss, as its differentiability allows for cleaner mathematical decomposition.
  • Applicability: Primarily used for real-valued target functions.

Mathematical Decomposition of Error

To quantify the tradeoff, we assume the existence of an average hypothesis $\bar{g}(\mathbf{x})$, which is what you would get if you trained on an infinite number of different datasets.

The Average Hypothesis

$\bar{g}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(\mathbf{x})\right]$, where $g^{(\mathcal{D})}$ denotes the hypothesis learned from a particular dataset $\mathcal{D}$.

The Three Components of $E_{out}$

The expected out-of-sample error at a point $\mathbf{x}$ can be broken down into three distinct parts:

  1. Bias: $\text{bias}(\mathbf{x}) = \left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)^2$. This is how far the “average” prediction is from the truth.
  2. Variance: $\text{var}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^2\right]$. This is the fluctuation of individual models around their average.
  3. Noise ($\sigma^2$): $\mathbb{E}\left[(y - f(\mathbf{x}))^2\right]$. This is the inherent randomness in the target distribution $P(y \mid \mathbf{x})$.
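The decomposition can be checked with a quick Monte Carlo experiment. This is a minimal sketch under assumed conditions: a toy target $f(x) = x^2$ on $[0, 1]$, noise $\sigma = 0.3$, and a deliberately crude model that predicts the mean of the training labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: target f(x) = x^2 on [0, 1], Gaussian noise sigma = 0.3,
# and a deliberately crude model h(x) = mean(y) trained on N = 5 points.
f = lambda x: x ** 2
sigma, N, trials = 0.3, 5, 20_000
x_grid = np.linspace(0, 1, 101)

# Each row is one hypothesis g^(D), trained on its own dataset D
preds = np.empty((trials, x_grid.size))
for t in range(trials):
    x = rng.uniform(0, 1, N)
    y = f(x) + rng.normal(0, sigma, N)
    preds[t] = y.mean()                      # constant hypothesis

g_bar = preds.mean(axis=0)                   # average hypothesis g-bar(x)
bias = np.mean((g_bar - f(x_grid)) ** 2)
var = np.mean((preds - g_bar) ** 2)

# Direct estimate of E_out against fresh noisy targets y = f(x) + eps
y_noisy = f(x_grid) + rng.normal(0, sigma, (trials, x_grid.size))
e_out = np.mean((preds - y_noisy) ** 2)

print(f"bias={bias:.3f}  var={var:.3f}  noise={sigma**2:.3f}")
print(f"bias+var+noise={bias + var + sigma**2:.3f} vs direct E_out={e_out:.3f}")
```

The two printed totals should agree closely: the directly measured out-of-sample error equals bias plus variance plus the noise floor.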

Intuition

Think of a dartboard. Low Bias/Low Var is a tight cluster at the bullseye. High Bias/Low Var is a tight cluster far from the bullseye. Low Bias/High Var is a loose cluster centered around the bullseye.

The Complexity Tradeoff

Model complexity directly impacts the balance between bias and variance:

  • Simple Models (Low Complexity):
      • Have High Bias (cannot represent complex target functions).
      • Have Low Variance (very stable; predictions don’t change much with different data).
  • Complex Models (High Complexity):
      • Have Low Bias (can fit almost any pattern).
      • Have High Variance (highly sensitive to noise; “behave wildly” after seeing specific data).

Example: Learning $f(x) = \sin(\pi x)$

Consider fitting two points from a sine wave:

  • Model 0 ($\mathcal{H}_0$: constant, $h(x) = b$): High bias (0.50) but low variance (0.25). Resulting $E_{out} = 0.75$.
  • Model 1 ($\mathcal{H}_1$: line, $h(x) = ax + b$): Lower bias (0.21) but massive variance (1.69). Resulting $E_{out} = 1.90$.
  • Insight: In this small-data scenario, the simpler model actually performs better because its variance is so much lower.
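These numbers can be reproduced with a short simulation, averaging over many two-point datasets (a sketch; 20,000 simulated datasets, uniform $x$ on $[-1, 1]$, no noise):

```python
import numpy as np

rng = np.random.default_rng(2)

# Setup from the example: target f(x) = sin(pi x), each dataset D is two
# points drawn uniformly from [-1, 1] with no noise.
f = lambda x: np.sin(np.pi * x)
trials = 20_000
x_grid = np.linspace(-1, 1, 201)

x1, x2 = rng.uniform(-1, 1, (2, trials))
y1, y2 = f(x1), f(x2)

# H0: the best constant through two points is their midpoint
g0 = np.broadcast_to(((y1 + y2) / 2)[:, None], (trials, x_grid.size))
# H1: the line through the two points
a = (y2 - y1) / (x2 - x1)
b = y1 - a * x1
g1 = a[:, None] * x_grid + b[:, None]

stats = {}
for name, g in (("H0", g0), ("H1", g1)):
    g_bar = g.mean(axis=0)                   # average hypothesis
    bias = np.mean((g_bar - f(x_grid)) ** 2)
    var = np.mean((g - g_bar) ** 2)
    stats[name] = (bias, var)
    print(f"{name}: bias={bias:.2f}  var={var:.2f}  E_out={bias + var:.2f}")
```

The output should land close to the quoted values: roughly bias 0.50 / var 0.25 for $\mathcal{H}_0$ and bias 0.21 / var 1.69 for $\mathcal{H}_1$.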

Learning Curves

The learning curve tracks how $E_{in}$ and $E_{out}$ change as the number of training points $N$ increases.

  • As $N \to \infty$: Both $E_{in}$ and $E_{out}$ converge toward the noise level $\sigma^2$.
  • Linear Regression Case:
    • $E_{in} = \sigma^2\left(1 - \frac{d+1}{N}\right)$ and $E_{out} = \sigma^2\left(1 + \frac{d+1}{N}\right)$, so the generalisation error is $E_{out} - E_{in} = \frac{2\sigma^2(d+1)}{N}$.
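These curves can be simulated directly. A sketch under assumed conditions: a noisy linear target with $d = 3$ Gaussian features plus an intercept, $\sigma^2 = 1$, and least-squares fitting (the closed-form expressions above hold for this model, with the $E_{out}$ one being a large-$N$ approximation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed setup: noisy linear target y = w.x + eps with d = 3 Gaussian
# features plus an intercept, noise variance sigma^2 = 1, least-squares fit.
d, sigma2, trials = 3, 1.0, 2_000
w_true = rng.normal(size=d + 1)

def avg_errors(N):
    """Average E_in and (estimated) E_out over many datasets of size N."""
    X_test = np.column_stack([np.ones(1000), rng.normal(size=(1000, d))])
    e_in = e_out = 0.0
    for _ in range(trials):
        X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])
        y = X @ w_true + rng.normal(0, np.sqrt(sigma2), N)
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        e_in += np.mean((X @ w_hat - y) ** 2)
        y_test = X_test @ w_true + rng.normal(0, np.sqrt(sigma2), 1000)
        e_out += np.mean((X_test @ w_hat - y_test) ** 2)
    return e_in / trials, e_out / trials

for N in (10, 30, 100):
    e_in, e_out = avg_errors(N)
    # theory: E_in = sigma^2 (1 - (d+1)/N); E_out ~ sigma^2 (1 + (d+1)/N) for large N
    print(f"N={N:3d}: E_in={e_in:.3f}  E_out={e_out:.3f}")
```

As $N$ grows, the printed $E_{in}$ rises toward $\sigma^2$ from below while $E_{out}$ falls toward it from above, squeezing the generalisation gap like $\frac{2\sigma^2(d+1)}{N}$.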