Two Perspectives on Learning: VC vs. Bias-Variance
Learning theory provides two distinct frameworks for understanding why models succeed or fail:
VC Analysis (The “Worst Case” View)
- Goal: To provide a uniform bound on error that holds for any training set.
- Loss Function: Typically uses 0-1 loss (classification).
- Intuition: $E_{\text{out}} \le E_{\text{in}} + \Omega$, where $\Omega$ is a penalty for model complexity.
Bias-Variance Analysis (The “Average Case” View)
- Goal: To decompose the average out-of-sample error across all possible training sets.
- Loss Function: Uses squared error loss, as its differentiability allows for a cleaner mathematical decomposition.
- Applicability: Primarily used for real-valued target functions.
Mathematical Decomposition of Error
To quantify the tradeoff, we assume the existence of an average hypothesis $\bar{g}(x)$, which is what you would get if you trained on an infinite number of different datasets.
The Average Hypothesis
$\bar{g}(x) = \mathbb{E}_{\mathcal{D}}\left[g^{(\mathcal{D})}(x)\right]$, where $g^{(\mathcal{D})}$ is the hypothesis learned from a particular dataset $\mathcal{D}$.
The Three Components of $E_{\text{out}}$
The expected out-of-sample error at a point $x$ can be broken down into three distinct parts:
- Bias: $\text{bias}(x) = \big(\bar{g}(x) - f(x)\big)^2$. This is how far the “average” prediction is from the truth.
- Variance: $\text{var}(x) = \mathbb{E}_{\mathcal{D}}\big[\big(g^{(\mathcal{D})}(x) - \bar{g}(x)\big)^2\big]$. This is the fluctuation of individual models around their average.
- Noise ($\sigma^2$): the inherent randomness in the target distribution $P(y \mid x)$.
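Why squared error splits this cleanly comes down to one step: add and subtract $\bar{g}(x)$ inside the square, and the cross term vanishes. A short derivation:

```latex
\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - f(x))^2\big]
  = \underbrace{\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\big]}_{\text{var}(x)}
  + \underbrace{\big(\bar{g}(x) - f(x)\big)^2}_{\text{bias}(x)}
```

The cross term $2\,\mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x) - \bar{g}(x)\big]\big(\bar{g}(x) - f(x)\big)$ is zero by the definition of $\bar{g}$. With noisy targets $y = f(x) + \epsilon$, the independent noise contributes the further additive term $\sigma^2$.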
Intuition
Think of a dartboard. Low Bias/Low Var is a tight cluster at the bullseye. High Bias/Low Var is a tight cluster far from the bullseye. Low Bias/High Var is a loose cluster centered around the bullseye. High Bias/High Var is a loose cluster far from the bullseye.
The Complexity Tradeoff
Model complexity directly impacts the balance between bias and variance:
- Simple Models (Low Complexity):
- Have High Bias (cannot represent complex target functions).
- Have Low Variance (very stable; predictions don’t change much with different data).
- Complex Models (High Complexity):
- Have Low Bias (can fit almost any pattern).
- Have High Variance (highly sensitive to noise; “behave wildly” after seeing specific data).
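This tradeoff can be observed numerically with a Monte Carlo sketch: repeatedly draw a small noisy dataset from a target, fit polynomials of increasing degree, and estimate bias and variance from the spread of fits. The sine target, noise level, and sample sizes below are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(np.pi * x)   # illustrative target function
N, runs, noise = 15, 1000, 0.2    # hypothetical experiment settings
x_test = np.linspace(-1, 1, 200)

bias_by_deg, var_by_deg = {}, {}
for deg in (0, 1, 3, 5):
    preds = np.empty((runs, x_test.size))
    for r in range(runs):
        # Fresh training set each run; fit a polynomial of the given degree.
        x = rng.uniform(-1, 1, N)
        y = f(x) + rng.normal(0, noise, N)
        preds[r] = np.polyval(np.polyfit(x, y, deg), x_test)
    g_bar = preds.mean(axis=0)                            # average hypothesis
    bias_by_deg[deg] = np.mean((g_bar - f(x_test)) ** 2)  # E_x[(g_bar - f)^2]
    var_by_deg[deg] = np.mean(preds.var(axis=0))          # E_x[Var_D(g)]
    print(f"degree {deg}: bias={bias_by_deg[deg]:.3f}  var={var_by_deg[deg]:.3f}")
```

As complexity grows, bias should fall while variance rises, matching the bullets above.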
Example: Learning with $f(x) = \sin(\pi x)$
Consider fitting two points from a sine wave:
- Model 0 ($\mathcal{H}_0$, constant $h(x) = b$): High bias (0.50) but low variance (0.25). Resulting $E_{\text{out}} \approx 0.75$.
- Model 1 ($\mathcal{H}_1$, line $h(x) = ax + b$): Lower bias (0.21) but massive variance (1.69). Resulting $E_{\text{out}} \approx 1.90$.
- Insight: In this small-data scenario, the simpler model actually performs better because its variance is so much lower.
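The two-point experiment is easy to replicate by simulation: draw many two-point datasets from the sine curve, fit both models, and average. This is a sketch assuming the target $\sin(\pi x)$ on $[-1, 1]$ with noiseless samples; the run count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)
runs = 10_000
x_test = np.linspace(-1, 1, 200)

# Each dataset: two points sampled uniformly from the sine curve (no noise).
x1, x2 = rng.uniform(-1, 1, (2, runs))
y1, y2 = f(x1), f(x2)

# H0: constant h(x) = b, the midpoint of the two y-values.
h0 = ((y1 + y2) / 2)[:, None] * np.ones_like(x_test)

# H1: the line through the two points.
a = (y2 - y1) / (x2 - x1)
b = y1 - a * x1
h1 = a[:, None] * x_test + b[:, None]

results = {}
for name, preds in [("H0", h0), ("H1", h1)]:
    g_bar = preds.mean(axis=0)
    bias = np.mean((g_bar - f(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    results[name] = (bias, var)
    print(f"{name}: bias={bias:.2f}  var={var:.2f}  bias+var={bias + var:.2f}")
```

The estimates should land near the numbers quoted above: roughly 0.50/0.25 for the constant model and a far larger variance for the line, so the “dumber” model wins on $E_{\text{out}}$.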
Learning Curves
A learning curve tracks how $E_{\text{in}}$ and $E_{\text{out}}$ change as the number of training points $N$ increases:
- As $N \to \infty$: Both $E_{\text{in}}$ and $E_{\text{out}}$ converge toward the noise level $\sigma^2$.
- Linear Regression Case:
- Expected errors: $\mathbb{E}[E_{\text{in}}] = \sigma^2\left(1 - \frac{d+1}{N}\right)$ and $\mathbb{E}[E_{\text{out}}] = \sigma^2\left(1 + \frac{d+1}{N}\right)$, where $d$ is the input dimension.
- Generalisation Error: $\mathbb{E}[E_{\text{out}}] - \mathbb{E}[E_{\text{in}}] = \frac{2\sigma^2(d+1)}{N}$.
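The linear-regression learning curve can be checked by simulation. The sketch below assumes a linear target $y = \mathbf{x}^\top \mathbf{w}^* + \epsilon$ with Gaussian noise, and measures out-of-sample error on the same inputs with fresh noise (a convention under which $\sigma^2(1 \pm \frac{d+1}{N})$ holds exactly); the dimension, noise level, and run count are choices made here.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, runs = 5, 1.0, 2000   # hypothetical dimension, noise level, trials

for N in (10, 20, 50, 100):
    e_in, e_out = [], []
    for _ in range(runs):
        w_star = rng.normal(size=d + 1)                       # true weights
        X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
        y = X @ w_star + rng.normal(0, sigma, N)
        w = np.linalg.lstsq(X, y, rcond=None)[0]              # least-squares fit
        e_in.append(np.mean((X @ w - y) ** 2))
        # Same inputs, fresh noise, for the out-of-sample estimate.
        y_new = X @ w_star + rng.normal(0, sigma, N)
        e_out.append(np.mean((X @ w - y_new) ** 2))
    t_in = sigma**2 * (1 - (d + 1) / N)
    t_out = sigma**2 * (1 + (d + 1) / N)
    print(f"N={N:4d}  E_in={np.mean(e_in):.3f} (theory {t_in:.3f})  "
          f"E_out={np.mean(e_out):.3f} (theory {t_out:.3f})")
```

Both curves should approach $\sigma^2 = 1$ from opposite sides as $N$ grows, with their gap shrinking like $2\sigma^2(d+1)/N$.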