Two Cures for Overfitting

Overfitting occurs when a model fits the noise in the data rather than the underlying signal. There are two primary ways to combat this:

  1. Regularisation: “Putting on the brakes” by adding a penalty to the error function to discourage overly complex models.
  2. Validation: “Checking the bottom line” by using a separate data set to verify how well the model generalises to unseen data.
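As a minimal sketch of idea 1, the snippet below uses ridge (L2) regularisation for least squares; the penalty weight lam and the synthetic data are illustrative assumptions, not part of the notes:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Minimise ||Xw - y||^2 + lam * ||w||^2: the lam * ||w||^2 term is
    # the "brake" penalising large (complex) weight vectors.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=50)

w_free = ridge_fit(X, y, lam=0.0)   # no brake: plain least squares
w_reg = ridge_fit(X, y, lam=10.0)   # braked: weights shrink toward zero
print(np.linalg.norm(w_reg) < np.linalg.norm(w_free))
```

Increasing lam trades a little extra in-sample error for a simpler hypothesis, which is exactly the regularisation trade-off described above.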

Note

While Regularisation estimates the “overfit penalty” to adjust the model, Validation directly estimates the total out-of-sample error E_out.

The Model Selection Problem

In practice, we often have multiple models (e.g., linear, quadratic, or models with different regularisation parameters). Our goal is to select the one that will yield the lowest out-of-sample error E_out.

Why not use E_in?

Selecting a model based on the lowest in-sample error is dangerous because E_in is “contaminated”. Since the algorithm already used that data to pick the best hypothesis, the error rate will be optimistically biased and likely lead to overfitting.

The Ideal (but infeasible) Solution:

The most accurate way to select a model is to use a fresh test set that has never been seen by the training algorithm.

  • Hoeffding Guarantee: for a single final hypothesis g evaluated on a fresh test set of size N_test, E_out(g) ≤ E_test(g) + O(1/√N_test).
  • The Catch: True test sets are usually unavailable during the development phase.

The Validation Mechanism

Validation bridges the gap by splitting the available data (of size N) into two sets:

  • Training set (D_train): Size N − K. Used to learn the hypothesis g⁻.
  • Validation set (D_val): Size K. Used to estimate E_out(g⁻).
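A minimal sketch of this split, with hypothetical data and K chosen by the caller:

```python
import numpy as np

def split_data(X, y, K, seed=0):
    # Randomly partition N points into a training set of size N - K
    # and a validation set of size K.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    val, train = idx[:K], idx[K:]
    return X[train], y[train], X[val], y[val]

X = np.arange(20).reshape(10, 2).astype(float)  # N = 10 toy points
y = np.arange(10).astype(float)
X_tr, y_tr, X_val, y_val = split_data(X, y, K=2)
print(len(X_tr), len(X_val))  # 8 2
```

Shuffling before splitting matters: if the data arrived in some systematic order, a non-random split would make D_val unrepresentative.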

Statistical Properties of E_val

The validation error is the average error over the validation set: E_val(g⁻) = (1/K) Σ_n e(g⁻(x_n), y_n), where the sum runs over the K validation points.

  • Mean: E_val(g⁻) is an unbiased estimate of E_out(g⁻), meaning its expected value is exactly the out-of-sample error: E[E_val(g⁻)] = E_out(g⁻).
  • Variance: For classification (0/1 error), the variance is bounded by 1/(4K). As K increases, the variance decreases, providing a more stable estimate.
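Both properties can be checked empirically. The sketch below simulates a fixed classifier whose true out-of-sample error is p = 0.3 (an assumed value) and measures E_val across many random validation sets of size K:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3     # true out-of-sample error E_out of a fixed hypothesis (assumed)
K = 100     # validation set size

# Each validation point contributes a 0/1 error that is 1 with probability p,
# so each row is one validation set and its mean is one draw of E_val.
trials = rng.random((10_000, K)) < p
E_val = trials.mean(axis=1)

print(abs(E_val.mean() - p) < 0.01)   # unbiased: mean of E_val matches E_out
print(E_val.var() <= 1 / (4 * K))     # variance within the 1/(4K) bound
```

The actual variance here is p(1 − p)/K = 0.0021, comfortably under the worst-case bound 1/(4K) = 0.0025, which is attained at p = 1/2.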

Model Selection Process (Step-by-Step)

To pick the best model using validation, follow these steps:

  1. Split the data: Divide your N total samples into D_train (size N − K) and D_val (size K).
  2. Train models: For each model H_m, run the learning algorithm on D_train to produce a hypothesis g_m⁻.
  3. Evaluate: Calculate the validation error E_val(g_m⁻) for each model.
  4. Select: Choose the model m* that has the minimum validation error.
  5. Re-train (Crucial): Take the selected model and train it on the entire dataset (all N points) to produce the final hypothesis g_{m*}.
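The five steps can be sketched end to end. The sinusoidal target, noise level, and candidate polynomial degrees below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 100, 20
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=N)

# Step 1: split into D_train (size N - K) and D_val (size K).
x_tr, y_tr = x[K:], y[K:]
x_val, y_val = x[:K], y[:K]

def fit(xs, ys, degree):
    return np.polyfit(xs, ys, degree)

def err(w, xs, ys):
    return np.mean((np.polyval(w, xs) - ys) ** 2)

# Steps 2-4: train each candidate model on D_train, evaluate its
# validation error on D_val, and keep the minimiser.
degrees = [1, 3, 9]
val_errors = {d: err(fit(x_tr, y_tr, d), x_val, y_val) for d in degrees}
best = min(val_errors, key=val_errors.get)

# Step 5: re-train the chosen model on all N points for the final hypothesis.
g_final = fit(x, y, best)
print(best, val_errors[best])
```

Note that only the model choice (the degree) carries over to step 5; the weights are re-learned from scratch on the full dataset.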

Intuition

We re-train on all N points because more data generally leads to a better model (E_out(g_{m*}) ≤ E_out(g_{m*}⁻)).

Choosing the Validation Set Size

Selecting K involves a fundamental trade-off:

  • Large K: Provides a very accurate estimate of E_out(g⁻), but leaves too few points for training, resulting in a poor model (high E_out(g⁻)).
  • Small K: Leaves plenty of data for training (making g⁻ similar to the final g), but the validation estimate becomes noisy and unreliable.
  • Rule of Thumb: A common practical choice is K = N/5 (20% of the data).

Cross-Validation

When data is scarce, we use cross-validation to maximise the utility of every data point.

Leave-One-Out (K = 1)

  • Every single point is used as a validation set once.
  • You train the model N times, each time on N − 1 points.
  • The cross-validation error is the average of these individual errors: E_cv = (1/N) Σ_{n=1}^{N} e_n.
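A minimal leave-one-out sketch for polynomial regression with squared error; the linear target and noise level are illustrative assumptions:

```python
import numpy as np

def loocv_error(x, y, degree):
    # Leave-one-out: train N times, each time on N - 1 points,
    # and average the single-point validation errors e_n.
    N = len(x)
    errors = []
    for n in range(N):
        mask = np.arange(N) != n           # hold out point n
        w = np.polyfit(x[mask], y[mask], degree)
        errors.append((np.polyval(w, x[n]) - y[n]) ** 2)
    return np.mean(errors)                  # E_cv = (1/N) * sum of e_n

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
y = 2 * x + 0.1 * rng.normal(size=30)

# On this nearly linear data, the simple model should cross-validate
# better than a needlessly complex one.
print(loocv_error(x, y, 1) < loocv_error(x, y, 8))
```

The cost is the N repeated training runs, which is what motivates the cheaper folded variant below.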

k-Fold Cross-Validation

  • Divide the data into k equal folds (e.g., 10 folds).
  • Each fold acts as the validation set once while the remaining k − 1 folds are used for training.
  • Heuristic: 10-fold cross-validation is generally preferred as it is less computationally expensive than Leave-One-Out but still provides a robust estimate.
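A short k-fold sketch, again using polynomial regression with squared error on assumed synthetic data:

```python
import numpy as np

def kfold_error(x, y, degree, folds=10):
    # Split the indices into `folds` roughly equal parts; each part
    # validates once while the remaining folds are used for training.
    idx = np.arange(len(x))
    errors = []
    for part in np.array_split(idx, folds):
        train = np.setdiff1d(idx, part)
        w = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(w, x[part]) - y[part]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)

# The cubic captures the sinusoid far better than a line,
# and 10-fold CV detects this with only 10 training runs.
print(kfold_error(x, y, 3) < kfold_error(x, y, 1))
```

With 10 folds each model is trained 10 times instead of N times, which is why 10-fold is the usual default when leave-one-out is too expensive.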