Two Cures for Overfitting

Overfitting occurs when a model fits the noise in the data rather than the underlying signal. There are two primary ways to combat this:

  1. Regularisation: “Putting on the brakes” by adding a penalty to the error function to discourage overly complex models.
  2. Validation: “Checking the bottom line” by using a separate data set to verify how well the model generalises to unseen data.
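As a minimal sketch of idea 1, the snippet below uses ridge (L2) regularisation for least squares; the penalty weight lam and the synthetic data are illustrative assumptions, not part of the notes:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Minimise ||Xw - y||^2 + lam * ||w||^2: the lam * ||w||^2 term is
    # the "brake" penalising large (complex) weight vectors.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=50)

w_free = ridge_fit(X, y, lam=0.0)   # no brake: plain least squares
w_reg = ridge_fit(X, y, lam=10.0)   # braked: weights shrink toward zero
print(np.linalg.norm(w_reg) < np.linalg.norm(w_free))
```

Increasing lam trades a little extra in-sample error for a simpler hypothesis, which is exactly the regularisation trade-off described above.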

Note

While Regularisation estimates the “overfit penalty” to adjust the model, Validation directly estimates the total out-of-sample error E_out.

The Model Selection Problem

In practice, we often have multiple models (e.g., linear, quadratic, or models with different regularisation parameters). Our goal is to select the one that will yield the lowest out-of-sample error E_out.

Why not use E_in?

Selecting a model based on the lowest in-sample error is dangerous because E_in is “contaminated”. Since the algorithm already used that data to pick the best hypothesis, the error rate will be optimistically biased and likely lead to overfitting.

The Ideal (but infeasible) Solution:

The most accurate way to select a model is to use a fresh test set that has never been seen by the training algorithm.

  • Hoeffding Guarantee: for a single final hypothesis g evaluated on a fresh test set of size N_test, E_out(g) ≤ E_test(g) + O(1/√N_test).
  • The Catch: True test sets are usually unavailable during the development phase.

The Validation Mechanism

Validation bridges the gap by splitting the available data (of size N) into two sets:

  • Training set (D_train): Size N − K. Used to learn the hypothesis g⁻.
  • Validation set (D_val): Size K. Used to estimate E_out(g⁻).
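A minimal sketch of this split, with hypothetical data and K chosen by the caller:

```python
import numpy as np

def split_data(X, y, K, seed=0):
    # Randomly partition N points into a training set of size N - K
    # and a validation set of size K.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    val, train = idx[:K], idx[K:]
    return X[train], y[train], X[val], y[val]

X = np.arange(20).reshape(10, 2).astype(float)  # N = 10 toy points
y = np.arange(10).astype(float)
X_tr, y_tr, X_val, y_val = split_data(X, y, K=2)
print(len(X_tr), len(X_val))  # 8 2
```

Shuffling before splitting matters: if the data arrived in some systematic order, a non-random split would make D_val unrepresentative.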

Statistical Properties of E_val

The validation error is the average error over the validation set: E_val(g⁻) = (1/K) Σ_n e(g⁻(x_n), y_n), where the sum runs over the K validation points.

  • Mean: E_val(g⁻) is an unbiased estimate of E_out(g⁻), meaning its expected value is exactly the out-of-sample error: E[E_val(g⁻)] = E_out(g⁻).
  • Variance: For classification (0/1 error), the variance is bounded by 1/(4K). As K increases, the variance decreases, providing a more stable estimate.
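Both properties can be checked empirically. The sketch below simulates a fixed classifier whose true out-of-sample error is p = 0.3 (an assumed value) and measures E_val across many random validation sets of size K:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3     # true out-of-sample error E_out of a fixed hypothesis (assumed)
K = 100     # validation set size

# Each validation point contributes a 0/1 error that is 1 with probability p,
# so each row is one validation set and its mean is one draw of E_val.
trials = rng.random((10_000, K)) < p
E_val = trials.mean(axis=1)

print(abs(E_val.mean() - p) < 0.01)   # unbiased: mean of E_val matches E_out
print(E_val.var() <= 1 / (4 * K))     # variance within the 1/(4K) bound
```

The actual variance here is p(1 − p)/K = 0.0021, comfortably under the worst-case bound 1/(4K) = 0.0025, which is attained at p = 1/2.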

Model Selection Process (Step-by-Step)

To pick the best model using validation, follow these steps:

  1. Split the data: Divide your N total samples into D_train (size N − K) and D_val (size K).
  2. Train models: For each model H_m, run the learning algorithm on D_train to produce a hypothesis g_m⁻.
  3. Evaluate: Calculate the validation error E_val(g_m⁻) for each model.
  4. Select: Choose the model m* that has the minimum validation error.
  5. Re-train (Crucial): Take the selected model and train it on the entire dataset (all N points) to produce the final hypothesis g_{m*}.
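The five steps can be sketched end to end. The sinusoidal target, noise level, and candidate polynomial degrees below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 100, 20
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=N)

# Step 1: split into D_train (size N - K) and D_val (size K).
x_tr, y_tr = x[K:], y[K:]
x_val, y_val = x[:K], y[:K]

def fit(xs, ys, degree):
    return np.polyfit(xs, ys, degree)

def err(w, xs, ys):
    return np.mean((np.polyval(w, xs) - ys) ** 2)

# Steps 2-4: train each candidate model on D_train, evaluate its
# validation error on D_val, and keep the minimiser.
degrees = [1, 3, 9]
val_errors = {d: err(fit(x_tr, y_tr, d), x_val, y_val) for d in degrees}
best = min(val_errors, key=val_errors.get)

# Step 5: re-train the chosen model on all N points for the final hypothesis.
g_final = fit(x, y, best)
print(best, val_errors[best])
```

Note that only the model choice (the degree) carries over to step 5; the weights are re-learned from scratch on the full dataset.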

Intuition

We re-train on all N points because more data generally leads to a better model (E_out(g_{m*}) ≤ E_out(g_{m*}⁻)).

Choosing the Validation Set Size

Selecting K involves a fundamental trade-off:

  • Large K: Provides a very accurate estimate of E_out(g⁻), but leaves too few points for training, resulting in a poor model (high E_out(g⁻)).
  • Small K: Leaves plenty of data for training (making g⁻ similar to the final g), but the validation estimate becomes noisy and unreliable.
  • Rule of Thumb: A common practical choice is K = N/5 (20% of the data).

Cross-Validation

When data is scarce, we use cross-validation to maximise the utility of every data point.

Leave-One-Out (K = 1)

  • Every single point is used as a validation set once.
  • You train the model N times, each time on N − 1 points.
  • The cross-validation error is the average of these individual errors: E_cv = (1/N) Σ_{n=1}^{N} e_n.
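A minimal leave-one-out sketch for polynomial regression with squared error; the linear target and noise level are illustrative assumptions:

```python
import numpy as np

def loocv_error(x, y, degree):
    # Leave-one-out: train N times, each time on N - 1 points,
    # and average the single-point validation errors e_n.
    N = len(x)
    errors = []
    for n in range(N):
        mask = np.arange(N) != n           # hold out point n
        w = np.polyfit(x[mask], y[mask], degree)
        errors.append((np.polyval(w, x[n]) - y[n]) ** 2)
    return np.mean(errors)                  # E_cv = (1/N) * sum of e_n

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
y = 2 * x + 0.1 * rng.normal(size=30)

# On this nearly linear data, the simple model should cross-validate
# better than a needlessly complex one.
print(loocv_error(x, y, 1) < loocv_error(x, y, 8))
```

The cost is the N repeated training runs, which is what motivates the cheaper folded variant below.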

k-Fold Cross-Validation

  • Divide the data into k equal folds (e.g., 10 folds).
  • Each fold acts as the validation set once while the remaining k − 1 folds are used for training.
  • Heuristic: 10-fold cross-validation is generally preferred as it is less computationally expensive than Leave-One-Out but still provides a robust estimate.
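A short k-fold sketch, again using polynomial regression with squared error on assumed synthetic data:

```python
import numpy as np

def kfold_error(x, y, degree, folds=10):
    # Split the indices into `folds` roughly equal parts; each part
    # validates once while the remaining folds are used for training.
    idx = np.arange(len(x))
    errors = []
    for part in np.array_split(idx, folds):
        train = np.setdiff1d(idx, part)
        w = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(w, x[part]) - y[part]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)

# The cubic captures the sinusoid far better than a line,
# and 10-fold CV detects this with only 10 training runs.
print(kfold_error(x, y, 3) < kfold_error(x, y, 1))
```

With 10 folds each model is trained 10 times instead of N times, which is why 10-fold is the usual default when leave-one-out is too expensive.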