Overfitting vs Underfitting

In machine learning, we distinguish between performance on training data and performance on unseen data.

  • Overfitting: Occurs when we switch to a model with excessive power (e.g., moving from a low-order to a high-order polynomial). The model achieves a very low in-sample error E_in by “memorising” the specific data points, but the out-of-sample error E_out explodes because the model has captured noise instead of the true signal.
  • Underfitting: Occurs when the model is too restrictive (e.g., using a linear fit for a quadratic trend). Here, both E_in and E_out remain high because the model lacks the “power” to learn the basic pattern.

Insight

A more complex model is not always better. Even if the true target is complex, a simpler “restricted” model often generalises better when data is limited.
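The trade-off above can be sketched numerically. The snippet below fits a low-degree and a very high-degree polynomial to the same small noisy sample and compares training and test error; the quadratic target, noise level, and degrees are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Assumed true signal: a simple quadratic.
    return 1.0 - 2.0 * x + 3.0 * x**2

x_train = rng.uniform(-1, 1, 10)
y_train = target(x_train) + rng.normal(0, 0.3, x_train.size)
x_test = rng.uniform(-1, 1, 200)
y_test = target(x_test) + rng.normal(0, 0.3, x_test.size)

def fit_and_errors(degree):
    # Least-squares polynomial fit; return (E_in, E_out) as mean squared errors.
    coeffs = np.polyfit(x_train, y_train, degree)
    e_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    e_out = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return e_in, e_out

e_in_simple, e_out_simple = fit_and_errors(2)    # matches the target's order
e_in_complex, e_out_complex = fit_and_errors(9)  # excess power: memorises noise

print(f"degree 2: E_in={e_in_simple:.4f}  E_out={e_out_simple:.4f}")
print(f"degree 9: E_in={e_in_complex:.4f}  E_out={e_out_complex:.4f}")
```

With only ten points, the degree-9 fit passes almost exactly through every training point, so its E_in collapses while its E_out blows up.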

Linear Regression: Mathematical Foundations

Ordinary Least Squares

The goal is to find the parameter vector w that minimises the squared difference between predictions and actual targets.

The Normal Equation:

The solution for the optimal weights is:

w* = (ΦᵀΦ)⁻¹ Φᵀ t

  • Φ (Design Matrix): Contains the basis functions evaluated at each data point.
  • Moore-Penrose Pseudo-inverse: The term Φ⁺ = (ΦᵀΦ)⁻¹ Φᵀ is often computed using Singular Value Decomposition (SVD) to ensure numerical stability.
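As a minimal sketch, the normal equation and the SVD-based pseudo-inverse can be compared directly in NumPy (the quadratic basis and noise level here are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed polynomial basis: columns are 1, x, x^2 evaluated at each point.
x = rng.uniform(-1, 1, 50)
Phi = np.column_stack([x**0, x**1, x**2])   # design matrix, shape (N, 3)
t = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.1, x.size)

# Normal equation: w = (Phi^T Phi)^{-1} Phi^T t.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# SVD-based pseudo-inverse: numerically safer when Phi^T Phi is ill-conditioned.
w_pinv = np.linalg.pinv(Phi) @ t

print(w_normal)
print(w_pinv)
```

Both routes recover (approximately) the same weights here; `pinv` is preferred in practice because it handles near-singular design matrices gracefully.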

Maximum Likelihood Estimation (MLE)

From a probabilistic view, we assume the target t is the model prediction plus some Gaussian noise ε ~ N(0, σ²).

  • Assumption: The likelihood of the data follows a Gaussian distribution.
  • Result: Maximising the likelihood leads to exactly the same solution as OLS regression.
  • Drawback: MLE methods do not inherently prevent overfitting; they will fit the noise if the model is complex enough.
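To see why MLE under Gaussian noise coincides with least squares, note that the negative log-likelihood is just a scaled and shifted sum of squared errors, so the OLS weights minimise it. A small numerical check, with an assumed linear model and noise scale:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed data: linear model plus Gaussian noise, matching the MLE assumption.
x = rng.uniform(-1, 1, 100)
Phi = np.column_stack([np.ones_like(x), x])
t = 0.5 + 2.0 * x + rng.normal(0, 0.2, x.size)

def neg_log_likelihood(w, sigma=0.2):
    # NLL of the Gaussian model: 0.5 * SSE / sigma^2 plus a constant in w.
    r = t - Phi @ w
    return 0.5 * np.sum(r**2) / sigma**2 + t.size * np.log(sigma * np.sqrt(2 * np.pi))

# OLS solution via least squares.
w_ols = np.linalg.lstsq(Phi, t, rcond=None)[0]

# The NLL at the OLS weights is lower than at any perturbed weights.
nll_ols = neg_log_likelihood(w_ols)
nll_other = neg_log_likelihood(w_ols + np.array([0.1, -0.1]))
print(nll_ols, nll_other)
```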

The Four Causes of Overfitting

Overfitting is driven by the interaction between data quality and model complexity.

  1. Data Size (N): As the number of observations N decreases, the model becomes more uncertain and prone to fitting random alignments in the small sample.
  2. Stochastic Noise: High levels of random “bumps” in the data mislead the model into fitting errors.
  3. Deterministic Noise: If the target function is much more complex than our model, the unlearnable part of the target acts like noise, causing the model to struggle or over-focus on local variations.
  4. Excessive Model Power: Using a model with a VC dimension that is too high relative to the amount of data allows the model to “drive too fast”, leading to accidents (overfitting).

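The role of data size (cause 1) can be illustrated by fixing a high-capacity model and varying N: the gap between out-of-sample and in-sample error shrinks as N grows. The target, noise level, and degree below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def target(x):
    # Assumed true signal for the demonstration.
    return np.sin(np.pi * x)

def generalisation_gap(n, degree=9):
    # Fit a fixed high-capacity model with n training points and
    # measure the gap E_out - E_in (mean squared errors).
    x_tr = rng.uniform(-1, 1, n)
    y_tr = target(x_tr) + rng.normal(0, 0.2, n)
    x_te = rng.uniform(-1, 1, 1000)
    y_te = target(x_te) + rng.normal(0, 0.2, 1000)
    c = np.polyfit(x_tr, y_tr, degree)
    e_in = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
    e_out = np.mean((np.polyval(c, x_te) - y_te) ** 2)
    return e_out - e_in

gap_small = generalisation_gap(12)   # barely more points than parameters
gap_large = generalisation_gap(500)  # plenty of data for the same model

print(gap_small, gap_large)
```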
Solution: The Bayesian Approach and Regularisation

To combat overfitting, we must either improve the data or restrict the model.

1. The Bayesian Perspective (MAP)

Instead of finding a single point estimate for parameters (like MLE), the Bayesian approach infers a distribution over the parameters.

  • Priors: We define a distribution for the parameters before seeing data. This acts as a regulariser by “putting the brakes” on extreme parameter values.
  • Posterior: Quantifies the model’s certainty after observing data. As the number of observations N increases, model uncertainty reduces.
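Under a zero-mean Gaussian prior on the weights, the MAP estimate works out to ridge regression: w_MAP = (ΦᵀΦ + λI)⁻¹ Φᵀ t, where λ reflects the prior precision relative to the noise. A sketch comparing the unregularised (MLE) and MAP weights on an overly flexible basis (all values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed setup: a degree-9 polynomial basis with only 15 noisy points.
x = rng.uniform(-1, 1, 15)
Phi = np.column_stack([x**k for k in range(10)])
t = np.sin(np.pi * x) + rng.normal(0, 0.2, x.size)

def map_weights(lam):
    # MAP with Gaussian prior = ridge: w = (Phi^T Phi + lam I)^{-1} Phi^T t.
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)

w_mle = map_weights(0.0)  # no prior: plain MLE / OLS
w_map = map_weights(1.0)  # prior “puts the brakes” on extreme weights

print(np.linalg.norm(w_mle), np.linalg.norm(w_map))
```

The prior shrinks the weight vector dramatically, which is exactly the “brakes on extreme parameter values” described above.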

2. General Cures

  • Regularisation: Promote lower-order models or penalise complexity.
  • Cross-Validation: A “bottom line” check to evaluate how well the model generalises to unseen subsets of data.
  • Data Cleaning: Reducing stochastic noise (though often difficult).
  • Increase N: Gathering more training samples is the most direct way to reduce the gap between E_in and E_out.
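Cross-validation as a “bottom line” check can be sketched with a simple k-fold loop; the candidate degrees, target, and noise level here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed noisy sample from a smooth target.
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, x.size)

def cv_error(degree, k=5):
    # Average validation MSE over k folds for a polynomial of the given degree.
    idx = np.arange(x.size)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        c = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(c, x[fold]) - y[fold]) ** 2))
    return np.mean(errs)

scores = {d: cv_error(d) for d in (1, 3, 9)}
best = min(scores, key=scores.get)
print(scores, best)
```

The underfitting linear model scores poorly on the held-out folds, so cross-validation steers the choice toward a model with enough (but not excessive) power.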