Overfitting vs Underfitting
In machine learning, we distinguish between performance on training data (the in-sample error, $E_{in}$) and performance on unseen data (the out-of-sample error, $E_{out}$).
- Overfitting: Occurs when we switch to a model with excessive power (e.g., moving from a low-order to a much higher-order polynomial). The model achieves a near-zero $E_{in}$ by “memorising” the specific data points, but $E_{out}$ explodes because the model has captured noise instead of the true signal.
- Underfitting: Occurs when the model is too restrictive (e.g., using a linear fit for a quadratic trend). Here, both $E_{in}$ and $E_{out}$ remain high because the model lacks the “power” to learn the basic pattern.
Insight
A more complex model is not always better. Even if the true target is complex, a simpler “restricted” model often generalises better when data is limited.
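The trade-off above can be made concrete with a small numerical sketch (NumPy; the quadratic target, noise level, sample size, and polynomial degrees are illustrative choices, not taken from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# A quadratic target observed with Gaussian noise, only 12 samples
x = np.sort(rng.uniform(-1, 1, 12))
t = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.3, x.size)

# Dense noiseless grid to estimate the out-of-sample error E_out
x_test = np.linspace(-1, 1, 200)
t_test = 1.0 - 2.0 * x_test + 3.0 * x_test**2

errors = {}
for degree in (1, 2, 9):          # underfit, well-matched, overfit
    w = np.polyfit(x, t, degree)  # least-squares polynomial fit
    e_in = np.mean((np.polyval(w, x) - t) ** 2)
    e_out = np.mean((np.polyval(w, x_test) - t_test) ** 2)
    errors[degree] = (e_in, e_out)
    print(f"degree {degree}: E_in={e_in:.3f}  E_out={e_out:.3f}")
```

Raising the degree can only lower $E_{in}$ (the larger model contains the smaller one), but that says nothing about $E_{out}$, which is what we actually care about.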
Linear Regression: Mathematical Foundations
Ordinary Least Squares
The goal is to find the parameter vector $\mathbf{w}$ that minimises the squared difference between predictions and actual targets.
The Normal Equation:
The solution for the optimal weights is:
$$\mathbf{w}^{*} = (\Phi^{\top}\Phi)^{-1}\Phi^{\top}\mathbf{t}$$
- $\Phi$ (Design Matrix): Contains the basis functions $\phi_j(x_n)$ evaluated at each data point.
- Moore-Penrose Pseudo-inverse: The term $\Phi^{\dagger} = (\Phi^{\top}\Phi)^{-1}\Phi^{\top}$ is often computed using Singular Value Decomposition (SVD) to ensure numerical stability.
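The two routes to $\mathbf{w}^{*}$ can be checked against each other numerically; a minimal sketch (the synthetic line $t = 2 + 3x$ and noise level are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 20)
t = 2.0 + 3.0 * x + rng.normal(0, 0.1, x.size)

# Design matrix: each column is a basis function evaluated at every x_n
Phi = np.vander(x, N=2, increasing=True)  # columns [1, x]

# Normal equation: w* = (Phi^T Phi)^{-1} Phi^T t
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Moore-Penrose pseudo-inverse, computed internally via SVD
w_pinv = np.linalg.pinv(Phi) @ t

print("normal equation:", w_normal)
print("pseudo-inverse: ", w_pinv)
```

Both give essentially the same weights here; the SVD route is preferred in practice because forming $\Phi^{\top}\Phi$ explicitly squares the condition number.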
Maximum Likelihood Estimation (MLE)
From a probabilistic view, we assume the target is the model prediction plus some Gaussian noise: $t = y(x, \mathbf{w}) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^{2})$.
- Assumption: The likelihood of the data follows a Gaussian distribution.
- Result: Maximizing the likelihood leads to the exact same solution as OLS regression.
- Drawback: MLE methods do not inherently prevent overfitting; they will fit the noise if the model is complex enough.
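The MLE/OLS equivalence can be verified directly: under the Gaussian noise assumption, the negative log-likelihood is a constant plus the sum of squared errors over $2\sigma^{2}$, so the OLS weights minimise it. A sketch (data and $\sigma^{2}$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 30)
t = 1.0 + 2.0 * x + rng.normal(0, 0.2, x.size)
Phi = np.column_stack([np.ones_like(x), x])

def neg_log_likelihood(w, sigma2=0.04):
    """Gaussian NLL of the data under weights w and noise variance sigma2."""
    resid = t - Phi @ w
    return 0.5 * t.size * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

# OLS solution via least squares
w_ols = np.linalg.lstsq(Phi, t, rcond=None)[0]

# Perturbing the OLS weights can only increase the NLL
for delta in rng.normal(0, 0.1, size=(5, 2)):
    assert neg_log_likelihood(w_ols) <= neg_log_likelihood(w_ols + delta)
print("OLS weights also maximise the Gaussian likelihood:", w_ols)
```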
The Four Causes of Overfitting
Overfitting is driven by the interaction between data quality and model complexity.
- Data Size ($N$): As the number of observations decreases, the model becomes more uncertain and prone to fitting random alignments in the small sample.
- Stochastic Noise ($\sigma^{2}$): High levels of random “bumps” in the data mislead the model into fitting the errors themselves.
- Deterministic Noise: If the target function is much more complex than our model, the part of the target the model cannot capture acts like noise, causing the model to struggle or over-focus on local variations.
- Excessive Model Power: Using a model with a VC dimension that is high relative to the amount of data allows the model to “drive too fast”, leading to accidents (overfitting).
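The data-size effect can be seen by holding the model fixed and varying $N$; a sketch (the $\sin(\pi x)$ target, degree-9 model, noise level, and sample sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def e_out_for(n, degree=9, noise=0.5):
    """Fit a degree-9 polynomial to n noisy samples of sin(pi*x); return test MSE."""
    x = np.sort(rng.uniform(-1, 1, n))
    t = np.sin(np.pi * x) + rng.normal(0, noise, n)
    w = np.polyfit(x, t, degree)
    x_test = np.linspace(-0.9, 0.9, 200)  # stay inside the sampled range
    return np.mean((np.polyval(w, x_test) - np.sin(np.pi * x_test)) ** 2)

# Average over repeated draws so the comparison is not a fluke
e_small = np.mean([e_out_for(12) for _ in range(10)])
e_large = np.mean([e_out_for(500) for _ in range(10)])
print(f"mean E_out, N=12:  {e_small:.3f}")
print(f"mean E_out, N=500: {e_large:.3f}")
```

The same 10-parameter model that thrashes on 12 points generalises well on 500: excessive power is only excessive relative to $N$.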
Solution: The Bayesian Approach and Regularisation
To combat overfitting, we must either improve the data or restrict the model.
1. The Bayesian Perspective (MAP)
Instead of finding a single point estimate for parameters (like MLE), the Bayesian approach infers a distribution over the parameters.
- Priors: We define a distribution $p(\mathbf{w})$ for the parameters before seeing data. This acts as a regulariser by “putting the brakes” on extreme parameter values.
- Posterior: Quantifies the model’s certainty after observing data. As the number of observations ($N$) increases, model uncertainty reduces.
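With a zero-mean Gaussian prior on the weights, the MAP estimate has a closed form: ridge regression, where the prior shows up as a $\lambda I$ term in the normal equation. A sketch (the $\sin(\pi x)$ target, degree-8 basis, and $\lambda$ value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1, 1, 10))
t = np.sin(np.pi * x) + rng.normal(0, 0.2, x.size)

Phi = np.vander(x, N=9, increasing=True)  # degree-8 polynomial basis

# MLE / OLS: no prior, so weights are free to blow up on just 10 points
w_mle = np.linalg.pinv(Phi) @ t

# MAP with a zero-mean Gaussian prior on w: ridge regression.
# lam = sigma^2 / prior_variance acts as the "brake" on the weights.
lam = 1e-2
w_map = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

print("||w|| MLE:", np.linalg.norm(w_mle))
print("||w|| MAP:", np.linalg.norm(w_map))
```

The prior strictly shrinks the weight norm, which is exactly the “brakes” described above.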
2. General Cures
- Regularisation: Promote lower-order models or penalize complexity.
- Cross-Validation: A “bottom line” check to evaluate how well the model generalizes to unseen subsets of data.
- Data Cleaning: Reducing stochastic noise (though often difficult).
- Increase $N$: Gathering more training samples is the most direct way to reduce the gap between $E_{in}$ and $E_{out}$.
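The cross-validation “bottom line” check can be sketched with a simple k-fold loop (NumPy only; the quadratic target, fold count, and candidate degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 40)
t = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.3, x.size)

def cv_error(degree, k=5):
    """Mean validation MSE of a polynomial of the given degree over k folds."""
    idx = rng.permutation(x.size)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                  # hold the fold out
        w = np.polyfit(x[train], t[train], degree)       # fit on the rest
        errs.append(np.mean((np.polyval(w, x[fold]) - t[fold]) ** 2))
    return float(np.mean(errs))

for d in (1, 2, 6):
    print(f"degree {d}: CV error {cv_error(d):.3f}")
```

Because every point is scored only by a model that never saw it, the CV error estimates $E_{out}$ and exposes both the underfit and overfit candidates.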