Overfitting vs Underfitting
In machine learning, we distinguish between performance on training data (the in-sample error, $E_{in}$) and performance on unseen data (the out-of-sample error, $E_{out}$).
- Overfitting: Occurs when we switch to a model with excessive power (e.g., moving from a low-order to a much higher-order polynomial). The model achieves a near-zero $E_{in}$ by “memorising” the specific data points, but $E_{out}$ explodes because the model has captured noise instead of the true signal.
- Underfitting: Occurs when the model is too restrictive (e.g., using a linear fit for a quadratic trend). Here, both $E_{in}$ and $E_{out}$ remain high because the model lacks the “power” to learn the basic pattern.
Insight
A more complex model is not always better. Even if the true target is complex, a simpler “restricted” model often generalises better when data is limited.
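The trade-off above can be made concrete with a small numerical sketch (NumPy; the quadratic target, noise level, sample size, and polynomial degrees are illustrative choices, not taken from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# A quadratic target observed with Gaussian noise, only 12 samples
x = np.sort(rng.uniform(-1, 1, 12))
t = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.3, x.size)

# Dense noiseless grid to estimate the out-of-sample error E_out
x_test = np.linspace(-1, 1, 200)
t_test = 1.0 - 2.0 * x_test + 3.0 * x_test**2

errors = {}
for degree in (1, 2, 9):          # underfit, well-matched, overfit
    w = np.polyfit(x, t, degree)  # least-squares polynomial fit
    e_in = np.mean((np.polyval(w, x) - t) ** 2)
    e_out = np.mean((np.polyval(w, x_test) - t_test) ** 2)
    errors[degree] = (e_in, e_out)
    print(f"degree {degree}: E_in={e_in:.3f}  E_out={e_out:.3f}")
```

Raising the degree can only lower $E_{in}$ (the larger model contains the smaller one), but that says nothing about $E_{out}$, which is what we actually care about.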
Linear Regression: Mathematical Foundations
Ordinary Least Squares
The goal is to find the parameter vector $\mathbf{w}$ that minimises the squared difference between predictions and actual targets.
The Normal Equation:
The solution for the optimal weights is:
$$\mathbf{w}^{*} = (\Phi^{\top}\Phi)^{-1}\Phi^{\top}\mathbf{t}$$
- $\Phi$ (Design Matrix): Contains the basis functions $\phi_j(x_n)$ evaluated at each data point.
- Moore-Penrose Pseudo-inverse: The term $\Phi^{\dagger} = (\Phi^{\top}\Phi)^{-1}\Phi^{\top}$ is often computed using Singular Value Decomposition (SVD) to ensure numerical stability.
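The two routes to $\mathbf{w}^{*}$ can be checked against each other numerically; a minimal sketch (the synthetic line $t = 2 + 3x$ and noise level are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 20)
t = 2.0 + 3.0 * x + rng.normal(0, 0.1, x.size)

# Design matrix: each column is a basis function evaluated at every x_n
Phi = np.vander(x, N=2, increasing=True)  # columns [1, x]

# Normal equation: w* = (Phi^T Phi)^{-1} Phi^T t
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Moore-Penrose pseudo-inverse, computed internally via SVD
w_pinv = np.linalg.pinv(Phi) @ t

print("normal equation:", w_normal)
print("pseudo-inverse: ", w_pinv)
```

Both give essentially the same weights here; the SVD route is preferred in practice because forming $\Phi^{\top}\Phi$ explicitly squares the condition number.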
Maximum Likelihood Estimation (MLE)
From a probabilistic view, we assume the target is the model prediction plus some Gaussian noise: $t = y(x, \mathbf{w}) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^{2})$.
- Assumption: The likelihood of the data follows a Gaussian distribution.
- Result: Maximizing the likelihood leads to the exact same solution as OLS regression.
- Drawback: MLE methods do not inherently prevent overfitting; they will fit the noise if the model is complex enough.
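The MLE/OLS equivalence can be verified directly: under the Gaussian noise assumption, the negative log-likelihood is a constant plus the sum of squared errors over $2\sigma^{2}$, so the OLS weights minimise it. A sketch (data and $\sigma^{2}$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 30)
t = 1.0 + 2.0 * x + rng.normal(0, 0.2, x.size)
Phi = np.column_stack([np.ones_like(x), x])

def neg_log_likelihood(w, sigma2=0.04):
    """Gaussian NLL of the data under weights w and noise variance sigma2."""
    resid = t - Phi @ w
    return 0.5 * t.size * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

# OLS solution via least squares
w_ols = np.linalg.lstsq(Phi, t, rcond=None)[0]

# Perturbing the OLS weights can only increase the NLL
for delta in rng.normal(0, 0.1, size=(5, 2)):
    assert neg_log_likelihood(w_ols) <= neg_log_likelihood(w_ols + delta)
print("OLS weights also maximise the Gaussian likelihood:", w_ols)
```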
The Four Causes of Overfitting
Overfitting is driven by the interaction between data quality and model complexity.
- Data Size ($N$): As the number of observations decreases, the model becomes more uncertain and prone to fitting random alignments in the small sample.
- Stochastic Noise ($\sigma^{2}$): High levels of random “bumps” in the data mislead the model into fitting the errors themselves.
- Deterministic Noise: If the target function is much more complex than our model, the part of the target the model cannot capture acts like noise, causing the model to struggle or over-focus on local variations.
- Excessive Model Power: Using a model with a VC dimension that is high relative to the amount of data allows the model to “drive too fast”, leading to accidents (overfitting).
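The data-size effect can be seen by holding the model fixed and varying $N$; a sketch (the $\sin(\pi x)$ target, degree-9 model, noise level, and sample sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def e_out_for(n, degree=9, noise=0.5):
    """Fit a degree-9 polynomial to n noisy samples of sin(pi*x); return test MSE."""
    x = np.sort(rng.uniform(-1, 1, n))
    t = np.sin(np.pi * x) + rng.normal(0, noise, n)
    w = np.polyfit(x, t, degree)
    x_test = np.linspace(-0.9, 0.9, 200)  # stay inside the sampled range
    return np.mean((np.polyval(w, x_test) - np.sin(np.pi * x_test)) ** 2)

# Average over repeated draws so the comparison is not a fluke
e_small = np.mean([e_out_for(12) for _ in range(10)])
e_large = np.mean([e_out_for(500) for _ in range(10)])
print(f"mean E_out, N=12:  {e_small:.3f}")
print(f"mean E_out, N=500: {e_large:.3f}")
```

The same 10-parameter model that thrashes on 12 points generalises well on 500: excessive power is only excessive relative to $N$.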
Solution: The Bayesian Approach and Regularisation
To combat overfitting, we must either improve the data or restrict the model.
1. The Bayesian Perspective (MAP)
Instead of finding a single point estimate for parameters (like MLE), the Bayesian approach infers a distribution over the parameters.
- Priors: We define a distribution $p(\mathbf{w})$ for the parameters before seeing data. This acts as a regulariser by “putting the brakes” on extreme parameter values.
- Posterior: Quantifies the model’s certainty after observing data. As the number of observations ($N$) increases, model uncertainty reduces.
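With a zero-mean Gaussian prior on the weights, the MAP estimate has a closed form: ridge regression, where the prior shows up as a $\lambda I$ term in the normal equation. A sketch (the $\sin(\pi x)$ target, degree-8 basis, and $\lambda$ value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1, 1, 10))
t = np.sin(np.pi * x) + rng.normal(0, 0.2, x.size)

Phi = np.vander(x, N=9, increasing=True)  # degree-8 polynomial basis

# MLE / OLS: no prior, so weights are free to blow up on just 10 points
w_mle = np.linalg.pinv(Phi) @ t

# MAP with a zero-mean Gaussian prior on w: ridge regression.
# lam = sigma^2 / prior_variance acts as the "brake" on the weights.
lam = 1e-2
w_map = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

print("||w|| MLE:", np.linalg.norm(w_mle))
print("||w|| MAP:", np.linalg.norm(w_map))
```

The prior strictly shrinks the weight norm, which is exactly the “brakes” described above.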
2. General Cures
- Regularisation: Promote lower-order models or penalize complexity.
- Cross-Validation: A “bottom line” check to evaluate how well the model generalizes to unseen subsets of data.
- Data Cleaning: Reducing stochastic noise (though often difficult).
- Increase $N$: Gathering more training samples is the most direct way to reduce the gap between $E_{in}$ and $E_{out}$.
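The cross-validation “bottom line” check can be sketched with a simple k-fold loop (NumPy only; the quadratic target, fold count, and candidate degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 40)
t = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.3, x.size)

def cv_error(degree, k=5):
    """Mean validation MSE of a polynomial of the given degree over k folds."""
    idx = rng.permutation(x.size)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                  # hold the fold out
        w = np.polyfit(x[train], t[train], degree)       # fit on the rest
        errs.append(np.mean((np.polyval(w, x[fold]) - t[fold]) ** 2))
    return float(np.mean(errs))

for d in (1, 2, 6):
    print(f"degree {d}: CV error {cv_error(d):.3f}")
```

Because every point is scored only by a model that never saw it, the CV error estimates $E_{out}$ and exposes both the underfit and overfit candidates.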