The Intuition: Controlling Model Complexity

In machine learning, we often face a trade-off between fitting the training data and ensuring the model performs well on new data.

  • Free Fit: A higher-order model (e.g., 10th-order polynomial) might pass through every data point but creates a “wild” curve that misses the true target.
  • Restrained Fit: By “stepping back” from a complex hypothesis set (e.g., $\mathcal{H}_{10}$) to a simpler one (e.g., $\mathcal{H}_2$), we achieve a smoother fit that better represents the underlying trend.

Tip

Regularisation or Validation: Regularisation is the “brake” that prevents overfitting during training. Validation is the “bottom line” check to see if the brakes worked.

Mathematical Foundations: VC Analysis

From the Vapnik–Chervonenkis (VC) Bound, we know:

$$E_{\text{out}}(h) \le E_{\text{in}}(h) + \Omega(\mathcal{H})$$

Where $\Omega(\mathcal{H})$ represents the complexity of the hypothesis set. Regularisation changes the objective from minimising $E_{\text{in}}$ alone to minimising a combination of fit and complexity:

$$E_{\text{aug}}(h) = E_{\text{in}}(h) + \frac{\lambda}{N}\,\Omega(h)$$

By minimising the complexity $\Omega(h)$ of a single hypothesis $h$, we effectively control the complexity of the entire search space.

Regularisation as Constrained Optimisation

Instead of forcing weights to be zero (a hard constraint), we use a “soft” constraint on the size of the weight vector.

The Hard Constraint

We limit the budget of the weights using a hypersphere of radius $\sqrt{C}$:

$$w^{\mathsf{T}} w \le C$$

Solving via Lagrange Multipliers

At the optimal solution $w_{\text{reg}}$, the gradient of the error function $\nabla E_{\text{in}}$ must be parallel to the normal vector of the constraint boundary (which is $w$ itself). If they are not parallel, we could still move along the boundary to decrease $E_{\text{in}}$ without violating the constraint.
This leads to the Augmented Error formula:

$$E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N}\, w^{\mathsf{T}} w$$

  • $\lambda$ (Lambda): The regularisation parameter. It controls the trade-off between fit and complexity.
  • Large $\lambda$: High penalty for large weights; leads to underfitting.
  • Small $\lambda$: Low penalty; leads to overfitting.
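To see the trade-off concretely, here is a minimal numpy sketch (synthetic data and illustrative values, not from the original notes) of gradient descent on the augmented error. The update multiplies the weights by $(1 - 2\eta\lambda/N)$ at each step, which is where the name “weight decay” comes from:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
X = rng.normal(size=(N, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

def fit_weight_decay(lam, eta=0.05, steps=2000):
    """Minimise E_aug(w) = (1/N)||Xw - y||^2 + (lam/N) w.w by gradient descent."""
    w = np.zeros(3)
    for _ in range(steps):
        grad_in = (2.0 / N) * X.T @ (X @ w - y)          # gradient of E_in
        w = w * (1 - 2 * eta * lam / N) - eta * grad_in  # decay, then descend
    return w

w_small = fit_weight_decay(lam=0.01)   # barely regularised: close to least squares
w_large = fit_weight_decay(lam=100.0)  # heavily regularised: shrunk toward zero
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

The larger $\lambda$ produces a weight vector with a visibly smaller norm, at the cost of a worse fit to the training data.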

Types of Regulariser

L2 Regularisation (Ridge Regression)

  • Formula: $\Omega(w) = \sum_q w_q^2 = w^{\mathsf{T}} w$
  • Characteristics: Convex and differentiable everywhere. Easy to optimise.
  • Effect: Shrinks all weights uniformly (Weight Decay).
  • Analytical Solution: $w_{\text{reg}} = (X^{\mathsf{T}} X + \lambda I)^{-1} X^{\mathsf{T}} y$

The addition of $\lambda I$ ensures the matrix $X^{\mathsf{T}} X + \lambda I$ is invertible, even with collinearity.
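A sketch of the closed-form solution in numpy, on toy data with two perfectly collinear columns (where plain least squares would fail because $X^{\mathsf{T}} X$ is singular):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
# Two identical features: X^T X is singular without regularisation.
x = rng.normal(size=N)
X = np.column_stack([x, x])
y = 3.0 * x + 0.05 * rng.normal(size=N)

lam = 0.1
d = X.shape[1]
# w_reg = (X^T X + lam * I)^{-1} X^T y  -- invertible for any lam > 0
w_reg = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_reg)
```

Ridge splits the weight evenly across the duplicated feature, and the two coefficients sum to roughly the true slope.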

L1 Regularisation (LASSO)

  • Formula: $\Omega(w) = \sum_q |w_q| = \lVert w \rVert_1$
  • Characteristics: Convex but not differentiable at $w_q = 0$.
  • Effect: Promotes sparsity. It forces many coefficients to be exactly zero.
  • Usage: Best when you suspect only a few features are actually relevant (e.g., fitting a 50th-order polynomial to a 3rd-order process).

NOTE

Why Sparsity? Geometrically, the L1 constraint is a “diamond.” The contours of $E_{\text{in}}$ are likely to hit the corners of this diamond first, which lie on the axes where some coordinates are zero.
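The zeroing behaviour shows up directly in the soft-thresholding operator, the proximal step used by L1 solvers such as ISTA (a minimal sketch; the threshold value is illustrative):

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrinks each weight toward zero
    and sets any weight with |w_q| <= t exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([3.0, -0.4, 0.05, -2.0, 0.2])
print(soft_threshold(w, 0.5))  # the three small coefficients become exactly 0
```

Contrast this with the L2 update, which multiplies weights by a factor strictly between 0 and 1 and therefore never produces exact zeros.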

Bayesian Perspective: MAP Estimation

Regularisation can be viewed as incorporating prior knowledge into the model.

  • MLE (Maximum Likelihood): Assumes all weight values are equally likely.
  • MAP (Maximum A Posteriori): Assumes weights follow a prior distribution $P(w)$.
    • A Gaussian Prior on weights results in L2 (Ridge) regularisation.
    • A Laplacian Prior results in L1 (LASSO) regularisation.

The posterior distribution is proportional to:

$$P(w \mid \mathcal{D}) \propto P(\mathcal{D} \mid w)\, P(w)$$

Maximising the log-posterior is mathematically equivalent to minimising the augmented error $E_{\text{aug}}$.
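Spelling out the equivalence for the Gaussian case (a sketch; $\sigma^2$ is the prior variance, and additive constants are dropped):

$$
\begin{aligned}
-\ln P(w \mid \mathcal{D}) &= -\ln P(\mathcal{D} \mid w) \;-\; \ln P(w) \;+\; \text{const} \\
P(w) \propto e^{-\lVert w \rVert_2^2 / 2\sigma^2}
 \;\Rightarrow\;
 \arg\max_{w} P(w \mid \mathcal{D})
 &= \arg\min_{w} \Big( E_{\text{in}}(w) + \lambda \lVert w \rVert_2^2 \Big),
 \qquad \lambda \propto \tfrac{1}{2\sigma^2}
\end{aligned}
$$

A tighter prior (smaller $\sigma^2$) corresponds to a larger $\lambda$, matching the intuition that stronger prior belief in small weights means harder regularisation.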

Practical Tips & Common Mistakes

How to Choose $\lambda$?

  • More Noise = More Regularisation: If your data is “bumpy” (stochastic or deterministic noise), you need “harder brakes” (larger $\lambda$).
  • Validation: Never choose $\lambda$ based on training error. Use a validation set or cross-validation to find the $\lambda$ that minimises $E_{\text{val}}$.
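A sketch of selecting $\lambda$ by k-fold cross-validation with ridge regression (numpy only; the data, fold count, and grid of candidate values are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 60, 8
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[:2] = [1.5, -2.0]                      # only two relevant features
y = X @ w_true + 0.3 * rng.normal(size=N)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(lam, k=5):
    """Average squared validation error over k folds."""
    folds = np.array_split(np.arange(N), k)
    errs = []
    for val in folds:
        train = np.setdiff1d(np.arange(N), val)
        w = ridge(X[train], y[train], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errs)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=cv_error)          # chosen on held-out error only
print(best_lam)
```

Note that the training error is never consulted: each candidate $\lambda$ is scored only on data the model did not see.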

Common Mistakes

  • Applying too much $\lambda$: This causes the model to ignore the data entirely, leading to a flat line (underfitting).
  • Regularising the Bias ($w_0$): Usually, we do not regularise the intercept term, as it doesn’t contribute to the “wiggliness” of the curve.
  • Ignoring Feature Scaling: Regularisation is sensitive to the scale of features. Always standardise your data before applying L1 or L2.
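A minimal standardisation sketch (synthetic data with deliberately mismatched scales; in practice the mean and standard deviation must be computed on the training set only and reused for validation/test data):

```python
import numpy as np

rng = np.random.default_rng(3)
# Three features on wildly different scales: without standardisation the
# penalty ||w||^2 punishes the small-scale feature's (large) weight unfairly.
X = rng.normal(size=(100, 3)) * np.array([1.0, 1000.0, 0.001])

mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma
print(X_std.std(axis=0))  # each column now has zero mean and unit variance
```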