The Intuition: Controlling Model Complexity

In machine learning, we often face a trade-off between fitting the training data and ensuring the model performs well on new data.

  • Free Fit: A higher-order model (e.g., 10th-order polynomial) might pass through every data point but creates a “wild” curve that misses the true target.
  • Restrained Fit: By “stepping back” from a complex hypothesis set (e.g., $\mathcal{H}_{10}$) to a simpler one (e.g., $\mathcal{H}_2$), we achieve a smoother fit that better represents the underlying trend.

Tip

Regularisation or Validation: Regularisation is the “brake” that prevents overfitting during training. Validation is the “bottom line” check to see if the brakes worked.

Mathematical Foundations: VC Analysis

From the Vapnik–Chervonenkis (VC) Bound, we know:

$$E_{\text{out}}(h) \le E_{\text{in}}(h) + \Omega(\mathcal{H})$$

Where $\Omega(\mathcal{H})$ represents the complexity of the hypothesis set. Regularisation changes the objective from minimising $E_{\text{in}}$ alone to minimising a combination of fit and complexity:

$$E_{\text{aug}}(h) = E_{\text{in}}(h) + \frac{\lambda}{N}\,\Omega(h)$$

By minimising the complexity $\Omega(h)$ of a single hypothesis $h$, we effectively control the complexity of the entire search space.

Regularisation as Constrained Optimisation

Instead of forcing weights to be zero (a hard constraint), we use a “soft” constraint on the size of the weight vector.

The Hard Constraint

We limit the budget of the weights using a hypersphere of radius $\sqrt{C}$:

$$w^{\mathsf{T}} w \le C$$

Solving via Lagrange Multipliers

At the optimal solution $w_{\text{reg}}$, the gradient of the error function $\nabla E_{\text{in}}$ must be parallel to the normal vector of the constraint boundary (which is $w$ itself). If they are not parallel, we could still move along the boundary to decrease $E_{\text{in}}$ without violating the constraint.
This leads to the Augmented Error formula:

$$E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N}\, w^{\mathsf{T}} w$$

  • $\lambda$ (Lambda): The regularisation parameter. It controls the trade-off between fit and complexity.
  • Large $\lambda$: High penalty for large weights; leads to underfitting.
  • Small $\lambda$: Low penalty; leads to overfitting.
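To see the trade-off concretely, here is a minimal numpy sketch (synthetic data and illustrative values, not from the original notes) of gradient descent on the augmented error. The update multiplies the weights by $(1 - 2\eta\lambda/N)$ at each step, which is where the name “weight decay” comes from:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
X = rng.normal(size=(N, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

def fit_weight_decay(lam, eta=0.05, steps=2000):
    """Minimise E_aug(w) = (1/N)||Xw - y||^2 + (lam/N) w.w by gradient descent."""
    w = np.zeros(3)
    for _ in range(steps):
        grad_in = (2.0 / N) * X.T @ (X @ w - y)          # gradient of E_in
        w = w * (1 - 2 * eta * lam / N) - eta * grad_in  # decay, then descend
    return w

w_small = fit_weight_decay(lam=0.01)   # barely regularised: close to least squares
w_large = fit_weight_decay(lam=100.0)  # heavily regularised: shrunk toward zero
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

The larger $\lambda$ produces a weight vector with a visibly smaller norm, at the cost of a worse fit to the training data.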

Types of Regulariser

L2 Regularisation (Ridge Regression)

  • Formula: $\Omega(w) = \sum_q w_q^2 = w^{\mathsf{T}} w$
  • Characteristics: Convex and differentiable everywhere. Easy to optimise.
  • Effect: Shrinks all weights uniformly (Weight Decay).
  • Analytical Solution: $w_{\text{reg}} = (X^{\mathsf{T}} X + \lambda I)^{-1} X^{\mathsf{T}} y$

The addition of $\lambda I$ ensures the matrix $X^{\mathsf{T}} X + \lambda I$ is invertible, even with collinearity.
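A sketch of the closed-form solution in numpy, on toy data with two perfectly collinear columns (where plain least squares would fail because $X^{\mathsf{T}} X$ is singular):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
# Two identical features: X^T X is singular without regularisation.
x = rng.normal(size=N)
X = np.column_stack([x, x])
y = 3.0 * x + 0.05 * rng.normal(size=N)

lam = 0.1
d = X.shape[1]
# w_reg = (X^T X + lam * I)^{-1} X^T y  -- invertible for any lam > 0
w_reg = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_reg)
```

Ridge splits the weight evenly across the duplicated feature, and the two coefficients sum to roughly the true slope.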

L1 Regularisation (LASSO)

  • Formula: $\Omega(w) = \sum_q |w_q| = \lVert w \rVert_1$
  • Characteristics: Convex but not differentiable at $w_q = 0$.
  • Effect: Promotes sparsity. It forces many coefficients to be exactly zero.
  • Usage: Best when you suspect only a few features are actually relevant (e.g., fitting a 50th-order polynomial to a 3rd-order process).

NOTE

Why Sparsity? Geometrically, the L1 constraint is a “diamond.” The contours of $E_{\text{in}}$ are likely to hit the corners of this diamond first, which lie on the axes where some coordinates are zero.
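The zeroing behaviour shows up directly in the soft-thresholding operator, the proximal step used by L1 solvers such as ISTA (a minimal sketch; the threshold value is illustrative):

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrinks each weight toward zero
    and sets any weight with |w_q| <= t exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([3.0, -0.4, 0.05, -2.0, 0.2])
print(soft_threshold(w, 0.5))  # the three small coefficients become exactly 0
```

Contrast this with the L2 update, which multiplies weights by a factor strictly between 0 and 1 and therefore never produces exact zeros.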

Bayesian Perspective: MAP Estimation

Regularisation can be viewed as incorporating prior knowledge into the model.

  • MLE (Maximum Likelihood): Assumes all weight values are equally likely.
  • MAP (Maximum A Posteriori): Assumes weights follow a prior distribution $P(w)$.
    • A Gaussian Prior on weights results in L2 (Ridge) regularisation.
    • A Laplacian Prior results in L1 (LASSO) regularisation.

The posterior distribution is proportional to:

$$P(w \mid \mathcal{D}) \propto P(\mathcal{D} \mid w)\, P(w)$$

Maximising the log-posterior is mathematically equivalent to minimising the augmented error $E_{\text{aug}}$.
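Spelling out the equivalence for the Gaussian case (a sketch; $\sigma^2$ is the prior variance, and additive constants are dropped):

$$
\begin{aligned}
-\ln P(w \mid \mathcal{D}) &= -\ln P(\mathcal{D} \mid w) \;-\; \ln P(w) \;+\; \text{const} \\
P(w) \propto e^{-\lVert w \rVert_2^2 / 2\sigma^2}
 \;\Rightarrow\;
 \arg\max_{w} P(w \mid \mathcal{D})
 &= \arg\min_{w} \Big( E_{\text{in}}(w) + \lambda \lVert w \rVert_2^2 \Big),
 \qquad \lambda \propto \tfrac{1}{2\sigma^2}
\end{aligned}
$$

A tighter prior (smaller $\sigma^2$) corresponds to a larger $\lambda$, matching the intuition that stronger prior belief in small weights means harder regularisation.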

Practical Tips & Common Mistakes

How to Choose $\lambda$?

  • More Noise = More Regularisation: If your data is “bumpy” (stochastic or deterministic noise), you need “harder brakes” (larger $\lambda$).
  • Validation: Never choose $\lambda$ based on training error. Use a validation set or cross-validation to find the $\lambda$ that minimises $E_{\text{val}}$.
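A sketch of selecting $\lambda$ by k-fold cross-validation with ridge regression (numpy only; the data, fold count, and grid of candidate values are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 60, 8
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[:2] = [1.5, -2.0]                      # only two relevant features
y = X @ w_true + 0.3 * rng.normal(size=N)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(lam, k=5):
    """Average squared validation error over k folds."""
    folds = np.array_split(np.arange(N), k)
    errs = []
    for val in folds:
        train = np.setdiff1d(np.arange(N), val)
        w = ridge(X[train], y[train], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errs)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=cv_error)          # chosen on held-out error only
print(best_lam)
```

Note that the training error is never consulted: each candidate $\lambda$ is scored only on data the model did not see.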

Common Mistakes

  • Applying too much $\lambda$: This causes the model to ignore the data entirely, leading to a flat line (underfitting).
  • Regularising the Bias ($w_0$): Usually, we do not regularise the intercept term, as it doesn’t contribute to the “wiggliness” of the curve.
  • Ignoring Feature Scaling: Regularisation is sensitive to the scale of features. Always standardise your data before applying L1 or L2.
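A minimal standardisation sketch (synthetic data with deliberately mismatched scales; in practice the mean and standard deviation must be computed on the training set only and reused for validation/test data):

```python
import numpy as np

rng = np.random.default_rng(3)
# Three features on wildly different scales: without standardisation the
# penalty ||w||^2 punishes the small-scale feature's (large) weight unfairly.
X = rng.normal(size=(100, 3)) * np.array([1.0, 1000.0, 0.001])

mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma
print(X_std.std(axis=0))  # each column now has zero mean and unit variance
```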