Frequentist vs Bayesian Paradigms

In linear regression, we seek the parameters (or weights) that best map predictors to targets.

Frequentist Approach (MLE & OLS)

  • View: Parameters are fixed but unknown.
  • Estimation: Computed using estimators such as Maximum Likelihood Estimation (MLE) or Ordinary Least Squares (OLS).
  • Validation: Confidence is evaluated through repeated experiments or cross-validation.
  • Assumption: MLE typically assumes additive Gaussian noise: $y = \mathbf{w}^\top \mathbf{x} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \beta^{-1})$.
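Under that Gaussian-noise assumption, the MLE coincides with the least-squares solution. A minimal sketch with synthetic data (the true weights and noise level here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 2x plus additive Gaussian noise (illustrative values)
x = rng.uniform(-1, 1, 50)
X = np.column_stack([np.ones(50), x])        # design matrix with a bias column
true_w = np.array([1.0, 2.0])
y = X @ true_w + rng.normal(0, 0.3, 50)

# OLS / Gaussian MLE: w_hat = argmin ||y - Xw||^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With 50 noisy points, `w_hat` lands close to the true weights; the frequentist view treats this single point estimate as the answer.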

Bayesian Approach

  • View: Parameters are random variables.
  • Estimation: We start with a prior belief and update it using a single observed dataset to form a posterior distribution.
  • Benefit: Naturally handles underdetermined systems (e.g., trying to fit a line through only one point), where a frequentist model would have infinitely many solutions.

The Bayesian Framework

The goal is to move from a prior belief about weights to a posterior distribution after seeing data.

The Components

  • Prior $p(\mathbf{w})$: Our belief about the weights before seeing any data. Often assumed to be Gaussian: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$.
  • Likelihood $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})$: The probability of observing the targets given the parameters and predictors.
  • Posterior $p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})$: The updated belief after observing data.

Info

Mathematical Solution for Posterior

If we use a conjugate prior (a Gaussian prior for a Gaussian likelihood), the posterior is also Gaussian: $p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}, \mathbf{S})$.

The Posterior Parameters

  • Mean: $\mathbf{m} = \beta\,\mathbf{S}\,\mathbf{X}^\top \mathbf{y}$
  • Covariance: $\mathbf{S} = \left(\alpha\mathbf{I} + \beta\,\mathbf{X}^\top\mathbf{X}\right)^{-1}$
  • Note: $\mathbf{X}$ is the design matrix, $\alpha$ is the prior precision, and $\beta$ is the noise precision.
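These two formulas translate directly into a few lines of numpy. A minimal sketch (the values of $\alpha$, $\beta$, and the true weights are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta = 2.0, 25.0                      # prior precision and noise precision (assumed values)
x = rng.uniform(-1, 1, 20)
X = np.column_stack([np.ones(20), x])        # design matrix
y = X @ np.array([0.5, -1.0]) + rng.normal(0, 1 / np.sqrt(beta), 20)

# Posterior covariance S = (alpha*I + beta*X^T X)^{-1}
S = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
# Posterior mean m = beta * S X^T y
m = beta * S @ X.T @ y
```

`m` is the most probable weight vector under the posterior, while `S` quantifies how uncertain we remain about each weight.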

Predictive Distribution and Uncertainty

Bayesian regression doesn’t just give one “best” line; it gives a family of possible lines sampled from the posterior.

  • Increasing Observations: As the number of data points $N$ increases, the “spread” of possible lines narrows.
  • Convergence: With enough data, the posterior distribution becomes very sharp, and the model uncertainty reduces significantly.

MAP Estimation and Regularisation

Maximum A Posteriori (MAP) estimation involves finding the weight vector that maximises the posterior distribution: $\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \, p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})$.

Connection to Ridge Regression

When we take the negative log of the posterior with a Gaussian prior, we get a cost function to minimise:

$$E(\mathbf{w}) = \frac{\beta}{2} \sum_{n=1}^{N} \left(y_n - \mathbf{w}^\top \mathbf{x}_n\right)^2 + \frac{\alpha}{2}\,\mathbf{w}^\top \mathbf{w}$$

  • The First Term: Sum of squared residuals (standard OLS objective).
  • The Second Term: The Regularisation Term ($\frac{\alpha}{2}\,\mathbf{w}^\top \mathbf{w}$), also known as a quadratic regulariser.
  • Intuition: This term “shrinks” the weights toward zero, which prevents the model from over-fitting to noise, especially when data is scarce.
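The equivalence can be checked numerically: the posterior mean (the MAP estimate for a Gaussian posterior) matches the ridge closed-form solution with penalty $\lambda = \alpha/\beta$. A minimal sketch with assumed precisions:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0
lam = alpha / beta                           # ridge penalty implied by the prior

x = rng.uniform(-1, 1, 30)
X = np.column_stack([np.ones(30), x])
y = X @ np.array([1.0, -0.5]) + rng.normal(0, 1 / np.sqrt(beta), 30)

# MAP estimate = posterior mean: beta * (alpha*I + beta*X^T X)^{-1} X^T y
S = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
w_map = beta * S @ X.T @ y

# Ridge closed form: (X^T X + lam*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

Dividing the MAP precision matrix through by $\beta$ shows the two expressions are the same matrix algebra, so `w_map` and `w_ridge` agree to machine precision.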

Tip

MAP estimation for Bayesian linear regression with a Gaussian prior is identical to Regularised Least Squares (Ridge Regression) with penalty $\lambda = \alpha/\beta$.