Frequentist vs Bayesian Paradigms
In linear regression, we seek the parameters (or weights) that best map predictors to targets.
Frequentist Approach (MLE & OLS)
- View: Parameters are fixed but unknown
- Estimation: Computed using estimators like Maximum Likelihood Estimation (MLE) or Ordinary Least Squares (OLS).
- Validation: Confidence is evaluated through repeated experiments or cross-validation
- Assumption: MLE typically assumes additive Gaussian noise: $y = \mathbf{w}^\top \mathbf{x} + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \beta^{-1})$, where $\beta$ is the noise precision.
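As a minimal sketch (assuming a hypothetical 1-D toy problem with true line $y = 1 + 2x$ and Gaussian noise), the frequentist OLS/MLE estimate can be computed in closed form:

```python
import numpy as np

# Hypothetical toy data: y = 1 + 2x plus additive Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])   # design matrix with a bias column
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

# Under additive Gaussian noise, the MLE coincides with OLS:
# w = argmin ||y - Xw||^2, solved here by least squares
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_ols)  # close to [1.0, 2.0]
```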
Bayesian Approach
- View: Parameters are random variables.
- Estimation: We start with a prior belief and update it using a single observed dataset to form a posterior distribution.
- Benefit: Naturally handles underdetermined systems (e.g., trying to fit a line through only one point), where a frequentist model would have infinitely many solutions.
The Bayesian Framework
The goal is to move from a prior belief about weights to a posterior distribution after seeing data.
The Components
- Prior $p(\mathbf{w})$: Our belief about the weights before seeing any data. Often assumed to be Gaussian: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} I)$.
- Likelihood $p(\mathbf{y} \mid X, \mathbf{w})$: The probability of observing the targets given the parameters and predictors.
- Posterior $p(\mathbf{w} \mid X, \mathbf{y}) \propto p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})$: The updated belief after observing data.
Info
Mathematical Solution for Posterior
If we use a conjugate prior (a Gaussian prior for a Gaussian likelihood), the posterior is also a Gaussian distribution.
The Posterior Parameters
- Mean: $\mathbf{m} = \beta\, S\, X^\top \mathbf{y}$
- Covariance: $S = \left(\alpha I + \beta X^\top X\right)^{-1}$
- Note: $X$ is the design matrix, $\alpha$ is the prior precision, and $\beta$ is the noise precision.
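A minimal sketch of the posterior update on toy data, assuming hypothetical values for the prior precision (alpha) and noise precision (beta):

```python
import numpy as np

# Toy data (hypothetical): y = 1 + 2x plus Gaussian noise
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
X = np.column_stack([np.ones_like(x), x])   # design matrix
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

alpha, beta = 2.0, 100.0                    # prior precision, noise precision (assumed)

# Posterior covariance: S = (alpha*I + beta * X^T X)^{-1}
S = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
# Posterior mean: m = beta * S * X^T y
m = beta * S @ X.T @ y
print(m)  # near [1.0, 2.0], shrunk slightly toward zero by the prior
```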
Predictive Distribution and Uncertainty
Bayesian regression doesn’t just give one “best” line; it gives a family of possible lines sampled from the posterior.
- Increasing Observations: As the number of data points $N$ increases, the “spread” of possible lines narrows.
- Convergence: With enough data, the posterior distribution becomes very sharp, and the model uncertainty reduces significantly.
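The narrowing spread can be sketched as follows, sampling candidate lines from the posterior and comparing the total posterior variance after few versus many observations (the data-generating line and precision values are assumptions for illustration):

```python
import numpy as np

def posterior(X, y, alpha=2.0, beta=100.0):
    """Posterior mean and covariance for Bayesian linear regression."""
    S = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)
    m = beta * S @ X.T @ y
    return m, S

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    X = np.column_stack([np.ones_like(x), x])
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=n)
    return X, y

# Posterior after few vs. many observations
_, S_small = posterior(*make_data(3))
m_big, S_big = posterior(*make_data(200))

# A family of candidate lines: 5 (intercept, slope) pairs from the posterior
lines = rng.multivariate_normal(m_big, S_big, size=5)

# Total posterior variance shrinks as N grows
print(np.trace(S_small), np.trace(S_big))
```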
MAP Estimation and Regularisation
Maximum A Posteriori (MAP) estimation involves finding the weights that maximise the posterior distribution: $\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} p(\mathbf{w} \mid X, \mathbf{y})$.
Connection to Ridge Regression
When we take the negative log of the posterior with a Gaussian prior, we get a cost function to minimise: $J(\mathbf{w}) = \sum_{n=1}^{N} \left(y_n - \mathbf{w}^\top \mathbf{x}_n\right)^2 + \lambda \lVert \mathbf{w} \rVert^2$, with $\lambda = \alpha / \beta$.
- The First Term: Sum of squared residuals (standard OLS objective).
- The Second Term: The Regularisation Term ($\lambda \lVert \mathbf{w} \rVert^2$), also known as a quadratic regulariser.
- Intuition: This term “shrinks” the weights toward zero, which prevents the model from over-fitting to noise, especially when data is scarce.
Tip
MAP estimation for Bayesian linear regression with a Gaussian prior is identical to Regularised Least Squares or Ridge Regression.
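This equivalence can be checked numerically; a sketch on hypothetical toy data, assuming precision values alpha and beta so the implied ridge penalty is $\lambda = \alpha/\beta$:

```python
import numpy as np

# Hypothetical toy data
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=40)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

alpha, beta = 2.0, 100.0
lam = alpha / beta                           # ridge penalty implied by the prior

# MAP estimate = posterior mean (Gaussian prior + Gaussian likelihood)
S = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
w_map = beta * S @ X.T @ y

# Ridge regression closed form: (X^T X + lam*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(np.allclose(w_map, w_ridge))  # True: the two estimates coincide
```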