Frequentist vs Bayesian Paradigms

In linear regression, we seek the parameters (or weights) that best map predictors to targets.

Frequentist Approach (MLE & OLS)

  • View: Parameters are fixed but unknown.
  • Estimation: Computed using estimators such as Maximum Likelihood Estimation (MLE) or Ordinary Least Squares (OLS).
  • Validation: Confidence is evaluated through repeated experiments or cross-validation.
  • Assumption: MLE typically assumes additive Gaussian noise: $y = \mathbf{w}^\top \mathbf{x} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \beta^{-1})$.
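Under that Gaussian-noise assumption, the MLE coincides with the least-squares solution. A minimal sketch with synthetic data (the true weights and noise level here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 2x plus additive Gaussian noise (illustrative values)
x = rng.uniform(-1, 1, 50)
X = np.column_stack([np.ones(50), x])        # design matrix with a bias column
true_w = np.array([1.0, 2.0])
y = X @ true_w + rng.normal(0, 0.3, 50)

# OLS / Gaussian MLE: w_hat = argmin ||y - Xw||^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With 50 noisy points, `w_hat` lands close to the true weights; the frequentist view treats this single point estimate as the answer.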

Bayesian Approach

  • View: Parameters are random variables.
  • Estimation: We start with a prior belief and update it using a single observed dataset to form a posterior distribution.
  • Benefit: Naturally handles underdetermined systems (e.g., trying to fit a line through only one point), where a frequentist model would have infinitely many solutions.

The Bayesian Framework

The goal is to move from a prior belief about weights to a posterior distribution after seeing data.

The Components

  • Prior $p(\mathbf{w})$: Our belief about the weights before seeing any data. Often assumed to be Gaussian: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$.
  • Likelihood $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})$: The probability of observing the targets given the parameters and predictors.
  • Posterior $p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})$: The updated belief after observing data.

Info

Mathematical Solution for Posterior

If we use a conjugate prior (a Gaussian prior for a Gaussian likelihood), the posterior is also Gaussian: $p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}, \mathbf{S})$.

The Posterior Parameters

  • Mean: $\mathbf{m} = \beta\,\mathbf{S}\,\mathbf{X}^\top \mathbf{y}$
  • Covariance: $\mathbf{S} = \left(\alpha\mathbf{I} + \beta\,\mathbf{X}^\top\mathbf{X}\right)^{-1}$
  • Note: $\mathbf{X}$ is the design matrix, $\alpha$ is the prior precision, and $\beta$ is the noise precision.
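These two formulas translate directly into a few lines of numpy. A minimal sketch (the values of $\alpha$, $\beta$, and the true weights are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta = 2.0, 25.0                      # prior precision and noise precision (assumed values)
x = rng.uniform(-1, 1, 20)
X = np.column_stack([np.ones(20), x])        # design matrix
y = X @ np.array([0.5, -1.0]) + rng.normal(0, 1 / np.sqrt(beta), 20)

# Posterior covariance S = (alpha*I + beta*X^T X)^{-1}
S = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
# Posterior mean m = beta * S X^T y
m = beta * S @ X.T @ y
```

`m` is the most probable weight vector under the posterior, while `S` quantifies how uncertain we remain about each weight.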

Predictive Distribution and Uncertainty

Bayesian regression doesn’t just give one “best” line; it gives a family of possible lines sampled from the posterior.

  • Increasing Observations: As the number of data points $N$ increases, the “spread” of possible lines narrows.
  • Convergence: With enough data, the posterior distribution becomes very sharp, and the model uncertainty reduces significantly.

MAP Estimation and Regularisation

Maximum A Posteriori (MAP) estimation involves finding the weight vector that maximises the posterior distribution: $\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \, p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})$.

Connection to Ridge Regression

When we take the negative log of the posterior with a Gaussian prior, we get a cost function to minimise:

$$E(\mathbf{w}) = \frac{\beta}{2} \sum_{n=1}^{N} \left(y_n - \mathbf{w}^\top \mathbf{x}_n\right)^2 + \frac{\alpha}{2}\,\mathbf{w}^\top \mathbf{w}$$

  • The First Term: Sum of squared residuals (standard OLS objective).
  • The Second Term: The Regularisation Term ($\frac{\alpha}{2}\,\mathbf{w}^\top \mathbf{w}$), also known as a quadratic regulariser.
  • Intuition: This term “shrinks” the weights toward zero, which prevents the model from over-fitting to noise, especially when data is scarce.
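The equivalence can be checked numerically: the posterior mean (the MAP estimate for a Gaussian posterior) matches the ridge closed-form solution with penalty $\lambda = \alpha/\beta$. A minimal sketch with assumed precisions:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0
lam = alpha / beta                           # ridge penalty implied by the prior

x = rng.uniform(-1, 1, 30)
X = np.column_stack([np.ones(30), x])
y = X @ np.array([1.0, -0.5]) + rng.normal(0, 1 / np.sqrt(beta), 30)

# MAP estimate = posterior mean: beta * (alpha*I + beta*X^T X)^{-1} X^T y
S = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
w_map = beta * S @ X.T @ y

# Ridge closed form: (X^T X + lam*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

Dividing the MAP precision matrix through by $\beta$ shows the two expressions are the same matrix algebra, so `w_map` and `w_ridge` agree to machine precision.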

Tip

MAP estimation for Bayesian linear regression with a Gaussian prior is identical to Regularised Least Squares (Ridge Regression) with penalty $\lambda = \alpha/\beta$.