Paradigms of Machine Learning

Machine Learning seeks to determine the “best fit” for data, such as deciding the optimal degree of a polynomial regression. There are two primary schools of thought:

Frequentist View

  • Parameters : Considered fixed, unknown constants.
  • Methodology: Uses estimators such as MLE; confidence comes from behaviour over repeated experiments (e.g., cross-validation).

Bayesian View

  • Parameters : Treated as random variables.
  • Methodology: Uncertainty in the parameters θ is expressed as a probability distribution based on a single observed dataset.
  • Bayes’ Law: Used to update beliefs as more data D arrives:

    p(θ | D) = p(D | θ) · p(θ) / p(D)

Essentially: Posterior ∝ Likelihood × Prior.
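As a concrete sketch of this updating, consider a hypothetical coin-flip model with a conjugate Beta prior on the heads probability θ; the posterior is again a Beta, so the update has a closed form. All names and numbers below are illustrative:

```python
import numpy as np

# Beta(a, b) prior on theta, Bernoulli likelihood for each flip.
# Conjugacy means the Bayes update just adds the observed counts.
prior_a, prior_b = 2.0, 2.0           # Beta(2, 2) prior belief
flips = np.array([1, 0, 1, 1, 0, 1])  # observed data: 1 = heads

heads, tails = flips.sum(), (1 - flips).sum()
post_a, post_b = prior_a + heads, prior_b + tails  # posterior ∝ likelihood × prior

prior_mean = prior_a / (prior_a + prior_b)
post_mean = post_a / (post_a + post_b)
print(prior_mean, post_mean)  # 0.5 -> 0.6: belief shifts toward the data
```

Each new batch of flips can be folded in the same way, with the current posterior acting as the next prior.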

The Gaussian Distribution

The Gaussian (Normal) distribution is ubiquitous due to the Central Limit Theorem, which states that the sum of many independent variables tends towards a Gaussian shape.
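A quick numerical sketch of this effect, using sums of Uniform(0, 1) draws (each individually far from Gaussian); the sample sizes are arbitrary:

```python
import numpy as np

# CLT check: standardised sums of 30 i.i.d. uniforms behave like N(0, 1).
rng = np.random.default_rng(0)
n_terms, n_samples = 30, 100_000
sums = rng.uniform(0.0, 1.0, size=(n_samples, n_terms)).sum(axis=1)

# Standardise and compare tail coverage with a standard Gaussian.
z = (sums - sums.mean()) / sums.std()
print(np.mean(np.abs(z) < 1.96))  # close to 0.95, as for N(0, 1)
```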

Univariate Gaussian

Defined by a mean μ and a variance σ²:

    N(x | μ, σ²) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Multivariate Gaussian

For a D-dimensional vector x, the distribution uses a mean vector μ and a D × D covariance matrix Σ:

    N(x | μ, Σ) = (2π)^(−D/2) · |Σ|^(−1/2) · exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
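This density is straightforward to evaluate with NumPy; the function name `mvn_pdf` and the example numbers below are illustrative:

```python
import numpy as np

# Sketch: evaluate the D-dimensional Gaussian density at a point x,
# given a mean vector mu and covariance matrix Sigma.
def mvn_pdf(x, mu, Sigma):
    D = mu.shape[0]
    diff = x - mu
    maha_sq = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** -0.5
    return norm * np.exp(-0.5 * maha_sq)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
print(mvn_pdf(mu, mu, Sigma))  # density is largest at the mean
```

Using `np.linalg.solve` rather than explicitly inverting Σ is the usual numerically safer choice.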

Tip

Mahalanobis Distance: The term Δ² = (x − μ)ᵀ Σ⁻¹ (x − μ) represents the squared distance of a point from the mean, accounting for correlations in the data. It simplifies to the Euclidean distance when Σ is the identity matrix.
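A minimal sketch of the contrast (function name and numbers are illustrative):

```python
import numpy as np

# With Sigma = I the Mahalanobis distance reduces to the Euclidean distance;
# with a correlated Sigma the two differ.
def mahalanobis(x, mu, Sigma):
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

x, mu = np.array([2.0, 1.0]), np.zeros(2)
print(mahalanobis(x, mu, np.eye(2)))               # equals ||x - mu|| = sqrt(5)
print(mahalanobis(x, mu, np.array([[2.0, 0.9],
                                   [0.9, 1.0]])))  # rescaled by the covariance
```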

Maximum Likelihood Estimation (MLE)

MLE aims to find the parameters (θ) that make the observed dataset most probable.

The Process

  1. Likelihood Function: Assuming the data points are independent and identically distributed (i.i.d.), the likelihood is the product of the individual probabilities: p(D | θ) = ∏ₙ p(xₙ | θ).
  2. Log-Likelihood ℓ(θ): We take the natural log because it transforms the product into a sum, ℓ(θ) = Σₙ ln p(xₙ | θ), making the differentiation much easier.
  3. Optimisation: Set the partial derivatives of ℓ(θ) with respect to μ and Σ to zero.

Results for Gaussian MLE:

  • μ_ML = (1/N) Σₙ xₙ (the sample mean).
  • Σ_ML = (1/N) Σₙ (xₙ − μ_ML)(xₙ − μ_ML)ᵀ (the sample covariance).
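These closed-form estimates are easy to verify numerically; the sample size and the true parameters below are arbitrary:

```python
import numpy as np

# Draw from a known Gaussian, then recover its parameters with the
# MLE formulas: sample mean and the (biased, 1/N) sample covariance.
rng = np.random.default_rng(1)
X = rng.normal(loc=[1.0, -2.0], scale=[0.5, 1.5], size=(10_000, 2))

N = X.shape[0]
mu_ml = X.mean(axis=0)                # (1/N) sum_n x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / N  # (1/N) sum_n (x_n - mu)(x_n - mu)^T

print(mu_ml)     # close to [1.0, -2.0]
print(Sigma_ml)  # close to diag(0.25, 2.25)
```

Note the 1/N divisor: the MLE covariance is biased, which is why the unbiased estimator divides by N − 1 instead.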

Linear Regression: OLS vs Probabilistic View

Ordinary Least Squares (OLS)

OLS solves for the weights w by minimising the sum of squared residuals.

  • Design Matrix (Φ): An N × M matrix where each row holds the M basis-function values of one data point.
  • Normal Equation: w_ML = (ΦᵀΦ)⁻¹ Φᵀ t.
  • Moore-Penrose Pseudo-inverse: The term (ΦᵀΦ)⁻¹ Φᵀ is denoted as Φ†.
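A minimal NumPy sketch of the normal-equation solution on synthetic data (all parameters illustrative); `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse directly:

```python
import numpy as np

# Fit a line t = w0 + w1 * x by OLS via the pseudo-inverse of the
# design matrix Phi (columns: the basis functions 1 and x).
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.05, size=x.shape)  # noisy line

Phi = np.vander(x, N=2, increasing=True)  # rows [1, x_n]
w = np.linalg.pinv(Phi) @ t               # w = (Phi^T Phi)^{-1} Phi^T t
print(w)  # close to the true weights [1.0, 2.0]
```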

The Probabilistic View

We assume the target t is a linear combination of the inputs plus some unexplained Gaussian noise:

    t = wᵀφ(x) + ε

where ε ~ N(0, β⁻¹) and β is the precision (inverse variance).
This leads to a conditional target distribution:

    p(t | x, w, β) = N(t | wᵀφ(x), β⁻¹)

The Equivalence: Why MLE = OLS

Under the assumption of additive Gaussian noise, maximising the likelihood of the regression model is mathematically identical to minimising the sum of squared errors in OLS.

Step-by-Step Intuition

  1. Write the log-likelihood for the regression model.
  2. Observe that the only term in ℓ(w) that depends on the weights is the sum of squared residuals: E(w) = Σₙ (tₙ − wᵀφ(xₙ))².
  3. Because this term appears with a negative sign in the log-likelihood, maximising ℓ(w) requires minimising E(w).
  4. Therefore, w_MLE = w_OLS.
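The equivalence can be checked numerically: on synthetic data, the Gaussian log-likelihood evaluated at the OLS weights drops under any perturbation, because OLS already minimises the residual term. All names and numbers below are illustrative:

```python
import numpy as np

# Gaussian log-likelihood of a linear model; only the residual term
# depends on w, with a negative sign in front of it.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 40)
t = 0.5 + 3.0 * x + rng.normal(0.0, 0.1, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])
beta = 1.0 / 0.1**2  # noise precision, assumed known here

def log_likelihood(w):
    resid = t - Phi @ w
    n = len(t)
    return 0.5 * n * np.log(beta / (2 * np.pi)) - 0.5 * beta * (resid @ resid)

w_ols, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimises squared error
for dw in [np.zeros(2), np.array([0.05, 0.0]), np.array([0.0, -0.05])]:
    print(log_likelihood(w_ols + dw))  # any perturbation lowers the likelihood
```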

IMPORTANT

OLS requires no specific distribution assumptions to function as an optimisation tool, but it only gains an MLE interpretation if we assume the noise follows a Gaussian distribution.