Paradigms of Machine Learning

Machine Learning seeks to determine the “best fit” for data, such as deciding the optimal degree of a polynomial regression. There are two primary schools of thought:

Frequentist View

  • Parameters : Considered fixed, unknown constants.
  • Methodology: Uses estimators such as MLE; confidence comes from behaviour over repeated experiments (e.g., cross-validation).

Bayesian View

  • Parameters : Treated as random variables.
  • Methodology: Uncertainty in the parameters θ is expressed as a probability distribution based on a single observed dataset.
  • Bayes’ Law: Used to update beliefs as more data D arrives:

    p(θ | D) = p(D | θ) · p(θ) / p(D)

Essentially: Posterior ∝ Likelihood × Prior.
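As a concrete sketch of this updating, consider a hypothetical coin-flip model with a conjugate Beta prior on the heads probability θ; the posterior is again a Beta, so the update has a closed form. All names and numbers below are illustrative:

```python
import numpy as np

# Beta(a, b) prior on theta, Bernoulli likelihood for each flip.
# Conjugacy means the Bayes update just adds the observed counts.
prior_a, prior_b = 2.0, 2.0           # Beta(2, 2) prior belief
flips = np.array([1, 0, 1, 1, 0, 1])  # observed data: 1 = heads

heads, tails = flips.sum(), (1 - flips).sum()
post_a, post_b = prior_a + heads, prior_b + tails  # posterior ∝ likelihood × prior

prior_mean = prior_a / (prior_a + prior_b)
post_mean = post_a / (post_a + post_b)
print(prior_mean, post_mean)  # 0.5 -> 0.6: belief shifts toward the data
```

Each new batch of flips can be folded in the same way, with the current posterior acting as the next prior.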

The Gaussian Distribution

The Gaussian (Normal) distribution is ubiquitous due to the Central Limit Theorem, which states that the sum of many independent variables tends towards a Gaussian shape.
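A quick numerical sketch of this effect, using sums of Uniform(0, 1) draws (each individually far from Gaussian); the sample sizes are arbitrary:

```python
import numpy as np

# CLT check: standardised sums of 30 i.i.d. uniforms behave like N(0, 1).
rng = np.random.default_rng(0)
n_terms, n_samples = 30, 100_000
sums = rng.uniform(0.0, 1.0, size=(n_samples, n_terms)).sum(axis=1)

# Standardise and compare tail coverage with a standard Gaussian.
z = (sums - sums.mean()) / sums.std()
print(np.mean(np.abs(z) < 1.96))  # close to 0.95, as for N(0, 1)
```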

Univariate Gaussian

Defined by a mean μ and a variance σ²:

    N(x | μ, σ²) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Multivariate Gaussian

For a D-dimensional vector x, the distribution uses a mean vector μ and a D × D covariance matrix Σ:

    N(x | μ, Σ) = (2π)^(−D/2) · |Σ|^(−1/2) · exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
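This density is straightforward to evaluate with NumPy; the function name `mvn_pdf` and the example numbers below are illustrative:

```python
import numpy as np

# Sketch: evaluate the D-dimensional Gaussian density at a point x,
# given a mean vector mu and covariance matrix Sigma.
def mvn_pdf(x, mu, Sigma):
    D = mu.shape[0]
    diff = x - mu
    maha_sq = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** -0.5
    return norm * np.exp(-0.5 * maha_sq)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
print(mvn_pdf(mu, mu, Sigma))  # density is largest at the mean
```

Using `np.linalg.solve` rather than explicitly inverting Σ is the usual numerically safer choice.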

Tip

Mahalanobis Distance: The term Δ² = (x − μ)ᵀ Σ⁻¹ (x − μ) represents the squared distance of a point from the mean, accounting for correlations in the data. It simplifies to the Euclidean distance when Σ is the identity matrix.
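A minimal sketch of the contrast (function name and numbers are illustrative):

```python
import numpy as np

# With Sigma = I the Mahalanobis distance reduces to the Euclidean distance;
# with a correlated Sigma the two differ.
def mahalanobis(x, mu, Sigma):
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

x, mu = np.array([2.0, 1.0]), np.zeros(2)
print(mahalanobis(x, mu, np.eye(2)))               # equals ||x - mu|| = sqrt(5)
print(mahalanobis(x, mu, np.array([[2.0, 0.9],
                                   [0.9, 1.0]])))  # rescaled by the covariance
```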

Maximum Likelihood Estimation (MLE)

MLE aims to find the parameters (θ) that make the observed dataset most probable.

The Process

  1. Likelihood Function: Assuming the data points are independent and identically distributed (i.i.d.), the likelihood is the product of the individual probabilities: p(D | θ) = ∏ₙ p(xₙ | θ).
  2. Log-Likelihood ℓ(θ): We take the natural log because it transforms the product into a sum, ℓ(θ) = Σₙ ln p(xₙ | θ), making the differentiation much easier.
  3. Optimisation: Set the partial derivatives of ℓ(θ) with respect to μ and Σ to zero.

Results for Gaussian MLE:

  • μ_ML = (1/N) Σₙ xₙ (the sample mean).
  • Σ_ML = (1/N) Σₙ (xₙ − μ_ML)(xₙ − μ_ML)ᵀ (the sample covariance).
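These closed-form estimates are easy to verify numerically; the sample size and the true parameters below are arbitrary:

```python
import numpy as np

# Draw from a known Gaussian, then recover its parameters with the
# MLE formulas: sample mean and the (biased, 1/N) sample covariance.
rng = np.random.default_rng(1)
X = rng.normal(loc=[1.0, -2.0], scale=[0.5, 1.5], size=(10_000, 2))

N = X.shape[0]
mu_ml = X.mean(axis=0)                # (1/N) sum_n x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / N  # (1/N) sum_n (x_n - mu)(x_n - mu)^T

print(mu_ml)     # close to [1.0, -2.0]
print(Sigma_ml)  # close to diag(0.25, 2.25)
```

Note the 1/N divisor: the MLE covariance is biased, which is why the unbiased estimator divides by N − 1 instead.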

Linear Regression: OLS vs Probabilistic View

Ordinary Least Squares (OLS)

OLS solves for the weights w by minimising the sum of squared residuals.

  • Design Matrix (Φ): An N × M matrix where each row holds the M basis-function values of one data point.
  • Normal Equation: w_ML = (ΦᵀΦ)⁻¹ Φᵀ t.
  • Moore-Penrose Pseudo-inverse: The term (ΦᵀΦ)⁻¹ Φᵀ is denoted as Φ†.
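A minimal NumPy sketch of the normal-equation solution on synthetic data (all parameters illustrative); `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse directly:

```python
import numpy as np

# Fit a line t = w0 + w1 * x by OLS via the pseudo-inverse of the
# design matrix Phi (columns: the basis functions 1 and x).
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.05, size=x.shape)  # noisy line

Phi = np.vander(x, N=2, increasing=True)  # rows [1, x_n]
w = np.linalg.pinv(Phi) @ t               # w = (Phi^T Phi)^{-1} Phi^T t
print(w)  # close to the true weights [1.0, 2.0]
```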

The Probabilistic View

We assume the target t is a linear combination of the inputs plus some unexplained Gaussian noise:

    t = wᵀφ(x) + ε

where ε ~ N(0, β⁻¹) and β is the precision (inverse variance).
This leads to a conditional target distribution:

    p(t | x, w, β) = N(t | wᵀφ(x), β⁻¹)

The Equivalence: Why MLE = OLS

Under the assumption of additive Gaussian noise, maximising the likelihood of the regression model is mathematically identical to minimising the sum of squared errors in OLS.

Step-by-Step Intuition

  1. Write the log-likelihood for the regression model.
  2. Observe that the only term in ℓ(w) that depends on the weights is the sum of squared residuals: E(w) = Σₙ (tₙ − wᵀφ(xₙ))².
  3. Because this term appears with a negative sign in the log-likelihood, maximising ℓ(w) requires minimising E(w).
  4. Therefore, w_MLE = w_OLS.
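The equivalence can be checked numerically: on synthetic data, the Gaussian log-likelihood evaluated at the OLS weights drops under any perturbation, because OLS already minimises the residual term. All names and numbers below are illustrative:

```python
import numpy as np

# Gaussian log-likelihood of a linear model; only the residual term
# depends on w, with a negative sign in front of it.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 40)
t = 0.5 + 3.0 * x + rng.normal(0.0, 0.1, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])
beta = 1.0 / 0.1**2  # noise precision, assumed known here

def log_likelihood(w):
    resid = t - Phi @ w
    n = len(t)
    return 0.5 * n * np.log(beta / (2 * np.pi)) - 0.5 * beta * (resid @ resid)

w_ols, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimises squared error
for dw in [np.zeros(2), np.array([0.05, 0.0]), np.array([0.0, -0.05])]:
    print(log_likelihood(w_ols + dw))  # any perturbation lowers the likelihood
```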

IMPORTANT

OLS requires no specific distribution assumptions to function as an optimisation tool, but it only gains an MLE interpretation if we assume the noise follows a Gaussian distribution.