Fundamentals of Regression

Regression is a supervised learning task where the goal is to predict a continuous numerical value (e.g., a credit line amount) based on input features. This differs from classification, which predicts discrete categories (e.g., “good” or “bad” customer).

The Linear Assumption

Linear regression assumes a linear functional relationship between the predictor variables $\mathbf{x}$ and the target variable $y$. The standard model is expressed as

$$y = \mathbf{w}^\top \mathbf{x} + \epsilon$$

where $\mathbf{w}$ represents the learned weights and $\epsilon$ represents the error, or noise.

Tip

Even if the relationship between $x$ and $y$ looks curved (non-linear), we still call it “linear regression” as long as the model is linear in its parameters $\mathbf{w}$.

Ordinary Least Squares (OLS)

The objective of OLS is to find the optimal weights $\mathbf{w}$ by minimising the Residual Sum of Squares (RSS).

Step-by-Step Derivation

  1. Define the Residual: $e_i = y_i - \hat{y}_i$, the error for a single data point.
  2. Define the Cost Function: $E(\mathbf{w}) = \sum_i e_i^2$, the sum of all squared residuals.
  3. Minimise: Take the partial derivatives of $E(\mathbf{w})$ with respect to each weight and set them to zero.
  4. Solve the System: This leads to a set of linear equations known as the Normal Equations.
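In matrix form, the steps above collapse into one line: setting the gradient of the squared error to zero yields the Normal Equations directly (standard derivation, stated here for reference):

```latex
E(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2
             = (\mathbf{y} - X\mathbf{w})^\top (\mathbf{y} - X\mathbf{w})

\nabla_{\mathbf{w}} E = -2\, X^\top (\mathbf{y} - X\mathbf{w}) = 0
\quad\Longrightarrow\quad
X^\top X\, \mathbf{w} = X^\top \mathbf{y} \qquad \text{(Normal Equations)}
```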

The Matrix Solution

For more complex models, we use the Design Matrix ($X$). The closed-form OLS solution is:

$$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$$

The term $(X^\top X)^{-1} X^\top$ is known as the Moore-Penrose pseudo-inverse ($X^\dagger$).

Warning

Directly computing the inverse $(X^\top X)^{-1}$ can be numerically unstable if $X^\top X$ is ill-conditioned, and computationally expensive if the matrix is very large. In these cases, Singular Value Decomposition (SVD) is often used.
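A minimal sketch of both routes in NumPy, on synthetic data invented for illustration: the closed-form solve of the Normal Equations, and `numpy.linalg.lstsq`, which uses an SVD-based solver internally.

```python
import numpy as np

# Synthetic toy data: y = 2 + 3x plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 0.5, size=50)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Closed-form OLS: solve the Normal Equations X^T X w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares (numerically more stable, as the warning notes)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_closed)                          # weights near [2, 3]
print(np.allclose(w_closed, w_lstsq))
```

On a well-conditioned problem like this, both routes agree to machine precision; the SVD route is preferred when columns of $X$ are nearly collinear.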

Basis Functions

To model non-linear patterns, we can transform the input using a basis function $\phi(x)$. The model $y = \mathbf{w}^\top \boldsymbol{\phi}(x) + \epsilon$ remains linear because the weights are applied linearly to these transformed features.

Common Basis Functions

  • Polynomial: $\phi_j(x) = x^j$ (e.g., $1, x, x^2, x^3$).
  • Gaussian (RBF): $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$.
  • Sigmoidal: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + e^{-a}}$.
  • Hyperbolic Tangent: $\phi_j(x) = \tanh\left(\frac{x - \mu_j}{s}\right)$.
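A short sketch of the idea: expand the input through Gaussian (RBF) features, then fit an ordinary linear model on those features. The data, the centres $\mu_j$, and the width $s$ are illustrative choices, not fixed by the theory.

```python
import numpy as np

def gaussian_basis(x, centres, s=1.0):
    """Gaussian (RBF) features: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))

# Non-linear target: a noisy sine curve (synthetic, for illustration)
rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 40)
y = np.sin(x) + rng.normal(0, 0.1, size=40)

# Transform the inputs, then solve a plain linear least-squares problem
centres = np.linspace(0, 2 * np.pi, 9)      # 9 evenly spaced RBF centres
Phi = np.column_stack([np.ones_like(x), gaussian_basis(x, centres)])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = Phi @ w
print(np.mean((y - y_hat) ** 2))            # small residual on the sine curve
```

Note that the fitting step is unchanged OLS; only the inputs were transformed, which is exactly why the model stays "linear".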

Optimisation via Gradient Descent

When the closed-form solution is not feasible, we use Gradient Descent.

The Intuition

Imagine standing on a hill and wanting to reach the bottom. You look at the slope at your current position and take a step in the direction that goes down the fastest.

The Process

  1. Initialize weights randomly.
  2. Calculate the in-sample error $E_{\text{in}}(\mathbf{w})$.
  3. Update weights iteratively by moving in the opposite direction of the gradient: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_{\text{in}}(\mathbf{w})$.
  4. The size of the movement is determined by the step size $\eta$ (learning rate).

Important

Choosing the correct Step Size $\eta$ is critical.

  • If the step size is too small, convergence will be slow.
  • If the step size is too large, the algorithm may diverge, jumping over the minimum and heading towards infinity.