Fundamentals of Regression
Regression is a supervised learning task where the goal is to predict a continuous numerical value (e.g., a credit line amount) based on input features. This differs from classification, which predicts discrete categories (e.g., a "good" or "bad" customer).
The Linear Assumption
Linear regression assumes a functional relationship between the predictor variables $\mathbf{x}$ and the target variable $y$. The standard model is expressed as
$$y = \mathbf{w}^\top \mathbf{x} + \varepsilon,$$
where $\mathbf{w}$ represents the learned weights and $\varepsilon$ represents the error, or noise.
Tip
Even if the relationship between $x$ and $y$ looks curved (non-linear), we still call it "linear regression" as long as the model is linear in its parameters $\mathbf{w}$.
Ordinary Least Squares (OLS)
The objective of OLS is to find the optimal weights $\mathbf{w}$ by minimising the Residual Sum of Squares (RSS).
Step-by-Step Derivation
- Define the Residual: $e_i = y_i - \mathbf{w}^\top \mathbf{x}_i$, the error for a single data point.
- Define the Cost Function: $E(\mathbf{w}) = \sum_{i=1}^{n} e_i^2$, the sum of all squared residuals.
- Minimise: Take the partial derivatives of $E(\mathbf{w})$ with respect to each weight $w_j$ and set them to zero.
- Solve the System: This leads to a set of linear equations known as the Normal Equations.
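For simple linear regression, $y = w_0 + w_1 x$, solving the normal equations by hand gives the familiar closed forms $w_1 = \operatorname{cov}(x, y) / \operatorname{var}(x)$ and $w_0 = \bar{y} - w_1 \bar{x}$. A minimal sketch (the synthetic data below is illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)  # true intercept 1, true slope 2

# Setting dE/dw0 = 0 and dE/dw1 = 0 and solving yields:
w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope = cov(x, y) / var(x)
w0 = y.mean() - w1 * x.mean()                   # intercept from the sample means
```

Both `np.cov(..., bias=True)` and `np.var` use the biased (divide-by-$n$) estimator, so the ratio is consistent; the recovered weights should land close to the true values.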
The Matrix Solution
For more complex models, we use the Design Matrix $\mathbf{X}$. The closed-form OLS solution is:
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.$$
The term $(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is known as the Moore-Penrose pseudo-inverse ($\mathbf{X}^{+}$).
Warning
Directly computing the matrix inversion $(\mathbf{X}^\top \mathbf{X})^{-1}$ can be numerically unstable or computationally expensive if the matrix is very large. In these cases, Singular Value Decomposition (SVD) is often used instead.
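The two routes can be compared on a small example: the explicit normal-equations solution versus NumPy's `lstsq`, which solves the same least-squares problem via SVD. The toy data here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix with a bias column of ones; true weights are [1, 2]
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 50)

# Closed-form solution via the normal equations: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# SVD-based solver; numerically safer when X^T X is ill-conditioned
w_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

# On this well-conditioned problem the two solutions agree
```

For a problem this small and well-conditioned both give the same answer; the SVD route matters when columns of $\mathbf{X}$ are nearly collinear.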
Basis Functions
To model non-linear patterns, we can transform the input using a basis function $\boldsymbol{\phi}(\mathbf{x})$, giving the model $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \varepsilon$. The model remains linear because the weights $\mathbf{w}$ are applied linearly to these transformed features.
Common Basis Functions
- Polynomial: $\phi_j(x) = x^j$ (e.g., $1, x, x^2, x^3, \dots$).
- Gaussian (RBF): $\phi_j(x) = \exp\left(-\dfrac{(x - \mu_j)^2}{2s^2}\right)$.
- Sigmoidal: $\phi_j(x) = \sigma\left(\dfrac{x - \mu_j}{s}\right)$, where $\sigma(a) = \dfrac{1}{1 + e^{-a}}$.
- Hyperbolic Tangent: $\phi_j(x) = \tanh\left(\dfrac{x - \mu_j}{s}\right)$.
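A short sketch of the idea: build a feature matrix $\boldsymbol{\Phi}$ from a basis expansion, then fit an ordinary linear model on it. The sine-wave target and degree-3 polynomial below are illustrative choices:

```python
import numpy as np

def polynomial_basis(x, degree):
    # phi_j(x) = x^j for j = 0..degree (column j is x raised to power j)
    return np.column_stack([x**j for j in range(degree + 1)])

def gaussian_basis(x, centres, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per centre mu_j
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * s**2))

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)  # clearly non-linear target

# The model is still linear in w, even though Phi is non-linear in x
Phi = polynomial_basis(x, degree=3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Swapping `polynomial_basis` for `gaussian_basis` changes only the features; the fitting step is identical, which is the point of keeping the model linear in $\mathbf{w}$.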
Optimisation via Gradient Descent
When the closed-form solution is not feasible, we use Gradient Descent.
The Intuition
Imagine standing on a hill and wanting to reach the bottom. You look at the slope at your current position and take a step in the direction that goes down the fastest.
The Process
- Initialise the weights randomly.
- Calculate the in-sample error $E_{\text{in}}(\mathbf{w})$.
- Update the weights iteratively by moving in the opposite direction of the gradient: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_{\text{in}}(\mathbf{w})$.
- The size of each movement is determined by the step size $\eta$ (learning rate).
Important
Choosing the correct step size $\eta$ is critical.
- If the step size is too small, convergence will be slow.
- If the step size is too large, the algorithm may diverge, jumping over the minimum and heading towards infinity.
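The loop above can be sketched as batch gradient descent on the mean squared error. The synthetic data, zero initialisation, and $\eta = 0.1$ are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([0.5, -1.5]) + rng.normal(0, 0.1, 100)  # true weights [0.5, -1.5]
n = len(y)

def gradient_descent(X, y, eta, n_iters=2000):
    w = np.zeros(X.shape[1])            # initialise (zeros here, for reproducibility)
    for _ in range(n_iters):
        residual = X @ w - y            # in-sample error term
        grad = (2 / n) * X.T @ residual  # gradient of the mean squared error
        w -= eta * grad                 # step against the gradient, scaled by eta
    return w

w = gradient_descent(X, y, eta=0.1)
```

With a small enough $\eta$ this converges to the same weights as the closed-form OLS solution; pushing $\eta$ well past the stability threshold makes the iterates grow without bound, which is the divergence described above.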