Fundamentals of Regression
Regression is a supervised learning task where the goal is to predict a continuous numerical value (e.g., a credit line amount) based on input features. This differs from classification, which predicts discrete categories (e.g., a "good" or "bad" customer).
The Linear Assumption
Linear regression assumes a functional relationship between the predictor variables $\mathbf{x}$ and the target variable $y$. The standard model is expressed as
$$y = \mathbf{w}^\top \mathbf{x} + \varepsilon,$$
where $\mathbf{w}$ represents the learned weights and $\varepsilon$ represents the error, or noise.
Tip
Even if the relationship between $x$ and $y$ looks curved (non-linear), we still call it "linear regression" as long as the model is linear in its parameters $\mathbf{w}$.
Ordinary Least Squares (OLS)
The objective of OLS is to find the optimal weights $\mathbf{w}$ by minimising the Residual Sum of Squares (RSS).
Step-by-Step Derivation
- Define the Residual: $e_i = y_i - \mathbf{w}^\top \mathbf{x}_i$, the error for a single data point.
- Define the Cost Function: $E(\mathbf{w}) = \sum_{i=1}^{n} e_i^2$, the sum of all squared residuals.
- Minimise: Take the partial derivatives of $E(\mathbf{w})$ with respect to each weight $w_j$ and set them to zero.
- Solve the System: This leads to a set of linear equations known as the Normal Equations.
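For simple linear regression, $y = w_0 + w_1 x$, solving the normal equations by hand gives the familiar closed forms $w_1 = \operatorname{cov}(x, y) / \operatorname{var}(x)$ and $w_0 = \bar{y} - w_1 \bar{x}$. A minimal sketch (the synthetic data below is illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)  # true intercept 1, true slope 2

# Setting dE/dw0 = 0 and dE/dw1 = 0 and solving yields:
w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope = cov(x, y) / var(x)
w0 = y.mean() - w1 * x.mean()                   # intercept from the sample means
```

Both `np.cov(..., bias=True)` and `np.var` use the biased (divide-by-$n$) estimator, so the ratio is consistent; the recovered weights should land close to the true values.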
The Matrix Solution
For more complex models, we use the Design Matrix $\mathbf{X}$. The closed-form OLS solution is:
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.$$
The term $(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is known as the Moore-Penrose pseudo-inverse ($\mathbf{X}^{+}$).
Warning
Directly computing the matrix inversion $(\mathbf{X}^\top \mathbf{X})^{-1}$ can be numerically unstable or computationally expensive if the matrix is very large. In these cases, Singular Value Decomposition (SVD) is often used instead.
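The two routes can be compared on a small example: the explicit normal-equations solution versus NumPy's `lstsq`, which solves the same least-squares problem via SVD. The toy data here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix with a bias column of ones; true weights are [1, 2]
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 50)

# Closed-form solution via the normal equations: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# SVD-based solver; numerically safer when X^T X is ill-conditioned
w_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

# On this well-conditioned problem the two solutions agree
```

For a problem this small and well-conditioned both give the same answer; the SVD route matters when columns of $\mathbf{X}$ are nearly collinear.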
Basis Functions
To model non-linear patterns, we can transform the input using a basis function $\boldsymbol{\phi}(\mathbf{x})$, giving the model $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \varepsilon$. The model remains linear because the weights $\mathbf{w}$ are applied linearly to these transformed features.
Common Basis Functions
- Polynomial: $\phi_j(x) = x^j$ (e.g., $1, x, x^2, x^3, \dots$).
- Gaussian (RBF): $\phi_j(x) = \exp\left(-\dfrac{(x - \mu_j)^2}{2s^2}\right)$.
- Sigmoidal: $\phi_j(x) = \sigma\left(\dfrac{x - \mu_j}{s}\right)$, where $\sigma(a) = \dfrac{1}{1 + e^{-a}}$.
- Hyperbolic Tangent: $\phi_j(x) = \tanh\left(\dfrac{x - \mu_j}{s}\right)$.
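A short sketch of the idea: build a feature matrix $\boldsymbol{\Phi}$ from a basis expansion, then fit an ordinary linear model on it. The sine-wave target and degree-3 polynomial below are illustrative choices:

```python
import numpy as np

def polynomial_basis(x, degree):
    # phi_j(x) = x^j for j = 0..degree (column j is x raised to power j)
    return np.column_stack([x**j for j in range(degree + 1)])

def gaussian_basis(x, centres, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per centre mu_j
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * s**2))

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)  # clearly non-linear target

# The model is still linear in w, even though Phi is non-linear in x
Phi = polynomial_basis(x, degree=3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Swapping `polynomial_basis` for `gaussian_basis` changes only the features; the fitting step is identical, which is the point of keeping the model linear in $\mathbf{w}$.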
Optimisation via Gradient Descent
When the closed-form solution is not feasible, we use Gradient Descent.
The Intuition
Imagine standing on a hill and wanting to reach the bottom. You look at the slope at your current position and take a step in the direction that goes down the fastest.
The Process
- Initialise the weights randomly.
- Calculate the in-sample error $E_{\text{in}}(\mathbf{w})$.
- Update the weights iteratively by moving in the opposite direction of the gradient: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_{\text{in}}(\mathbf{w})$.
- The size of each movement is determined by the step size $\eta$ (learning rate).
Important
Choosing the correct step size $\eta$ is critical.
- If the step size is too small, convergence will be slow.
- If the step size is too large, the algorithm may diverge, jumping over the minimum and heading towards infinity.
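The loop above can be sketched as batch gradient descent on the mean squared error. The synthetic data, zero initialisation, and $\eta = 0.1$ are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([0.5, -1.5]) + rng.normal(0, 0.1, 100)  # true weights [0.5, -1.5]
n = len(y)

def gradient_descent(X, y, eta, n_iters=2000):
    w = np.zeros(X.shape[1])            # initialise (zeros here, for reproducibility)
    for _ in range(n_iters):
        residual = X @ w - y            # in-sample error term
        grad = (2 / n) * X.T @ residual  # gradient of the mean squared error
        w -= eta * grad                 # step against the gradient, scaled by eta
    return w

w = gradient_descent(X, y, eta=0.1)
```

With a small enough $\eta$ this converges to the same weights as the closed-form OLS solution; pushing $\eta$ well past the stability threshold makes the iterates grow without bound, which is the divergence described above.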