The fundamental goal of machine learning is to find a set of parameters that minimises a loss function. Gradient descent is an optimisation algorithm that finds the minimum (the “lowest point in a hilly landscape”) of this loss function by iteratively updating the parameters.
Algorithm
The process follows an iterative loop that approaches the optimal estimate:
- Initialisation: Start with an initial estimate for the parameters (e.g., θ_0 = 0 or a small random value).
- Computation: At each step t, compute the gradient ∇L(θ_t), which is the vector of partial derivatives of the loss function L with respect to the parameters θ.
- Update Rule: Move in the opposite direction of the gradient to descend the loss curve: θ_{t+1} = θ_t − η ∇L(θ_t).
- The gradient points uphill; therefore, subtracting it moves the estimate downhill.
- η (Learning Rate): A scalar that determines the size of the step taken in each iteration.
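The loop above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the 1-D loss L(θ) = (θ − 3)², its hand-derived gradient, and the learning rate value are all illustrative choices, not from the text.

```python
def grad(theta):
    # dL/dtheta for the illustrative loss L(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta = 0.0   # initialisation: an arbitrary starting estimate
eta = 0.1     # learning rate

for step in range(100):
    theta = theta - eta * grad(theta)  # move opposite the gradient (downhill)

print(theta)  # converges towards the minimiser theta = 3
```

Each iteration subtracts the scaled gradient, so the estimate slides towards the bottom of the quadratic “valley”.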
Loss Function and Optimisation
Different optimisation problems use different loss functions, which impact how the gradient is calculated:
- Squared Error (Quadratic): Minimising this is equivalent to finding the Maximum Likelihood estimate if measurements follow a normal distribution.
- Absolute Error: The derivative is +1 or −1 (and undefined at 0), meaning steps have a constant magnitude regardless of how far the estimate is from the minimum.
- Binary Cross Entropy: Used for classification tasks. It corresponds to Maximum Likelihood for logistic regression models.
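The contrast between the first two losses is easy to see numerically. A small sketch (function names are illustrative) comparing the gradient of the squared and absolute error for a single residual r = estimate − target:

```python
def grad_squared(r):
    # derivative of r**2: shrinks as the estimate nears the target
    return 2.0 * r

def grad_absolute(r):
    # derivative of |r|: constant magnitude (undefined at r = 0)
    return 1.0 if r > 0 else -1.0

print(grad_squared(5.0), grad_squared(0.1))    # 10.0 0.2
print(grad_absolute(5.0), grad_absolute(0.1))  # 1.0 1.0
```

With squared error the steps naturally shrink near the minimum; with absolute error they stay the same size, which is why the learning rate often has to be decayed to converge.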
The Role of Learning Rate
Selecting a suitable learning rate η is critical for good convergence:
- Too Small: Results in very slow convergence, requiring many iterations to reach the minimum.
- Too Large: Can cause oscillations, where the estimate bounces back and forth across the minimum, or divergence, where the loss increases or produces “NaNs”.
- Strategy: Often found via trial and error or by reducing the rate over time as the algorithm approaches the optimum.
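All three regimes can be reproduced on a toy quadratic. This is an illustrative sketch (loss, step counts, and rates are assumptions, not from the text), running L(θ) = θ² with a too-small, a reasonable, and a too-large learning rate:

```python
def run(eta, steps=50, theta=1.0):
    for _ in range(steps):
        theta -= eta * 2.0 * theta   # gradient of theta**2 is 2*theta
    return theta

print(abs(run(0.001)))  # too small: still far from 0 after 50 steps
print(abs(run(0.1)))    # converges close to 0
print(abs(run(1.5)))    # too large: the estimate oscillates and diverges
```

Each step multiplies θ by (1 − 2η), so convergence requires |1 − 2η| < 1; at η = 1.5 the factor is −2 and the iterate explodes.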
Gradient Descent Variants
To balance computational efficiency and accuracy, several variants exist:
- Vanilla Gradient Descent (GD): Sums the gradient over all training data in every step. It is computationally expensive for large datasets but computes the exact gradient.
- Stochastic Gradient Descent (SGD): Uses a single random example to compute the gradient at each step. It is much faster and can “jump out” of local minima, though it is noisier.
- Mini-Batch Gradient Descent: A compromise that uses a random subset (batch) of the training data.
- Momentum: Speeds up convergence and helps avoid local optima by keeping a portion of the previous step’s direction: v_t = β v_{t−1} + ∇L(θ_t), then θ_{t+1} = θ_t − η v_t.
- ADAM (Adaptive Moment Estimation): Combines momentum with automatically adjusted learning rates for each individual parameter. It is currently the de facto standard for training neural networks.
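Mini-batch SGD and momentum combine naturally. A hedged sketch on 1-D linear regression y = w·x (the dataset, batch size, learning rate η, and momentum factor β are all illustrative assumptions):

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 101)]   # noiseless data, true weight w = 2

w, v = 0.0, 0.0                 # parameter and momentum ("velocity") term
eta, beta, batch = 1e-4, 0.9, 10

for _ in range(200):
    sample = random.sample(data, batch)                        # random mini-batch
    g = sum(2.0 * (w * x - y) * x for x, y in sample) / batch  # batch gradient
    v = beta * v + g             # keep a portion of the previous direction
    w = w - eta * v              # momentum update

print(w)  # approaches the true weight 2.0
```

Using a batch of 10 instead of all 100 points makes each step ten times cheaper, while the momentum term smooths out the batch-to-batch noise in the gradient.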
Limitations and Challenges
- Stopping Criteria: The algorithm typically stops when there is no significant improvement in the loss.
- Local Minima: There is no guarantee of finding the global best solution; the algorithm may get trapped in “valleys” (local minima) that are not the absolute minimum.
- Computation: Standard GD is very expensive for large-scale data because each update requires a full pass through the dataset.
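The usual stopping criterion from the list above can be made concrete. A minimal sketch, assuming an illustrative quadratic loss and an arbitrary tolerance `tol`: the loop halts once an iteration no longer improves the loss by at least `tol`.

```python
def loss(theta):
    return (theta - 3.0) ** 2   # illustrative loss

theta, eta, tol = 0.0, 0.1, 1e-8
prev = loss(theta)
steps = 0
while True:
    theta -= eta * 2.0 * (theta - 3.0)   # gradient step
    steps += 1
    cur = loss(theta)
    if prev - cur < tol:                 # no significant improvement: stop
        break
    prev = cur

print(steps, theta)  # stops after a modest number of steps, near theta = 3
```

Note that this criterion only detects that progress has stalled; as the section says, it cannot tell a local minimum from the global one.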