The fundamental goal of machine learning is to find a set of parameters that minimises a loss function. Gradient descent is an optimisation algorithm that finds the minimum (the “lowest point in a hilly landscape”) of this loss function by iteratively updating the parameters.
Algorithm
The process follows an iterative loop that approaches the optimal estimate:
- Initialisation: Start with an initial estimate for the parameters (e.g., θ_0 = 0 or a small random value).
- Computation: At each step t, compute the gradient ∇L(θ_t), which is the vector of partial derivatives of the loss function L with respect to the parameters θ.
- Update Rule: Move in the opposite direction of the gradient to descend the loss curve: θ_{t+1} = θ_t − η ∇L(θ_t).
- The gradient points uphill; therefore, subtracting it moves the estimate downhill.
- η (Learning Rate): A scalar that determines the size of the step taken in each iteration.
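The loop above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the 1-D loss L(θ) = (θ − 3)², its hand-derived gradient, and the learning rate value are all illustrative choices, not from the text.

```python
def grad(theta):
    # dL/dtheta for the illustrative loss L(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta = 0.0   # initialisation: an arbitrary starting estimate
eta = 0.1     # learning rate

for step in range(100):
    theta = theta - eta * grad(theta)  # move opposite the gradient (downhill)

print(theta)  # converges towards the minimiser theta = 3
```

Each iteration subtracts the scaled gradient, so the estimate slides towards the bottom of the quadratic “valley”.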
Loss Function and Optimisation
Different optimisation problems use different loss functions, which impact how the gradient is calculated:
- Squared Error (Quadratic): Minimising this is equivalent to finding the Maximum Likelihood estimate if measurements follow a normal distribution.
- Absolute Error: The derivative is +1 or −1 (and undefined at 0), meaning steps have a constant magnitude regardless of how far the estimate is from the minimum.
- Binary Cross Entropy: Used for classification tasks. It corresponds to Maximum Likelihood for logistic regression models.
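The contrast between the first two losses is easy to see numerically. A small sketch (function names are illustrative) comparing the gradient of the squared and absolute error for a single residual r = estimate − target:

```python
def grad_squared(r):
    # derivative of r**2: shrinks as the estimate nears the target
    return 2.0 * r

def grad_absolute(r):
    # derivative of |r|: constant magnitude (undefined at r = 0)
    return 1.0 if r > 0 else -1.0

print(grad_squared(5.0), grad_squared(0.1))    # 10.0 0.2
print(grad_absolute(5.0), grad_absolute(0.1))  # 1.0 1.0
```

With squared error the steps naturally shrink near the minimum; with absolute error they stay the same size, which is why the learning rate often has to be decayed to converge.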
The Role of Learning Rate
Selecting a suitable learning rate η is critical for good convergence:
- Too Small: Results in very slow convergence, requiring many iterations to reach the minimum.
- Too Large: Can cause oscillations, where the estimate bounces back and forth across the minimum, or divergence, where the loss increases or produces “NaNs”.
- Strategy: Often found via trial and error or by reducing the rate over time as the algorithm approaches the optimum.
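All three regimes can be reproduced on a toy quadratic. This is an illustrative sketch (loss, step counts, and rates are assumptions, not from the text), running L(θ) = θ² with a too-small, a reasonable, and a too-large learning rate:

```python
def run(eta, steps=50, theta=1.0):
    for _ in range(steps):
        theta -= eta * 2.0 * theta   # gradient of theta**2 is 2*theta
    return theta

print(abs(run(0.001)))  # too small: still far from 0 after 50 steps
print(abs(run(0.1)))    # converges close to 0
print(abs(run(1.5)))    # too large: the estimate oscillates and diverges
```

Each step multiplies θ by (1 − 2η), so convergence requires |1 − 2η| < 1; at η = 1.5 the factor is −2 and the iterate explodes.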
Gradient Descent Variants
To balance computational efficiency and accuracy, several variants exist:
- Vanilla Gradient Descent (GD): Sums the gradient over all training data in every step. It is computationally expensive for large datasets but computes the exact gradient.
- Stochastic Gradient Descent (SGD): Uses a single random example to compute the gradient at each step. It is much faster and can “jump out” of local minima, though it is noisier.
- Mini-Batch Gradient Descent: A compromise that uses a random subset (batch) of the training data.
- Momentum: Speeds up convergence and helps avoid local optima by keeping a portion of the previous step’s direction: v_t = β v_{t−1} + ∇L(θ_t), then θ_{t+1} = θ_t − η v_t.
- ADAM (Adaptive Moment Estimation): Combines momentum with automatically adjusted learning rates for each individual parameter. It is currently the de facto standard for training neural networks.
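Mini-batch SGD and momentum combine naturally. A hedged sketch on 1-D linear regression y = w·x (the dataset, batch size, learning rate η, and momentum factor β are all illustrative assumptions):

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 101)]   # noiseless data, true weight w = 2

w, v = 0.0, 0.0                 # parameter and momentum ("velocity") term
eta, beta, batch = 1e-4, 0.9, 10

for _ in range(200):
    sample = random.sample(data, batch)                        # random mini-batch
    g = sum(2.0 * (w * x - y) * x for x, y in sample) / batch  # batch gradient
    v = beta * v + g             # keep a portion of the previous direction
    w = w - eta * v              # momentum update

print(w)  # approaches the true weight 2.0
```

Using a batch of 10 instead of all 100 points makes each step ten times cheaper, while the momentum term smooths out the batch-to-batch noise in the gradient.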
Limitations and Challenges
- Stopping Criteria: The algorithm typically stops when there is no significant improvement in the loss.
- Local Minima: There is no guarantee of finding the global best solution; the algorithm may get trapped in “valleys” (local minima) that are not the absolute minimum.
- Computation: Standard GD is very expensive for large-scale data because each update requires a full pass through the dataset.
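The usual stopping criterion from the list above can be made concrete. A minimal sketch, assuming an illustrative quadratic loss and an arbitrary tolerance `tol`: the loop halts once an iteration no longer improves the loss by at least `tol`.

```python
def loss(theta):
    return (theta - 3.0) ** 2   # illustrative loss

theta, eta, tol = 0.0, 0.1, 1e-8
prev = loss(theta)
steps = 0
while True:
    theta -= eta * 2.0 * (theta - 3.0)   # gradient step
    steps += 1
    cur = loss(theta)
    if prev - cur < tol:                 # no significant improvement: stop
        break
    prev = cur

print(steps, theta)  # stops after a modest number of steps, near theta = 3
```

Note that this criterion only detects that progress has stalled; as the section says, it cannot tell a local minimum from the global one.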