Maximum Likelihood Estimation (MLE)
The core objective in learning a Logistic Regression model is finding the weight vector $w$ that best fits the training data.
Likelihood Function
The likelihood function represents how likely we are to observe the target outputs given our inputs and chosen weights. Assuming examples are independent and identically distributed (i.i.d.), the joint conditional likelihood is the product of the individual probabilities:

$$L(w) = \prod_{i=1}^{N} p(y_i \mid x_i; w)$$
Log-Likelihood and Loss
Directly maximising the product in the likelihood function can lead to numerical instability, because a product of many small probabilities underflows. To avoid this, we maximise the Log-Likelihood instead. Because the logarithm is a monotonically increasing function, the weights that maximise the log-likelihood also maximise the original likelihood.
In machine learning, we typically frame this as a minimisation problem by defining a Loss Function as the negative log-likelihood:

$$\mathcal{L}(w) = -\log L(w)$$
Tip
Key Insight: Maximising likelihood is mathematically identical to minimising Cross-Entropy Loss.
Cross-Entropy Loss for Logistic Regression
For binary classification, the loss function can be expanded into the Cross-Entropy formula:

$$\mathcal{L}(w) = -\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
Cross-entropy measures how well your model’s predicted probability distribution matches the true labels, where $\hat{y}_i = \sigma(w^\top x_i)$ is the predicted probability for class 1.
- Intuition: If the true class is $y_i = 1$, the second term disappears, and we aim to maximise $\log \hat{y}_i$, which means pushing $\hat{y}_i$ towards $1$.
- Penalty: The loss heavily penalises “confident” incorrect predictions (e.g., predicting $\hat{y}_i = 0.99$ when the true label is $0$).
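A minimal pure-Python sketch of binary cross-entropy (the clipping constant `eps` is an assumption added to avoid `log(0)`):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A confident correct prediction gives a tiny loss;
# a confident wrong one is punished hard.
small = binary_cross_entropy([1], [0.99])  # -log(0.99), about 0.01
large = binary_cross_entropy([1], [0.01])  # -log(0.01), about 4.6
```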
Optimisation via Gradient Descent
Gradient Descent updates the weights iteratively to find the minimum of $\mathcal{L}(w)$.
The Update Rule
The weight update is performed by moving in the direction of the negative gradient:

$$w \leftarrow w - \eta \, \nabla_w \mathcal{L}(w)$$
- $w$: The weight vector.
- $\eta$: The learning rate, a hyperparameter controlling step size.
- $\nabla_w \mathcal{L}$: The gradient, a vector of partial derivatives representing the rate of change of the loss with respect to each weight.
Gradient for Logistic Regression
For Cross-Entropy Loss, the gradient is calculated as:

$$\nabla_w \mathcal{L} = \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i = X^\top (\hat{y} - y)$$
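This gradient can be sketched directly in pure Python (variable names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, X, y):
    """Cross-entropy gradient: sum_i (sigmoid(w . x_i) - y_i) * x_i."""
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        pred = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j, xj in enumerate(xi):
            g[j] += (pred - yi) * xj
    return g
```

At `w = [0, 0]` every prediction is 0.5, so the gradient reduces to `sum_i (0.5 - y_i) * x_i`, which makes small sanity checks easy.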
Limitations of Gradient Descent
- Learning Rate Sensitivity: If $\eta$ is too large, the algorithm may overshoot the optimum; if too small, convergence is slow.
- Differential Curvature: If different features have vastly different scales, the loss surface may be elongated (like an elliptical bowl), causing GD to oscillate and progress slowly.
- Local Minima: While GD can get stuck in local minima for complex functions, this is not an issue for Logistic Regression with Cross-Entropy loss because the function is convex.
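Putting the update rule into a loop, a minimal gradient-descent sketch on a hypothetical toy dataset (the first column is a bias feature; `eta` and the step count are arbitrary choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset: bias feature plus one input; positive inputs belong to class 1.
X = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, 0, 0]

def gradient(w):
    g = [0.0, 0.0]
    for xi, yi in zip(X, y):
        pred = sigmoid(w[0] * xi[0] + w[1] * xi[1])
        g[0] += (pred - yi) * xi[0]
        g[1] += (pred - yi) * xi[1]
    return g

def train(eta=0.5, steps=200):
    w = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        w = [w[0] - eta * g[0], w[1] - eta * g[1]]
    return w

w = train()
```

Note that on perfectly separable data like this, the unregularised maximum-likelihood weights grow without bound, so in practice a step limit or a regulariser is needed.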
Newton-Raphson and IRLS
The Newton-Raphson method (often called Iterative Reweighted Least Squares, or IRLS, in this context) improves upon GD by using second-order derivative information.
Intuition: Using Curvature
While GD only knows the “slope”, Newton-Raphson also looks at the curvature (how the slope is changing). This allows it to take large steps where the gradient changes slowly and smaller, more cautious steps where the gradient changes rapidly.
Using a Taylor Polynomial for a Local Approximation of $\mathcal{L}$
The Taylor polynomial of degree $n$ can be used to approximate a function $f$ around a point $a$:

$$f(x) \approx \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!} (x - a)^k$$
Univariate Case
Newton-Raphson is derived from a second-degree Taylor Polynomial approximation of the loss function around the current point $w_t$.
- Approximate: $\mathcal{L}(w) \approx \mathcal{L}(w_t) + \mathcal{L}'(w_t)(w - w_t) + \frac{1}{2} \mathcal{L}''(w_t)(w - w_t)^2$
- Minimise: To find the minimum of this quadratic approximation, set its derivative with respect to $w$ to zero and solve, giving the update $w_{t+1} = w_t - \frac{\mathcal{L}'(w_t)}{\mathcal{L}''(w_t)}$.
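As a sanity check of the univariate rule, a tiny sketch: for a purely quadratic function, a single Newton step lands exactly on the minimum, whatever the starting point.

```python
def newton_step(x, f_prime, f_double_prime):
    """One Newton-Raphson update: x - f'(x) / f''(x)."""
    return x - f_prime(x) / f_double_prime(x)

# f(x) = (x - 3)^2 has f'(x) = 2(x - 3) and f''(x) = 2.
# One step from any start lands on the minimum at x = 3.
x_min = newton_step(10.0, lambda x: 2 * (x - 3), lambda x: 2.0)  # -> 3.0
```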
Multivariate Case
Second Order Partial Derivatives and The Hessian
In the multivariate case, the second derivative becomes the Hessian matrix $H$ with entries $H_{jk} = \frac{\partial^2 \mathcal{L}}{\partial w_j \, \partial w_k}$. Applying the univariate intuition, the update becomes:

$$w_{t+1} = w_t - H^{-1} \nabla_w \mathcal{L}$$
Application to Logistic Regression
For Logistic Regression, the Hessian is calculated as:

$$H = X^\top R X, \qquad R_{ii} = \hat{y}_i (1 - \hat{y}_i)$$

where $R$ is a diagonal matrix of prediction variances.
Note
Performance: If the loss function were perfectly quadratic, Newton-Raphson would reach the optimum in a single step. Since Cross-Entropy is not quadratic, we must apply the rule iteratively.
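A pure-Python sketch of Newton-Raphson for a two-weight logistic model (the toy dataset is hypothetical and deliberately non-separable so the optimum is finite; the 2×2 system is solved with an explicit inverse for clarity):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Non-separable toy data (bias feature plus one input) so the MLE is finite.
X = [[1.0, 2.0], [1.0, 1.0], [1.0, 0.5], [1.0, -0.5], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, 0, 1, 0, 0]

def newton_update(w):
    # Gradient X^T (p - y) and Hessian X^T R X with R_ii = p_i * (1 - p_i).
    g = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] * xi[0] + w[1] * xi[1])
        r = p * (1 - p)
        for j in range(2):
            g[j] += (p - yi) * xi[j]
            for k in range(2):
                H[j][k] += r * xi[j] * xi[k]
    # Solve H d = g via the explicit 2x2 inverse, then step w <- w - d.
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    d0 = (H[1][1] * g[0] - H[0][1] * g[1]) / det
    d1 = (H[0][0] * g[1] - H[1][0] * g[0]) / det
    return [w[0] - d0, w[1] - d1]

w = [0.0, 0.0]
for _ in range(8):
    w = newton_update(w)
```

After a handful of iterations the update becomes a fixed point: applying `newton_update` again barely moves the weights, signalling convergence.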
Step-by-Step Optimisation Process
- Initialise: Set weights to zeros or small random values.
- Predict: Calculate the probabilities $\hat{y}_i = \sigma(w^\top x_i)$ for all training examples using the current weights.
- Compute Gradient: Calculate $\nabla_w \mathcal{L}$ using the difference between predictions and actual targets.
- Update Weights:
- For Gradient Descent: subtract $\eta \, \nabla_w \mathcal{L}$.
- For Newton-Raphson: calculate the Hessian $H$, invert it, and subtract $H^{-1} \nabla_w \mathcal{L}$.
- Repeat: Continue until the gradient is near zero or a maximum number of iterations is reached.
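The process above can be run with either optimiser and compared. A sketch with a single weight, so the Hessian is a scalar (data, learning rate, and tolerance are illustrative choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One-weight problem on hypothetical non-separable data.
X = [2.0, 1.0, 0.5, -0.5, -1.0, -2.0]
y = [1, 1, 0, 1, 0, 0]

def grad_and_hess(w):
    g = h = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(w * xi)
        g += (p - yi) * xi          # gradient term
        h += p * (1 - p) * xi * xi  # curvature (scalar Hessian) term
    return g, h

def steps_to_converge(update, tol=1e-8, max_iter=10_000):
    w, steps = 0.0, 0
    g, h = grad_and_hess(w)
    while abs(g) > tol and steps < max_iter:
        w = update(w, g, h)
        g, h = grad_and_hess(w)
        steps += 1
    return steps

gd_steps = steps_to_converge(lambda w, g, h: w - 0.1 * g)    # gradient descent
newton_steps = steps_to_converge(lambda w, g, h: w - g / h)  # Newton-Raphson
```

On this toy problem, Newton-Raphson converges in a handful of iterations, while gradient descent needs far more, illustrating the payoff of using curvature information.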