Maximum Likelihood Estimation (MLE)
The core objective in learning a Logistic Regression model is finding the weight vector $w$ that best fits the training data.
Likelihood Function
The likelihood function represents how likely we are to observe the target outputs given our inputs and chosen weights. Assuming examples are independent and identically distributed (i.i.d.), the joint conditional likelihood is the product of the individual probabilities:

$$L(w) = \prod_{i=1}^{N} p(y_i \mid x_i; w)$$
Log-Likelihood and Loss
Directly maximising the product in the likelihood function can lead to numerical instability, because a product of many small probabilities underflows. To avoid this, we maximise the Log-Likelihood instead. Because the logarithm is a monotonically increasing function, the weights that maximise the log-likelihood also maximise the original likelihood.
In machine learning, we typically frame this as a minimisation problem by defining a Loss Function as the negative log-likelihood:

$$\mathcal{L}(w) = -\log L(w)$$
Tip
Key Insight: Maximising likelihood is mathematically identical to minimising Cross-Entropy Loss.
Cross-Entropy Loss for Logistic Regression
For binary classification, the loss function can be expanded into the Cross-Entropy formula:

$$\mathcal{L}(w) = -\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
Cross-entropy measures how well your model’s predicted probability distribution matches the true labels, where $\hat{y}_i = \sigma(w^\top x_i)$ is the predicted probability for class 1.
- Intuition: If the true class is $y_i = 1$, the second term disappears, and we aim to maximise $\log \hat{y}_i$, which means pushing $\hat{y}_i$ towards $1$.
- Penalty: The loss heavily penalises “confident” incorrect predictions (e.g., predicting $\hat{y}_i = 0.99$ when the true label is $0$).
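A minimal pure-Python sketch of binary cross-entropy (the clipping constant `eps` is an assumption added to avoid `log(0)`):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A confident correct prediction gives a tiny loss;
# a confident wrong one is punished hard.
small = binary_cross_entropy([1], [0.99])  # -log(0.99), about 0.01
large = binary_cross_entropy([1], [0.01])  # -log(0.01), about 4.6
```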
Optimisation via Gradient Descent
Gradient Descent updates the weights iteratively to find the minimum of $\mathcal{L}(w)$.
The Update Rule
The weight update is performed by moving in the direction of the negative gradient:

$$w \leftarrow w - \eta \, \nabla_w \mathcal{L}(w)$$
- $w$: The weight vector.
- $\eta$: The learning rate, a hyperparameter controlling step size.
- $\nabla_w \mathcal{L}$: The gradient, a vector of partial derivatives representing the rate of change of the loss with respect to each weight.
Gradient for Logistic Regression
For Cross-Entropy Loss, the gradient is calculated as:

$$\nabla_w \mathcal{L} = \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i = X^\top (\hat{y} - y)$$
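This gradient can be sketched directly in pure Python (variable names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, X, y):
    """Cross-entropy gradient: sum_i (sigmoid(w . x_i) - y_i) * x_i."""
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        pred = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j, xj in enumerate(xi):
            g[j] += (pred - yi) * xj
    return g
```

At `w = [0, 0]` every prediction is 0.5, so the gradient reduces to `sum_i (0.5 - y_i) * x_i`, which makes small sanity checks easy.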
Limitations of Gradient Descent
- Learning Rate Sensitivity: If $\eta$ is too large, the algorithm may overshoot the optimum; if too small, convergence is slow.
- Differential Curvature: If different features have vastly different scales, the loss surface may be elongated (like an elliptical bowl), causing GD to oscillate and progress slowly.
- Local Minima: While GD can get stuck in local minima for complex functions, this is not an issue for Logistic Regression with Cross-Entropy loss because the function is convex.
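Putting the update rule into a loop, a minimal gradient-descent sketch on a hypothetical toy dataset (the first column is a bias feature; `eta` and the step count are arbitrary choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset: bias feature plus one input; positive inputs belong to class 1.
X = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, 0, 0]

def gradient(w):
    g = [0.0, 0.0]
    for xi, yi in zip(X, y):
        pred = sigmoid(w[0] * xi[0] + w[1] * xi[1])
        g[0] += (pred - yi) * xi[0]
        g[1] += (pred - yi) * xi[1]
    return g

def train(eta=0.5, steps=200):
    w = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        w = [w[0] - eta * g[0], w[1] - eta * g[1]]
    return w

w = train()
```

Note that on perfectly separable data like this, the unregularised maximum-likelihood weights grow without bound, so in practice a step limit or a regulariser is needed.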
Newton-Raphson and IRLS
The Newton-Raphson method (often called Iterative Reweighted Least Squares, or IRLS, in this context) improves upon GD by using second-order derivative information.
Intuition: Using Curvature
While GD only knows the “slope”, Newton-Raphson also looks at the curvature (how the slope is changing). This allows it to take large steps where the gradient changes slowly and smaller, more cautious steps where the gradient changes rapidly.
Using a Taylor Polynomial for a Local Approximation of $\mathcal{L}$
The Taylor polynomial of degree $n$ can be used to approximate a function $f$ around a point $a$:

$$f(x) \approx \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!} (x - a)^k$$
Univariate Case
Newton-Raphson is derived from a second-degree Taylor Polynomial approximation of the loss function around the current point $w_t$.
- Approximate: $\mathcal{L}(w) \approx \mathcal{L}(w_t) + \mathcal{L}'(w_t)(w - w_t) + \frac{1}{2} \mathcal{L}''(w_t)(w - w_t)^2$
- Minimise: To find the minimum of this quadratic approximation, set its derivative with respect to $w$ to zero and solve, giving the update $w_{t+1} = w_t - \frac{\mathcal{L}'(w_t)}{\mathcal{L}''(w_t)}$.
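As a sanity check of the univariate rule, a tiny sketch: for a purely quadratic function, a single Newton step lands exactly on the minimum, whatever the starting point.

```python
def newton_step(x, f_prime, f_double_prime):
    """One Newton-Raphson update: x - f'(x) / f''(x)."""
    return x - f_prime(x) / f_double_prime(x)

# f(x) = (x - 3)^2 has f'(x) = 2(x - 3) and f''(x) = 2.
# One step from any start lands on the minimum at x = 3.
x_min = newton_step(10.0, lambda x: 2 * (x - 3), lambda x: 2.0)  # -> 3.0
```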
Multivariate Case
Second Order Partial Derivatives and The Hessian
In the multivariate case, the second derivative becomes the Hessian matrix $H$ with entries $H_{jk} = \frac{\partial^2 \mathcal{L}}{\partial w_j \, \partial w_k}$. Applying the univariate intuition, the update becomes:

$$w_{t+1} = w_t - H^{-1} \nabla_w \mathcal{L}$$
Application to Logistic Regression
For Logistic Regression, the Hessian is calculated as:

$$H = X^\top R X, \qquad R_{ii} = \hat{y}_i (1 - \hat{y}_i)$$

where $R$ is a diagonal matrix of prediction variances.
Note
Performance: If the loss function were perfectly quadratic, Newton-Raphson would reach the optimum in a single step. Since Cross-Entropy is not quadratic, we must apply the rule iteratively.
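A pure-Python sketch of Newton-Raphson for a two-weight logistic model (the toy dataset is hypothetical and deliberately non-separable so the optimum is finite; the 2×2 system is solved with an explicit inverse for clarity):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Non-separable toy data (bias feature plus one input) so the MLE is finite.
X = [[1.0, 2.0], [1.0, 1.0], [1.0, 0.5], [1.0, -0.5], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, 0, 1, 0, 0]

def newton_update(w):
    # Gradient X^T (p - y) and Hessian X^T R X with R_ii = p_i * (1 - p_i).
    g = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] * xi[0] + w[1] * xi[1])
        r = p * (1 - p)
        for j in range(2):
            g[j] += (p - yi) * xi[j]
            for k in range(2):
                H[j][k] += r * xi[j] * xi[k]
    # Solve H d = g via the explicit 2x2 inverse, then step w <- w - d.
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    d0 = (H[1][1] * g[0] - H[0][1] * g[1]) / det
    d1 = (H[0][0] * g[1] - H[1][0] * g[0]) / det
    return [w[0] - d0, w[1] - d1]

w = [0.0, 0.0]
for _ in range(8):
    w = newton_update(w)
```

After a handful of iterations the update becomes a fixed point: applying `newton_update` again barely moves the weights, signalling convergence.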
Step-by-Step Optimisation Process
- Initialise: Set weights to zeros or small random values.
- Predict: Calculate the probabilities $\hat{y}_i = \sigma(w^\top x_i)$ for all training examples using the current weights.
- Compute Gradient: Calculate $\nabla_w \mathcal{L}$ using the difference between predictions and actual targets.
- Update Weights:
- For Gradient Descent: subtract $\eta \, \nabla_w \mathcal{L}$.
- For Newton-Raphson: calculate the Hessian $H$, invert it, and subtract $H^{-1} \nabla_w \mathcal{L}$.
- Repeat: Continue until the gradient is near zero or a maximum number of iterations is reached.
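The process above can be run with either optimiser and compared. A sketch with a single weight, so the Hessian is a scalar (data, learning rate, and tolerance are illustrative choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One-weight problem on hypothetical non-separable data.
X = [2.0, 1.0, 0.5, -0.5, -1.0, -2.0]
y = [1, 1, 0, 1, 0, 0]

def grad_and_hess(w):
    g = h = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(w * xi)
        g += (p - yi) * xi          # gradient term
        h += p * (1 - p) * xi * xi  # curvature (scalar Hessian) term
    return g, h

def steps_to_converge(update, tol=1e-8, max_iter=10_000):
    w, steps = 0.0, 0
    g, h = grad_and_hess(w)
    while abs(g) > tol and steps < max_iter:
        w = update(w, g, h)
        g, h = grad_and_hess(w)
        steps += 1
    return steps

gd_steps = steps_to_converge(lambda w, g, h: w - 0.1 * g)    # gradient descent
newton_steps = steps_to_converge(lambda w, g, h: w - g / h)  # Newton-Raphson
```

On this toy problem, Newton-Raphson converges in a handful of iterations, while gradient descent needs far more, illustrating the payoff of using curvature information.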