Lagrange Relaxation and Duality
In many machine learning problems, we need to minimise a function $f(x)$ subject to inequality constraints $g_i(x) \le 0$.
The Lagrangian Function
To solve this, we define the Lagrangian:
$$L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$$
- $\lambda_i$ (Lagrange Multipliers): Must be non-negative, i.e. $\lambda_i \ge 0$.
- The Penalty Logic: If a constraint is violated ($g_i(x) > 0$), then the term $\lambda_i g_i(x)$ acts as a penalty, increasing the objective value. If the constraint is satisfied ($g_i(x) \le 0$), the term is $\le 0$, "rewarding" the objective.
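The penalty logic can be sketched on a toy problem (an assumption for illustration): minimise $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$, i.e. $x \ge 1$:

```python
# Toy illustration (assumed example): L(x, lam) = f(x) + lam * g(x)
# for f(x) = x^2 and g(x) = 1 - x  (constraint x >= 1).

def f(x):
    return x ** 2

def g(x):
    return 1.0 - x  # <= 0 exactly when the constraint x >= 1 holds

def lagrangian(x, lam):
    return f(x) + lam * g(x)

# Violated constraint (x = 0.5, so g > 0): the lam * g(x) term penalises the objective.
print(lagrangian(0.5, 2.0))  # 0.25 + 2 * 0.5 = 1.25, larger than f(0.5) = 0.25

# Satisfied constraint (x = 2, so g < 0): the term is <= 0, "rewarding" the objective.
print(lagrangian(2.0, 2.0))  # 4 + 2 * (-1) = 2.0, smaller than f(2) = 4
```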
Primal and Dual Formulation
- Minimax Primal: $\min_x \max_{\lambda \ge 0} L(x, \lambda)$. The optimiser tries to find the best $x$ while the "adversary" $\lambda$ tries to penalise any constraint violation.
- Maxmin Dual: $\max_{\lambda \ge 0} \min_x L(x, \lambda)$. This formulation is often easier to solve because we can sometimes eliminate the inner minimisation using calculus.
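Eliminating the inner minimisation with calculus can be sketched on an assumed toy problem, minimise $x^2$ subject to $x \ge 1$: setting $\frac{d}{dx}[x^2 + \lambda(1 - x)] = 0$ gives $x = \lambda/2$, so the dual function is $q(\lambda) = \lambda - \lambda^2/4$:

```python
# Max-min dual of the toy problem min x^2 s.t. x >= 1 (assumed example).
# The inner minimisation is solved in closed form: x* = lam / 2.

def dual(lam):
    x = lam / 2.0                    # minimiser of the inner problem
    return x ** 2 + lam * (1.0 - x)  # equals lam - lam**2 / 4

# Crude grid search over lam >= 0, standing in for a proper solver.
best_lam = max((l / 100.0 for l in range(0, 501)), key=dual)
print(best_lam, dual(best_lam))  # lam* = 2.0, dual value 1.0 = primal optimum f(1)
```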
Tip
Strong Duality (where Min-Max = Max-Min) holds for most ML problems because they are convex and have a strictly feasible point, i.e. some $x$ where $g_i(x) < 0$ for all $i$ (Slater's condition).
Karush-Kuhn-Tucker (KKT) Conditions
For a convex optimisation problem, a solution is optimal if and only if it satisfies the following KKT conditions:
- Stationarity: The gradient of the Lagrangian with respect to the primal variables must be zero: $\nabla_x L(x, \lambda) = 0$.
- Complementary Slackness: $\lambda_i g_i(x) = 0$ for all $i$. This implies that either the multiplier is zero ($\lambda_i = 0$) or the constraint is "active" ($g_i(x) = 0$).
- Feasibility: The original primal constraints ($g_i(x) \le 0$) and the dual constraints ($\lambda_i \ge 0$) must be satisfied.
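The three conditions can be checked mechanically at a candidate solution. For the assumed toy problem minimise $x^2$ subject to $1 - x \le 0$, the candidate is $x^* = 1$, $\lambda^* = 2$:

```python
# KKT check at the candidate optimum of min x^2 s.t. 1 - x <= 0 (assumed example).

x_star, lam_star = 1.0, 2.0
g = 1.0 - x_star                     # constraint value at x*
grad_L = 2.0 * x_star - lam_star     # d/dx [x^2 + lam * (1 - x)]

assert grad_L == 0.0                 # Stationarity
assert lam_star * g == 0.0           # Complementary slackness (constraint is active)
assert g <= 0.0 and lam_star >= 0.0  # Primal and dual feasibility
print("KKT conditions satisfied")
```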
Deriving the SVM Dual Formulation
The original SVM Primal Problem aims to minimise the weight norm $\frac{1}{2}\|w\|^2$ while ensuring all points are outside the margin: $y_i(w^\top x_i + b) \ge 1$ for all $i$.
Step-by-Step Derivation
- Rewrite Constraints: Transform $y_i(w^\top x_i + b) \ge 1$ to $1 - y_i(w^\top x_i + b) \le 0$.
- Form the Lagrangian: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i \bigl(1 - y_i(w^\top x_i + b)\bigr)$
- Apply Stationarity (KKT):
  - Set $\frac{\partial L}{\partial w} = 0$: $w = \sum_i \alpha_i y_i x_i$
  - Set $\frac{\partial L}{\partial b} = 0$: $\sum_i \alpha_i y_i = 0$
- Substitute back into the Lagrangian: This yields the Dual Objective $W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$, which depends only on the multipliers $\alpha$.
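The substitution step can be written out in full, using the stationarity results $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$:

```latex
\begin{aligned}
L(w, b, \alpha)
  &= \tfrac{1}{2}\|w\|^2
     + \sum_i \alpha_i \bigl(1 - y_i(w^\top x_i + b)\bigr) \\
  &= \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j
     + \sum_i \alpha_i
     - \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j
     - b \sum_i \alpha_i y_i \\
  &= \sum_i \alpha_i
     - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j
  \;=\; W(\alpha),
\end{aligned}
```

where the $b$ term vanishes because $\sum_i \alpha_i y_i = 0$.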
Final SVM Dual Representation
$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$$
Subject to: $\alpha_i \ge 0$ for all $i$ and $\sum_i \alpha_i y_i = 0$
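A minimal sketch of solving the dual numerically, on an assumed 2-point toy dataset $x_1 = (1, 0)$, $y_1 = +1$ and $x_2 = (-1, 0)$, $y_2 = -1$. Here the constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = a$, and the dual objective reduces to $W(a) = 2a - 2a^2$, maximised by simple projected gradient ascent:

```python
# Toy SVM dual (assumed example): two points, one on each side of the margin.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1.0, -1.0]

def W(a):
    # Dual objective with alpha_1 = alpha_2 = a (forced by sum_i alpha_i y_i = 0).
    s = 0.0
    for i in range(2):
        for j in range(2):
            k_ij = X[i][0] * X[j][0] + X[i][1] * X[j][1]  # inner product x_i . x_j
            s += a * a * y[i] * y[j] * k_ij
    return 2.0 * a - 0.5 * s  # reduces to 2a - 2a^2 for this dataset

# Projected gradient ascent on W(a) = 2a - 2a^2, keeping a >= 0.
a = 0.0
for _ in range(200):
    grad = 2.0 - 4.0 * a
    a = max(0.0, a + 0.1 * grad)

# Recover the primal weights: w = sum_i alpha_i y_i x_i.
w = [sum(a * y[i] * X[i][d] for i in range(2)) for d in range(2)]
print(a, W(a), w)  # a -> 0.5, W -> 0.5, w -> [1.0, 0.0]
```

Both points end up with $\alpha_i > 0$, i.e. both are support vectors, and $w = (1, 0)$ gives margin $y_i(w^\top x_i) = 1$ for each.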
The Kernel Trick
The dual formulation highlights that the optimisation only depends on the inner products of the feature vectors, $x_i^\top x_j$:
- Intuition: Instead of mapping data to a massive $D$-dimensional space (which is expensive), we use a Kernel Function $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ to compute the inner product directly in the original space.
- Example (Polynomial Kernel): For a $d$-dimensional input $x$, a degree-$p$ mapping $\phi$ leads to $K(x, z) = \phi(x)^\top \phi(z) = (x^\top z)^p$, computed in $O(d)$ time.
This is much faster than calculating the roughly $O(d^p)$-dimensional vector $\phi(x)$ for every point.
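A concrete check, assuming the degree-2 homogeneous polynomial kernel on 2-d inputs, whose explicit map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import math

# Kernel trick sketch (assumed example): for the degree-2 homogeneous polynomial
# kernel, phi(x) . phi(z) equals (x . z)^2 without ever building phi.

def phi(x):
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def kernel(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2  # computed in the original 2-d space

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(kernel(x, z))                          # (1*3 + 2*4)^2 = 121.0
print(math.isclose(explicit, kernel(x, z)))  # True, up to float rounding
```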
Efficiency Comparison
- Primal: Complexity depends on the dimensionality of the feature space ($D$ variables).
- Dual: Complexity depends on the number of training examples ($n$ variables).
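To make the comparison concrete, here is a variable-count sketch under an assumed setup: a polynomial map of degree $p$ on $d$ input features has $\binom{d+p}{p}$ monomial features (including the constant), so the primal would optimise that many weights, while the dual has one $\alpha_i$ per training example:

```python
from math import comb

# Assumed setup: 100 input features, degree-3 polynomial map, 5000 examples.
d, p, n = 100, 3, 5000

primal_vars = comb(d + p, p)  # number of monomials of degree <= p
dual_vars = n                 # one multiplier per training example

print(primal_vars, dual_vars)  # 176851 vs 5000
```

With these (assumed) numbers the dual has over 30x fewer variables, which is one reason kernel SVMs are solved in the dual.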