Lagrange Relaxation and Duality
In many machine learning problems, we need to minimise a function $f(x)$ subject to inequality constraints $g_i(x) \le 0$.
The Lagrangian Function
To solve this, we define the Lagrangian:
$$L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$$
- $\lambda_i$ (Lagrange Multipliers): Must be non-negative, i.e. $\lambda_i \ge 0$.
- The Penalty Logic: If a constraint is violated ($g_i(x) > 0$), then the term $\lambda_i g_i(x)$ acts as a penalty, increasing the objective value. If the constraint is satisfied ($g_i(x) \le 0$), the term is $\le 0$, "rewarding" the objective.
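The penalty logic can be sketched on a toy problem (an assumption for illustration): minimise $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$, i.e. $x \ge 1$:

```python
# Toy illustration (assumed example): L(x, lam) = f(x) + lam * g(x)
# for f(x) = x^2 and g(x) = 1 - x  (constraint x >= 1).

def f(x):
    return x ** 2

def g(x):
    return 1.0 - x  # <= 0 exactly when the constraint x >= 1 holds

def lagrangian(x, lam):
    return f(x) + lam * g(x)

# Violated constraint (x = 0.5, so g > 0): the lam * g(x) term penalises the objective.
print(lagrangian(0.5, 2.0))  # 0.25 + 2 * 0.5 = 1.25, larger than f(0.5) = 0.25

# Satisfied constraint (x = 2, so g < 0): the term is <= 0, "rewarding" the objective.
print(lagrangian(2.0, 2.0))  # 4 + 2 * (-1) = 2.0, smaller than f(2) = 4
```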
Primal and Dual Formulation
- Minimax Primal: $\min_x \max_{\lambda \ge 0} L(x, \lambda)$. The optimiser tries to find the best $x$ while the "adversary" $\lambda$ tries to penalise any constraint violation.
- Maxmin Dual: $\max_{\lambda \ge 0} \min_x L(x, \lambda)$. This formulation is often easier to solve because we can sometimes eliminate the inner minimisation using calculus.
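Eliminating the inner minimisation with calculus can be sketched on an assumed toy problem, minimise $x^2$ subject to $x \ge 1$: setting $\frac{d}{dx}[x^2 + \lambda(1 - x)] = 0$ gives $x = \lambda/2$, so the dual function is $q(\lambda) = \lambda - \lambda^2/4$:

```python
# Max-min dual of the toy problem min x^2 s.t. x >= 1 (assumed example).
# The inner minimisation is solved in closed form: x* = lam / 2.

def dual(lam):
    x = lam / 2.0                    # minimiser of the inner problem
    return x ** 2 + lam * (1.0 - x)  # equals lam - lam**2 / 4

# Crude grid search over lam >= 0, standing in for a proper solver.
best_lam = max((l / 100.0 for l in range(0, 501)), key=dual)
print(best_lam, dual(best_lam))  # lam* = 2.0, dual value 1.0 = primal optimum f(1)
```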
Tip
Strong Duality (where Min-Max = Max-Min) holds for most ML problems because they are convex and have a strictly feasible point, i.e. some $x$ where $g_i(x) < 0$ for all $i$ (Slater's condition).
Karush-Kuhn-Tucker (KKT) Conditions
For a convex optimisation problem, a solution is optimal if and only if it satisfies the following KKT conditions:
- Stationarity: The gradient of the Lagrangian with respect to the primal variables must be zero: $\nabla_x L(x, \lambda) = 0$.
- Complementary Slackness: $\lambda_i g_i(x) = 0$ for all $i$. This implies that either the multiplier is zero ($\lambda_i = 0$) or the constraint is "active" ($g_i(x) = 0$).
- Feasibility: The original primal constraints ($g_i(x) \le 0$) and the dual constraints ($\lambda_i \ge 0$) must be satisfied.
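The three conditions can be checked mechanically at a candidate solution. For the assumed toy problem minimise $x^2$ subject to $1 - x \le 0$, the candidate is $x^* = 1$, $\lambda^* = 2$:

```python
# KKT check at the candidate optimum of min x^2 s.t. 1 - x <= 0 (assumed example).

x_star, lam_star = 1.0, 2.0
g = 1.0 - x_star                     # constraint value at x*
grad_L = 2.0 * x_star - lam_star     # d/dx [x^2 + lam * (1 - x)]

assert grad_L == 0.0                 # Stationarity
assert lam_star * g == 0.0           # Complementary slackness (constraint is active)
assert g <= 0.0 and lam_star >= 0.0  # Primal and dual feasibility
print("KKT conditions satisfied")
```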
Deriving the SVM Dual Formulation
The original SVM Primal Problem aims to minimise the weight norm $\frac{1}{2}\|w\|^2$ while ensuring all points are outside the margin: $y_i(w^\top x_i + b) \ge 1$ for all $i$.
Step-by-Step Derivation
- Rewrite Constraints: Transform $y_i(w^\top x_i + b) \ge 1$ to $1 - y_i(w^\top x_i + b) \le 0$.
- Form the Lagrangian: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i \bigl(1 - y_i(w^\top x_i + b)\bigr)$
- Apply Stationarity (KKT):
  - Set $\frac{\partial L}{\partial w} = 0$: $w = \sum_i \alpha_i y_i x_i$
  - Set $\frac{\partial L}{\partial b} = 0$: $\sum_i \alpha_i y_i = 0$
- Substitute back into the Lagrangian: This yields the Dual Objective $W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$, which depends only on the multipliers $\alpha$.
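The substitution step can be written out in full, using the stationarity results $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$:

```latex
\begin{aligned}
L(w, b, \alpha)
  &= \tfrac{1}{2}\|w\|^2
     + \sum_i \alpha_i \bigl(1 - y_i(w^\top x_i + b)\bigr) \\
  &= \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j
     + \sum_i \alpha_i
     - \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j
     - b \sum_i \alpha_i y_i \\
  &= \sum_i \alpha_i
     - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j
  \;=\; W(\alpha),
\end{aligned}
```

where the $b$ term vanishes because $\sum_i \alpha_i y_i = 0$.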
Final SVM Dual Representation
$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$$
Subject to: $\alpha_i \ge 0$ for all $i$ and $\sum_i \alpha_i y_i = 0$
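A minimal sketch of solving the dual numerically, on an assumed 2-point toy dataset $x_1 = (1, 0)$, $y_1 = +1$ and $x_2 = (-1, 0)$, $y_2 = -1$. Here the constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = a$, and the dual objective reduces to $W(a) = 2a - 2a^2$, maximised by simple projected gradient ascent:

```python
# Toy SVM dual (assumed example): two points, one on each side of the margin.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1.0, -1.0]

def W(a):
    # Dual objective with alpha_1 = alpha_2 = a (forced by sum_i alpha_i y_i = 0).
    s = 0.0
    for i in range(2):
        for j in range(2):
            k_ij = X[i][0] * X[j][0] + X[i][1] * X[j][1]  # inner product x_i . x_j
            s += a * a * y[i] * y[j] * k_ij
    return 2.0 * a - 0.5 * s  # reduces to 2a - 2a^2 for this dataset

# Projected gradient ascent on W(a) = 2a - 2a^2, keeping a >= 0.
a = 0.0
for _ in range(200):
    grad = 2.0 - 4.0 * a
    a = max(0.0, a + 0.1 * grad)

# Recover the primal weights: w = sum_i alpha_i y_i x_i.
w = [sum(a * y[i] * X[i][d] for i in range(2)) for d in range(2)]
print(a, W(a), w)  # a -> 0.5, W -> 0.5, w -> [1.0, 0.0]
```

Both points end up with $\alpha_i > 0$, i.e. both are support vectors, and $w = (1, 0)$ gives margin $y_i(w^\top x_i) = 1$ for each.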
The Kernel Trick
The dual formulation highlights that the optimisation only depends on the inner products of the feature vectors, $x_i^\top x_j$:
- Intuition: Instead of mapping data to a massive $D$-dimensional space (which is expensive), we use a Kernel Function $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ to compute the inner product directly in the original space.
- Example (Polynomial Kernel): For a $d$-dimensional input $x$, a degree-$p$ mapping $\phi$ leads to $K(x, z) = \phi(x)^\top \phi(z) = (x^\top z)^p$, computed in $O(d)$ time.
This is much faster than calculating the roughly $O(d^p)$-dimensional vector $\phi(x)$ for every point.
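A concrete check, assuming the degree-2 homogeneous polynomial kernel on 2-d inputs, whose explicit map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import math

# Kernel trick sketch (assumed example): for the degree-2 homogeneous polynomial
# kernel, phi(x) . phi(z) equals (x . z)^2 without ever building phi.

def phi(x):
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def kernel(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2  # computed in the original 2-d space

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(kernel(x, z))                          # (1*3 + 2*4)^2 = 121.0
print(math.isclose(explicit, kernel(x, z)))  # True, up to float rounding
```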
Efficiency Comparison
- Primal: Complexity depends on the dimensionality of the feature space ($D$ variables).
- Dual: Complexity depends on the number of training examples ($n$ variables).
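To make the comparison concrete, here is a variable-count sketch under an assumed setup: a polynomial map of degree $p$ on $d$ input features has $\binom{d+p}{p}$ monomial features (including the constant), so the primal would optimise that many weights, while the dual has one $\alpha_i$ per training example:

```python
from math import comb

# Assumed setup: 100 input features, degree-3 polynomial map, 5000 examples.
d, p, n = 100, 3, 5000

primal_vars = comb(d + p, p)  # number of monomials of degree <= p
dual_vars = n                 # one multiplier per training example

print(primal_vars, dual_vars)  # 176851 vs 5000
```

With these (assumed) numbers the dual has over 30x fewer variables, which is one reason kernel SVMs are solved in the dual.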