Lagrange Relaxation and Duality

In many machine learning problems, we need to minimise a function $f(x)$ subject to certain inequality constraints $g_i(x) \le 0$.

The Lagrangian Function

To solve this, we define the Lagrangian:

$$\mathcal{L}(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$$

  • $\lambda_i$ (Lagrange Multipliers): Must be $\lambda_i \ge 0$.
  • The Penalty Logic: If a constraint is violated ($g_i(x) > 0$), then the term $\lambda_i g_i(x)$ acts as a penalty, increasing the objective value. If the constraint is satisfied ($g_i(x) \le 0$), the term is $\le 0$, “rewarding” the objective.
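To make the penalty logic concrete, here is a minimal sketch on a hypothetical 1-D problem (not from the notes above): minimise $f(x) = x^2$ subject to $x \ge 1$, i.e. $g(x) = 1 - x \le 0$.

```python
# Toy problem: minimise f(x) = x^2 subject to g(x) = 1 - x <= 0 (i.e. x >= 1).
def f(x):
    return x ** 2

def g(x):
    return 1 - x  # constraint, must satisfy g(x) <= 0

def lagrangian(x, lam):
    # L(x, lambda) = f(x) + lambda * g(x)
    return f(x) + lam * g(x)

lam = 2.0
# Violating point (x = 0): g(0) = 1 > 0, so the term adds a penalty.
print(lagrangian(0.0, lam))  # 0 + 2*1 = 2.0, larger than f(0) = 0
# Feasible point (x = 2): g(2) = -1 <= 0, so the term lowers the value.
print(lagrangian(2.0, lam))  # 4 + 2*(-1) = 2.0, smaller than f(2) = 4
```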

Primal and Dual Formulation

  1. Minimax Primal: $\min_x \max_{\lambda \ge 0} \mathcal{L}(x, \lambda)$. The optimiser tries to find the best $x$ while the “adversary” $\lambda$ tries to penalise any constraint violation.
  2. Maxmin Dual: $\max_{\lambda \ge 0} \min_x \mathcal{L}(x, \lambda)$. This formulation is often easier to solve because we can sometimes eliminate the inner minimisation using calculus.
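As a sketch of the max-min route, consider the hypothetical 1-D problem of minimising $x^2$ subject to $x \ge 1$ (primal optimum $f^* = 1$ at $x = 1$). The inner minimisation is eliminated by calculus, leaving a concave function of $\lambda$ alone to maximise:

```python
# Dual of: minimise x^2 subject to x >= 1.
# Inner minimisation by calculus: dL/dx = 2x - lam = 0  =>  x(lam) = lam / 2.
def dual(lam):
    x = lam / 2.0                     # argmin_x L(x, lam)
    return x ** 2 + lam * (1.0 - x)   # g(lam) = lam - lam^2 / 4

# Maximise g over lam >= 0 on a grid; the maximum sits at lam = 2 with
# g(2) = 1, matching the primal optimum (strong duality).
best = max((dual(l / 100.0), l / 100.0) for l in range(0, 401))
print(best)  # (1.0, 2.0)
```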

Tip

Strong Duality (where Min-Max = Max-Min) holds for most ML problems because they are convex and satisfy Slater's condition: the feasible region contains a point $x$ where $g_i(x) < 0$ for all $i$.

Karush-Kuhn-Tucker (KKT) Conditions

For a convex optimisation problem, a solution is optimal if and only if it satisfies the following KKT conditions:

  1. Stationarity: The gradient of the Lagrangian with respect to the primal variables must be zero: $\nabla_x \mathcal{L}(x, \lambda) = 0$.
  2. Complementary Slackness: $\lambda_i g_i(x) = 0$ for all $i$. This implies that either the multiplier is zero ($\lambda_i = 0$) or the constraint is “active” ($g_i(x) = 0$).
  3. Feasibility: The original primal constraints ($g_i(x) \le 0$) and the dual constraints ($\lambda_i \ge 0$) must be satisfied.
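The three conditions can be checked numerically. A minimal sketch, assuming the hypothetical toy problem $\min x^2$ s.t. $1 - x \le 0$ with candidate solution $x^* = 1$, $\lambda^* = 2$:

```python
# KKT check for: min x^2 s.t. g(x) = 1 - x <= 0, at x* = 1, lam* = 2.
x_star, lam_star = 1.0, 2.0

grad_L = 2 * x_star - lam_star          # stationarity: d/dx [x^2 + lam(1 - x)]
slack  = lam_star * (1.0 - x_star)      # complementary slackness: lam * g(x)
primal_feasible = (1.0 - x_star) <= 0   # g(x*) <= 0
dual_feasible   = lam_star >= 0         # lam* >= 0

print(grad_L, slack, primal_feasible, dual_feasible)  # 0.0 0.0 True True
```

All four quantities confirm optimality, consistent with the if-and-only-if claim for convex problems.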

Deriving the SVM Dual Formulation

The original SVM Primal Problem aims to minimise the weights, $\min_{w,b} \frac{1}{2}\|w\|^2$, while ensuring all points are outside the margin: $y_i(w^\top x_i + b) \ge 1$ for all $i$.

Step-by-Step Derivation

  1. Rewrite Constraints: Transform $y_i(w^\top x_i + b) \ge 1$ to $1 - y_i(w^\top x_i + b) \le 0$.
  2. Form the Lagrangian: $\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i \left(1 - y_i(w^\top x_i + b)\right)$
  3. Apply Stationarity (KKT):
    • Set $\nabla_w \mathcal{L} = 0$: $w = \sum_i \alpha_i y_i x_i$
    • Set $\frac{\partial \mathcal{L}}{\partial b} = 0$: $\sum_i \alpha_i y_i = 0$
  4. Substitute back into the Lagrangian: This yields the Dual Objective $W(\alpha)$, which only depends on the multipliers $\alpha$.
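The substitution step can be written out explicitly. Using the stationarity results $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$:

```latex
\begin{aligned}
W(\alpha)
&= \frac{1}{2}\,w^\top w + \sum_i \alpha_i
   - w^\top \underbrace{\sum_i \alpha_i y_i x_i}_{=\,w}
   - b \underbrace{\sum_i \alpha_i y_i}_{=\,0} \\
&= \sum_i \alpha_i - \frac{1}{2}\,w^\top w \\
&= \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i^\top x_j
\end{aligned}
```

Both $w$ and $b$ drop out, leaving an objective in $\alpha$ alone.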

Final SVM Dual Representation

$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$

Subject to: $\alpha_i \ge 0$ for all $i$ and $\sum_i \alpha_i y_i = 0$
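The dual can be solved numerically. A minimal sketch using projected gradient ascent on a hypothetical two-point 1-D dataset (the dataset and step size are illustrative, not from these notes):

```python
import numpy as np

# Hypothetical toy set: x = +1 labelled +1, x = -1 labelled -1.
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
G = (y[:, None] * y[None, :]) * (X @ X.T)  # G_ij = y_i y_j <x_i, x_j>

# Projected gradient ascent on W(a) = sum(a) - 0.5 a^T G a,
# subject to a >= 0 and a . y = 0.
a = np.zeros(2)
lr = 0.1
for _ in range(200):
    grad = 1.0 - G @ a                 # gradient of the dual objective
    a = a + lr * grad                  # ascent step
    a = a - (a @ y) / (y @ y) * y      # project onto the hyperplane a . y = 0
    a = np.clip(a, 0.0, None)          # project onto a >= 0

w = (a * y) @ X                        # recover w = sum_i a_i y_i x_i
print(a, w)                            # approx [0.5 0.5] [1.0]
```

Both points end up with $\alpha_i > 0$, i.e. both are support vectors, and the recovered $w = 1$ separates them with margin 1.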

The Kernel Trick

The dual formulation highlights that the optimisation only depends on the inner product of the feature vectors: $\langle x_i, x_j \rangle$.

  • Intuition: Instead of mapping data to a massive $D$-dimensional space via $\phi$ (which is expensive), we use a Kernel Function $K(x, z) = \langle \phi(x), \phi(z) \rangle$ to compute the inner product directly in the original space.
  • Example (Polynomial Kernel): For an input $x \in \mathbb{R}^d$, a $2$-degree mapping $\phi$ leads to: $\langle \phi(x), \phi(z) \rangle = (x^\top z)^2$

This is much faster than calculating the $d^2$-dimensional vector $\phi(x)$ for every point.
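A quick numerical check of this claim, assuming the explicit degree-2 feature map $\phi(x) = (x_i x_j)_{i,j}$ (the function names below are illustrative):

```python
import numpy as np

# Degree-2 polynomial kernel: K(x, z) = (x . z)^2 equals the inner product
# of explicit quadratic feature maps with entries phi(x)_{ij} = x_i * x_j.
def phi(x):
    return np.outer(x, x).ravel()      # explicit d^2-dimensional map

def poly_kernel(x, z):
    return float(np.dot(x, z)) ** 2    # O(d) computation in input space

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

print(phi(x) @ phi(z))       # 20.25
print(poly_kernel(x, z))     # 20.25 -- same value, without building phi
```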

Efficiency Comparison

  • Primal: Complexity depends on the dimensionality of the feature space ($d$ weight variables).
  • Dual: Complexity depends on the number of training examples ($n$ multiplier variables $\alpha_i$).