Standard “Hard Margin” SVMs can suffer from “overfitting” because they attempt to classify every single training point perfectly. This often leads to fitting the noise in the data rather than the underlying pattern, which degrades performance on unseen data.
By allowing a Soft Margin, the model can ignore certain outliers or noisy points, resulting in a simpler decision boundary that typically generalises better.
Slack Variables
To implement a soft margin, we introduce a slack variable $\xi_i \geq 0$ for every training example $i$. These variables measure the “error” or “displacement” of a point relative to its ideal position.
Intuition of Slack Values
The value of $\xi_i$ tells us exactly where a point lies in relation to the decision boundary and the margin.
- $\xi_i = 0$: The point is correctly classified and lies either on or outside the margin.
- $0 < \xi_i < 1$: The point is correctly classified but falls within the margin area.
- $\xi_i = 1$: The point sits exactly on the decision boundary.
- $\xi_i > 1$: The point is misclassified (on the wrong side of the decision boundary).
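The four cases above can be checked numerically. A minimal sketch (with hypothetical, not fitted, values of $\mathbf{w}$ and $b$), using the fact that the smallest feasible slack is $\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b))$:

```python
import numpy as np

# Hypothetical (not fitted) parameters of a linear decision boundary.
w = np.array([1.0, 1.0])
b = -1.0

# One positive-class point for each slack regime.
X = np.array([
    [2.0, 2.0],    # far on the correct side         -> xi = 0
    [0.8, 0.5],    # correct side, inside the margin -> 0 < xi < 1
    [0.5, 0.5],    # exactly on the boundary         -> xi = 1
    [-1.0, -1.0],  # wrong side of the boundary      -> xi > 1
])
y = np.array([1.0, 1.0, 1.0, 1.0])

# Slack of each point: xi_i = max(0, 1 - y_i (w . x_i + b)).
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)
```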
Tip
In exam problems, if a point is “correctly classified but inside the margin”, its slack variable must satisfy $0 < \xi_i < 1$.
The Primal Optimisation Problem
The goal is to find a balance between a large margin and small classification errors.
The Objective Function
We minimise the following:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{N} \xi_i$$
Subject to:
- $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i$ (modified margin constraint)
- $\xi_i \geq 0$ for all $i$.
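A minimal sketch of evaluating this objective (the numbers are made up for illustration): for a fixed $(\mathbf{w}, b)$, the smallest slacks satisfying the constraints are $\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b))$, so the objective can be computed directly:

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    # Smallest slacks satisfying y_i (w . x_i + b) >= 1 - xi_i and xi_i >= 0.
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    # 0.5 * ||w||^2 + C * (sum of slacks).
    return 0.5 * float(w @ w) + C * float(xi.sum())

# Toy data: the third point violates the margin by 0.5.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.25, 0.0]])
y = np.array([1.0, -1.0, 1.0])
w, b, C = np.array([2.0, 0.0]), 0.0, 1.0

print(primal_objective(w, b, X, y, C))  # 0.5 * 4 + 1.0 * 0.5 = 2.5
```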
The Role of the Hyperparameter $C$
$C$ acts as a “penalty” for errors.
- Large $C$: Penalise slack heavily, forcing the model to behave like a Hard Margin SVM with a narrower margin and fewer errors.
- Small $C$: More tolerant of slack, allowing a wider margin even if it means more training points are misclassified or fall within the margin.
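This trade-off can be observed empirically. A sketch using scikit-learn's `SVC` (assumed available) on synthetic overlapping blobs: because a smaller $C$ tolerates more slack, the optimiser settles for a smaller $\lVert\mathbf{w}\rVert$, i.e. a wider margin $2/\lVert\mathbf{w}\rVert$:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic overlapping blobs: some slack is unavoidable, so C matters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

small_c = SVC(kernel="linear", C=0.01).fit(X, y)
large_c = SVC(kernel="linear", C=100.0).fit(X, y)

# Margin width is 2 / ||w||: the small-C model keeps ||w|| smaller.
w_small = float(np.linalg.norm(small_c.coef_))
w_large = float(np.linalg.norm(large_c.coef_))
print(w_small, w_large)
```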
The Dual Formulation
To solve the optimisation efficiently (especially with kernels), we convert the primal problem into a dual representation using Lagrange multipliers.
The Dual Objective

$$\max_{\boldsymbol{\alpha}} \;\; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j$$
Subject to (Box Constraints):
- $0 \leq \alpha_i \leq C$ for all $i$
- $\sum_{i=1}^{N} \alpha_i y_i = 0$
Important
The primary difference between the Hard Margin and Soft Margin duals is the addition of the upper bound $C$ on the Lagrange multipliers: $0 \leq \alpha_i \leq C$ instead of $\alpha_i \geq 0$.
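These constraints can be verified on a trained model. A sketch with scikit-learn's `SVC` on synthetic data, using the fact that its `dual_coef_` attribute stores $\alpha_i y_i$ for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.8, (30, 2)), rng.normal(1, 0.8, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

signed_alphas = clf.dual_coef_.ravel()  # alpha_i * y_i per support vector
alphas = np.abs(signed_alphas)          # recover alpha_i (alphas are >= 0)

# Box constraint: 0 <= alpha_i <= C for every support vector.
print(alphas.min(), alphas.max())
# Equality constraint of the dual: sum_i alpha_i y_i = 0.
print(signed_alphas.sum())
```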
Predictions and Support Vectors
Once the model is trained, we make predictions for a new point $\mathbf{x}$ using:

$$f(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{i=1}^{N} \alpha_i y_i \, \mathbf{x}_i^\top \mathbf{x} + b\right)$$
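This prediction rule can be reconstructed by hand from a fitted model. A sketch with scikit-learn's `SVC` on synthetic data: only the support vectors enter the sum, and the hand-built score should match the library's decision function:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 0.7, (25, 2)), rng.normal(1, 0.7, (25, 2))])
y = np.array([-1] * 25 + [1] * 25)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

sv = clf.support_vectors_       # the x_i with alpha_i > 0
coef = clf.dual_coef_.ravel()   # alpha_i * y_i for those points
b = clf.intercept_[0]

# f(x) = sum_i alpha_i y_i <x_i, x> + b, using support vectors only.
x_new = np.array([0.3, -0.2])
f = float(coef @ (sv @ x_new) + b)
print(f, np.sign(f))
```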
Identifying Support Vectors
In a Soft Margin SVM, support vectors are all examples where $\alpha_i > 0$.
- If $0 < \alpha_i < C$, the point lies exactly on the margin ($\xi_i = 0$).
- If $\alpha_i = C$, the point is a non-margin support vector: it lies inside the margin or is misclassified ($\xi_i > 0$).
Calculating the Bias
The bias term $b$ is calculated by averaging over the set $S$ of support vectors that lie exactly on the margin ($0 < \alpha_i < C$):

$$b = \frac{1}{|S|}\sum_{i \in S}\left(y_i - \sum_{j=1}^{N} \alpha_j y_j \, \mathbf{x}_j^\top \mathbf{x}_i\right)$$
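A sketch of this computation checked against scikit-learn's `SVC` (synthetic, reasonably separated data so that margin support vectors with $0 < \alpha_i < C$ exist): the averaged bias should agree with the library's fitted intercept:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.5, 0.5, (25, 2)), rng.normal(1.5, 0.5, (25, 2))])
y = np.array([-1] * 25 + [1] * 25)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

coef = clf.dual_coef_.ravel()   # alpha_i * y_i
sv = clf.support_vectors_
alphas = np.abs(coef)
y_sv = np.sign(coef)            # labels of the support vectors

# Margin support vectors sit strictly inside the box: 0 < alpha_i < C.
on_margin = (alphas > 1e-6) & (alphas < C - 1e-6)

# b = average over margin SVs of  y_i - sum_j alpha_j y_j <x_j, x_i>.
K = sv @ sv.T
b_est = float(np.mean(y_sv[on_margin] - (coef @ K)[on_margin]))
print(b_est)
```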
Step-by-Step: From Primal to Dual
- Define Constraints: Express each constraint in the form $g(\cdot) \leq 0$. For the soft-margin SVM, these are $1 - \xi_i - y_i(\mathbf{w}^\top \mathbf{x}_i + b) \leq 0$ and $-\xi_i \leq 0$.
- Lagrange Relaxation: Create the Lagrangian function $\mathcal{L}$ by adding the constraints multiplied by Lagrange multipliers $\alpha_i \geq 0$ and $\mu_i \geq 0$.
- KKT Stationarity: Take derivatives of $\mathcal{L}$ with respect to the primal variables ($\mathbf{w}$, $b$, $\xi_i$) and set them to zero.
- Substitution: Substitute these back into the Lagrangian to eliminate the primal variables, resulting in the dual function.
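The steps above, written out for the soft-margin primal (standard derivation, with $\alpha_i \geq 0$ and $\mu_i \geq 0$ the multipliers on the margin and non-negativity constraints respectively):

```latex
\begin{aligned}
\mathcal{L}(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu})
  &= \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{N}\xi_i
     - \sum_{i=1}^{N}\alpha_i\bigl[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i\bigr]
     - \sum_{i=1}^{N}\mu_i\xi_i \\
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 0
  &\;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i \\
\frac{\partial \mathcal{L}}{\partial b} = 0
  &\;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y_i = 0 \\
\frac{\partial \mathcal{L}}{\partial \xi_i} = 0
  &\;\Rightarrow\; C - \alpha_i - \mu_i = 0
\end{aligned}
```

The last condition is where the box constraint comes from: since $\mu_i \geq 0$ and $\alpha_i = C - \mu_i$, each multiplier is capped at $C$, giving $0 \leq \alpha_i \leq C$.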