Linear Classifier Foundation

In a supervised learning problem, we are given a set of training examples drawn from an unknown distribution. Our goal is to learn a function that generalises well to unseen data.
The decision boundary for a linear classifier is a hyperplane:

  • 2D space: A line (w₁x₁ + w₂x₂ + b = 0).
  • 3D space: A plane.
  • d-Dimensional space: A hyperplane (w·x + b = 0). The classification rule is:
  • If w·x + b ≥ 0: Class +1
  • If w·x + b < 0: Class -1
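The rule above can be sketched in a few lines of Python; the weight vector w and bias b here are hypothetical values chosen purely for illustration:

```python
# Minimal sketch of the linear classification rule sign(w·x + b).
# w and b are made-up values, not trained parameters.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, b, x):
    """Return +1 if w·x + b >= 0, otherwise -1."""
    return 1 if dot(w, x) + b >= 0 else -1

w, b = [2.0, -1.0], 0.5
print(classify(w, b, [1.0, 1.0]))   # w·x + b = 1.5 >= 0, so +1
print(classify(w, b, [-1.0, 2.0]))  # w·x + b = -3.5 < 0, so -1
```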

Why Maximise the Margin

While many hyperplanes can separate a dataset, SVMs seek the one with the maximum margin.

  • Overfitting Avoidance: A boundary too close to training examples is sensitive to noise. Maximising the distance helps the model generalise to new points.
  • Support Vectors: Not every data point is equal. Only the points closest to the boundary (the Support Vectors) actually "support" or define the hyperplane. Moving other points doesn't change the boundary as long as they stay outside the margin.
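The "only support vectors matter" claim can be checked with a toy experiment. The sketch below uses a deliberately simplified 1D max-margin threshold search (a grid search over thresholds, not a real SVM solver); all data values are made up:

```python
# Sketch: only the closest points determine the max-margin boundary.
# 1D classifier sign(x - t); we grid-search the threshold t that
# maximises the worst-case signed distance min_i y_i * (x_i - t).
def max_margin_threshold(data):
    best_t, best_margin = None, -float("inf")
    for i in range(-400, 401):
        t = i / 100
        margin = min(y * (x - t) for x, y in data)
        if margin > best_margin:
            best_margin, best_t = margin, t
    return best_t

data1 = [(-1.0, -1), (1.0, 1), (3.0, 1)]
data2 = [(-1.0, -1), (1.0, 1), (5.0, 1)]  # far point moved even farther away
# The boundary depends only on the closest points x = -1 and x = 1:
print(max_margin_threshold(data1), max_margin_threshold(data2))  # 0.0 0.0
```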

The Optimisation Problem

To find the best hyperplane, we maximise the perpendicular distance from the hyperplane to the closest training point (the margin).

Perpendicular Distance Formula

The distance from a point x to the hyperplane w·x + b = 0 is:

  d(x) = |w·x + b| / ||w||

where ||w|| is the Euclidean norm (length) of the weight vector.
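As a quick numeric check, the formula can be coded directly; the hyperplane and point below are made up for the example:

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance |w·x + b| / ||w|| from point x to the hyperplane."""
    dot_val = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot_val + b) / norm

# Hypothetical hyperplane 3x1 + 4x2 - 5 = 0 and point (1, 1):
# |3 + 4 - 5| / sqrt(9 + 16) = 2 / 5 = 0.4
d = distance_to_hyperplane([3.0, 4.0], -5.0, [1.0, 1.0])
print(d)  # 0.4
```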

Constraints

For the classifier to be valid, all training examples must be correctly classified:

  yᵢ(w·xᵢ + b) > 0 for all i

where yᵢ ∈ {-1, +1} is the label of training example xᵢ.

Deriving the Primal Form

  • Initial Goal: Maximise the margin, minᵢ |w·xᵢ + b| / ||w||, over w and b.
  • The Rescaling Trick: Rescaling w and b by a constant does not change the hyperplane. We can choose a scale such that for the closest points, yᵢ(w·xᵢ + b) = 1.
  • Simplified Constraint: This implies all other points satisfy yᵢ(w·xᵢ + b) ≥ 1.
  • Final Objective: Maximising 1/||w|| is mathematically identical to minimising ½||w||².

IMPORTANT

The Primal SVM Optimisation Task:

  minimise (over w, b):  ½||w||²

Subject to:

  yᵢ(w·xᵢ + b) ≥ 1 for all i
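On a tiny dataset the primal problem can even be solved by brute force. The sketch below grid-searches (w, b) for a made-up 1D problem whose analytic optimum is w = 1, b = 0; it is a didactic check, not how SVMs are solved in practice (real solvers use quadratic programming):

```python
# Brute-force sketch of the primal: minimise 0.5*w^2
# subject to y_i * (w*x_i + b) >= 1, on a toy 1D dataset.
data = [(-1.0, -1), (1.0, 1)]  # (x, y) pairs; analytic optimum is w = 1, b = 0

def feasible(w, b):
    return all(y * (w * x + b) >= 1 for x, y in data)

best = None
grid = [i / 100 for i in range(-300, 301)]
for w in grid:
    for b in grid:
        if feasible(w, b):
            obj = 0.5 * w * w
            if best is None or obj < best[0]:
                best = (obj, w, b)

obj, w, b = best
print(w, b)  # matches the analytic optimum w = 1.0, b = 0.0
```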

Handling Nonlinear Problems

When data is not linearly separable (e.g., one class forms a circle inside another), we apply a nonlinear transformation φ to map the input data into a higher-dimensional feature space.

  • New Hypothesis: Classify by the sign of w·φ(x) + b, i.e. a linear boundary in the feature space.
  • Linear SVM: If we use φ(x) = x, we have a standard linear SVM.
  • Polynomial Embedding: For example, using a degree-2 polynomial can create a curved boundary (like an ellipse) in the original space to separate circular data.
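A minimal sketch of that idea, using one common (assumed) degree-2 feature map and hand-picked weights rather than a trained SVM:

```python
import math

# Degree-2 feature map phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2),
# one standard embedding assumed here for illustration.
def phi(x1, x2):
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

# Toy circular data: inner circle (radius 0.5, class -1)
# inside an outer ring (radius 2.0, class +1).
points = []
for k in range(8):
    a = 2 * math.pi * k / 8
    points.append(((0.5 * math.cos(a), 0.5 * math.sin(a)), -1))
    points.append(((2.0 * math.cos(a), 2.0 * math.sin(a)), 1))

# In feature space, the *linear* rule z1 + z2 - 1 >= 0 corresponds to the
# circle x1^2 + x2^2 = 1 in the original space (weights hand-picked, not trained).
w, b = (1.0, 1.0, 0.0), -1.0

def classify(z):
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b >= 0 else -1

correct = all(classify(phi(*x)) == y for x, y in points)
print(correct)  # True: a linear boundary in feature space separates the circles
```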