Linear Classifier Foundation

In a supervised learning problem, we are given a set of training examples drawn from an unknown distribution. Our goal is to learn a function that generalises well to unseen data.
The decision boundary for a linear classifier is a hyperplane:

  • 2D space: A line (w₁x₁ + w₂x₂ + b = 0).
  • 3D space: A plane.
  • d-Dimensional space: A hyperplane (w·x + b = 0). The classification rule is:
  • If w·x + b ≥ 0: Class +1
  • If w·x + b < 0: Class -1
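The rule above can be sketched in a few lines of Python; the weight vector w and bias b here are hypothetical values chosen purely for illustration:

```python
# Minimal sketch of the linear classification rule sign(w·x + b).
# w and b are made-up values, not trained parameters.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, b, x):
    """Return +1 if w·x + b >= 0, otherwise -1."""
    return 1 if dot(w, x) + b >= 0 else -1

w, b = [2.0, -1.0], 0.5
print(classify(w, b, [1.0, 1.0]))   # w·x + b = 1.5 >= 0, so +1
print(classify(w, b, [-1.0, 2.0]))  # w·x + b = -3.5 < 0, so -1
```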

Why Maximise the Margin

While many hyperplanes can separate a dataset, SVMs seek the one with the maximum margin.

  • Overfitting Avoidance: A boundary too close to training examples is sensitive to noise. Maximising the distance helps the model generalise to new points.
  • Support Vectors: Not every data point is equal. Only the points closest to the boundary (the Support Vectors) actually "support" or define the hyperplane. Moving other points doesn't change the boundary as long as they stay outside the margin.
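The "only support vectors matter" claim can be checked with a toy experiment. The sketch below uses a deliberately simplified 1D max-margin threshold search (a grid search over thresholds, not a real SVM solver); all data values are made up:

```python
# Sketch: only the closest points determine the max-margin boundary.
# 1D classifier sign(x - t); we grid-search the threshold t that
# maximises the worst-case signed distance min_i y_i * (x_i - t).
def max_margin_threshold(data):
    best_t, best_margin = None, -float("inf")
    for i in range(-400, 401):
        t = i / 100
        margin = min(y * (x - t) for x, y in data)
        if margin > best_margin:
            best_margin, best_t = margin, t
    return best_t

data1 = [(-1.0, -1), (1.0, 1), (3.0, 1)]
data2 = [(-1.0, -1), (1.0, 1), (5.0, 1)]  # far point moved even farther away
# The boundary depends only on the closest points x = -1 and x = 1:
print(max_margin_threshold(data1), max_margin_threshold(data2))  # 0.0 0.0
```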

The Optimisation Problem

To find the best hyperplane, we maximise the perpendicular distance from the hyperplane to the closest training point (the margin).

Perpendicular Distance Formula

The distance from a point x to the hyperplane w·x + b = 0 is:

  d(x) = |w·x + b| / ||w||

where ||w|| is the Euclidean norm (length) of the weight vector.
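As a quick numeric check, the formula can be coded directly; the hyperplane and point below are made up for the example:

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance |w·x + b| / ||w|| from point x to the hyperplane."""
    dot_val = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot_val + b) / norm

# Hypothetical hyperplane 3x1 + 4x2 - 5 = 0 and point (1, 1):
# |3 + 4 - 5| / sqrt(9 + 16) = 2 / 5 = 0.4
d = distance_to_hyperplane([3.0, 4.0], -5.0, [1.0, 1.0])
print(d)  # 0.4
```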

Constraints

For the classifier to be valid, all training examples must be correctly classified:

  yᵢ(w·xᵢ + b) > 0 for all i

where yᵢ ∈ {-1, +1} is the label of training example xᵢ.

Deriving the Primal Form

  • Initial Goal: Maximise the margin, minᵢ |w·xᵢ + b| / ||w||, over w and b.
  • The Rescaling Trick: Rescaling w and b by a constant does not change the hyperplane. We can choose a scale such that for the closest points, yᵢ(w·xᵢ + b) = 1.
  • Simplified Constraint: This implies all other points satisfy yᵢ(w·xᵢ + b) ≥ 1.
  • Final Objective: Maximising 1/||w|| is mathematically identical to minimising ½||w||².

IMPORTANT

The Primal SVM Optimisation Task:

  minimise (over w, b):  ½||w||²

Subject to:

  yᵢ(w·xᵢ + b) ≥ 1 for all i
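On a tiny dataset the primal problem can even be solved by brute force. The sketch below grid-searches (w, b) for a made-up 1D problem whose analytic optimum is w = 1, b = 0; it is a didactic check, not how SVMs are solved in practice (real solvers use quadratic programming):

```python
# Brute-force sketch of the primal: minimise 0.5*w^2
# subject to y_i * (w*x_i + b) >= 1, on a toy 1D dataset.
data = [(-1.0, -1), (1.0, 1)]  # (x, y) pairs; analytic optimum is w = 1, b = 0

def feasible(w, b):
    return all(y * (w * x + b) >= 1 for x, y in data)

best = None
grid = [i / 100 for i in range(-300, 301)]
for w in grid:
    for b in grid:
        if feasible(w, b):
            obj = 0.5 * w * w
            if best is None or obj < best[0]:
                best = (obj, w, b)

obj, w, b = best
print(w, b)  # matches the analytic optimum w = 1.0, b = 0.0
```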

Handling Nonlinear Problems

When data is not linearly separable (e.g., one class forms a circle inside another), we apply a nonlinear transformation φ to map the input data into a higher-dimensional feature space.

  • New Hypothesis: Classify by the sign of w·φ(x) + b, i.e. a linear boundary in the feature space.
  • Linear SVM: If we use φ(x) = x, we have a standard linear SVM.
  • Polynomial Embedding: For example, using a degree-2 polynomial can create a curved boundary (like an ellipse) in the original space to separate circular data.
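A minimal sketch of that idea, using one common (assumed) degree-2 feature map and hand-picked weights rather than a trained SVM:

```python
import math

# Degree-2 feature map phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2),
# one standard embedding assumed here for illustration.
def phi(x1, x2):
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

# Toy circular data: inner circle (radius 0.5, class -1)
# inside an outer ring (radius 2.0, class +1).
points = []
for k in range(8):
    a = 2 * math.pi * k / 8
    points.append(((0.5 * math.cos(a), 0.5 * math.sin(a)), -1))
    points.append(((2.0 * math.cos(a), 2.0 * math.sin(a)), 1))

# In feature space, the *linear* rule z1 + z2 - 1 >= 0 corresponds to the
# circle x1^2 + x2^2 = 1 in the original space (weights hand-picked, not trained).
w, b = (1.0, 1.0, 0.0), -1.0

def classify(z):
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b >= 0 else -1

correct = all(classify(phi(*x)) == y for x, y in points)
print(correct)  # True: a linear boundary in feature space separates the circles
```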