Making Predictions in the Dual Form

In the dual formulation of SVMs, we move away from calculating an explicit weight vector $\mathbf{w}$. Instead, we classify new instances based on their similarity to the training examples.

The Primal vs Dual Prediction

  • Primal Formula: $\hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$.
  • Dual Formula: $\hat{y} = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b\right)$.

Decision Rule

  • If $\sum_{i=1}^{n} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b > 0$, classify as +1.
  • If $\sum_{i=1}^{n} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b < 0$, classify as -1.
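As a concrete sketch of this rule (the function and variable names are my own, and the toy alphas are a hand-worked hard-margin solution, not from the notes):

```python
import numpy as np

def dual_predict(x_new, X_train, y_train, alphas, b):
    """Dual-form SVM prediction: sign of the alpha-weighted
    sum of inner products with the training examples, plus bias."""
    score = sum(a * y * np.dot(x_i, x_new)
                for a, y, x_i in zip(alphas, y_train, X_train)) + b
    return 1 if score > 0 else -1

# Toy 1-D problem: x = 1 labelled +1, x = -1 labelled -1.
# alphas = [0.5, 0.5], b = 0 is the hard-margin dual solution here.
X_train = np.array([[1.0], [-1.0]])
y_train = np.array([1, -1])
alphas = np.array([0.5, 0.5])
b = 0.0

print(dual_predict(np.array([2.0]), X_train, y_train, alphas, b))   # 1
print(dual_predict(np.array([-0.5]), X_train, y_train, alphas, b))  # -1
```

Note that no weight vector is ever formed: the new point is compared against each training point through the inner product.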

Tip

The dual form is powerful because it allows us to work with kernels $K(\mathbf{x}, \mathbf{z})$, which can represent infinite-dimensional feature spaces that would be impossible to compute directly in the primal form.

The Role of Support Vectors

A critical property of the dual form is sparsity. We do not actually need to sum over all training examples.

KKT Condition and Sparsity

Based on the Karush-Kuhn-Tucker (KKT) conditions:

  1. Non-Support Vectors: For most points, $\alpha_i = 0$. These points do not affect the decision boundary.
  2. Support Vectors (S): Only points where $\alpha_i > 0$ contribute to the prediction.
  3. The Margin: Support vectors satisfy the condition $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$, meaning they lie exactly on the margin.

Optimised Prediction Formula

$\hat{y} = \operatorname{sign}\left(\sum_{i \in S} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b\right)$

Where $S$ is the set of indices of the support vectors.
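A small numerical illustration of this sparsity (the data and dual solution below are hypothetical): dropping the zero-$\alpha$ terms leaves the decision value unchanged.

```python
import numpy as np

def dual_score(x_new, X, y, alphas, b, idx=None):
    """Decision value; if idx is given, sum only over those indices."""
    if idx is None:
        idx = range(len(alphas))
    return sum(alphas[i] * y[i] * np.dot(X[i], x_new) for i in idx) + b

# Hypothetical dual solution where only two of four points are support vectors.
X = np.array([[1.0], [-1.0], [3.0], [-4.0]])
y = np.array([1, -1, 1, -1])
alphas = np.array([0.5, 0.5, 0.0, 0.0])  # alpha = 0 for non-support vectors
b = 0.0

S = [i for i, a in enumerate(alphas) if a > 0]  # support-vector indices
x_new = np.array([2.0])
full = dual_score(x_new, X, y, alphas, b)
sparse = dual_score(x_new, X, y, alphas, b, idx=S)
print(full == sparse)  # True: zero-alpha terms contribute nothing
```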

Calculating the Bias Term

The bias $b$ is not solved directly in the dual optimisation but is recovered using the support vectors.

Step-by-Step Derivation

  1. Pick any support vector $\mathbf{x}_s$, where $\alpha_s > 0$.
  2. Since it's on the margin: $y_s(\mathbf{w}^\top \mathbf{x}_s + b) = 1$.
  3. Substitute the dual form of $\mathbf{w}$: $y_s\left(\sum_{i \in S} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x}_s \rangle + b\right) = 1$.
  4. Multiply both sides by $y_s$ (since $y_s^2 = 1$): $\sum_{i \in S} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x}_s \rangle + b = y_s$.
  5. Solve for $b$: $b = y_s - \sum_{i \in S} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x}_s \rangle$.

Important

To ensure numerical stability, it is standard practice to calculate $b$ for all support vectors and take the average:

$b = \frac{1}{|S|} \sum_{s \in S} \left( y_s - \sum_{i \in S} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x}_s \rangle \right)$

Where $|S|$ is the total number of support vectors.
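The recovery-and-average procedure might be sketched as follows (the toy data and hand-worked dual solution are assumptions for illustration, not from the notes):

```python
import numpy as np

def bias_from_sv(s, X, y, alphas, S):
    """b recovered from one support vector s:
    b = y_s - sum_{i in S} alpha_i y_i <x_i, x_s>."""
    return y[s] - sum(alphas[i] * y[i] * np.dot(X[i], X[s]) for i in S)

# Toy hard-margin problem (x = +1 labelled +1, x = -1 labelled -1);
# the dual solution is alphas = [0.5, 0.5], both points support vectors.
X = np.array([[1.0], [-1.0]])
y = np.array([1, -1])
alphas = np.array([0.5, 0.5])
S = [i for i, a in enumerate(alphas) if a > 0]

# Average over all support vectors for numerical stability.
b = np.mean([bias_from_sv(s, X, y, alphas, S) for s in S])
print(b)  # 0.0 for this symmetric problem
```

In exact arithmetic every support vector gives the same $b$; averaging only matters once the alphas come from an approximate numerical solver.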

Kernels as Similarity Functions

A kernel represents the inner product of two vectors in a high-dimensional feature space:

$K(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle$

Mercer’s Condition

To be a valid kernel, a function must satisfy Mercer’s Condition:

  • Symmetry: $K(\mathbf{x}, \mathbf{z}) = K(\mathbf{z}, \mathbf{x})$.
  • Positive Semi-definite: The Gram matrix $\mathbf{K}$ (where $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$) must satisfy $\mathbf{c}^\top \mathbf{K} \mathbf{c} \ge 0$ for any vector $\mathbf{c}$.
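Mercer's condition must hold for every finite sample, so a sanity check on a handful of points is easy to sketch (the function names and sample points are my own):

```python
import numpy as np

def is_psd_gram(kernel, points, tol=1e-10):
    """Empirical Mercer check on a finite sample: build the Gram
    matrix, then test symmetry and non-negative eigenvalues."""
    n = len(points)
    G = np.array([[kernel(points[i], points[j]) for j in range(n)]
                  for i in range(n)])
    symmetric = np.allclose(G, G.T)
    psd = np.all(np.linalg.eigvalsh(G) >= -tol)
    return bool(symmetric and psd)

pts = [np.array([0.0]), np.array([1.0]), np.array([2.5])]

linear = lambda x, z: float(np.dot(x, z))
print(is_psd_gram(linear, pts))  # True: the linear kernel is valid

not_a_kernel = lambda x, z: float(-np.dot(x, z))  # negated inner product
print(is_psd_gram(not_a_kernel, pts))  # False: Gram matrix is not PSD
```

Passing this check on one sample does not prove validity, but failing it on any sample disproves it.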

Kernel Composition Rules

You can build complex kernels from simpler ones ($K_1$ and $K_2$ below are valid kernels). Valid operations include:

  • Addition: $K(\mathbf{x}, \mathbf{z}) = K_1(\mathbf{x}, \mathbf{z}) + K_2(\mathbf{x}, \mathbf{z})$.
  • Scaling: $K(\mathbf{x}, \mathbf{z}) = c\,K_1(\mathbf{x}, \mathbf{z})$ (where $c > 0$).
  • Exponentiation: $K(\mathbf{x}, \mathbf{z}) = \exp\!\left(K_1(\mathbf{x}, \mathbf{z})\right)$.
  • Product: $K(\mathbf{x}, \mathbf{z}) = K_1(\mathbf{x}, \mathbf{z}) \cdot K_2(\mathbf{x}, \mathbf{z})$.
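A quick numerical check of these rules (the two base kernels, sample points, and tolerance are my own choices): each composed Gram matrix should remain positive semi-definite.

```python
import numpy as np

# Two simple kernels on scalars, assumed valid for illustration.
k1 = lambda x, z: x * z             # linear
k2 = lambda x, z: (x * z + 1) ** 2  # degree-2 polynomial

pts = [0.0, 1.0, -2.0, 3.0]
gram = lambda k: np.array([[k(a, b) for b in pts] for a in pts])

def min_eig(G):
    """Smallest eigenvalue of a symmetric matrix; >= 0 means PSD."""
    return np.linalg.eigvalsh(G).min()

G1, G2 = gram(k1), gram(k2)
# Each composition rule, applied at the Gram-matrix level:
for G in (G1 + G2,     # addition
          3.0 * G1,    # scaling by c > 0
          np.exp(G1),  # elementwise exp corresponds to exp(K1)
          G1 * G2):    # elementwise product corresponds to K1 * K2
    print(min_eig(G) >= -1e-9)
```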

Common Kernel Functions

Linear Kernel

  • Formula: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$.
  • Intuition: No transformation; equivalent to standard linear SVM.

Polynomial Kernel

  • Formula: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + c)^d$.
  • Intuition: Maps data into a space of polynomial combinations of features.

Gaussian / Radial Basis Function (RBF) Kernel

  • Formula: $K(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right)$.
  • Intuition: Measures closeness; the value is 1 if $\mathbf{x} = \mathbf{z}$ and drops toward 0 as they move apart.
  • Technical Note: The RBF kernel corresponds to an infinite-dimensional feature mapping $\phi$, which can be shown via a Taylor series expansion of the exponential.
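The three kernels can be sketched directly (the parameter defaults $c$, $d$, $\sigma$ here are arbitrary illustrative choices):

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, c=1.0, d=2):
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.0, 1.0])
print(linear_kernel(x, z))      # 2.0
print(polynomial_kernel(x, z))  # (2 + 1)^2 = 9.0
print(rbf_kernel(x, x))         # 1.0: identical inputs have similarity 1
print(rbf_kernel(x, z) < 1.0)   # True: similarity decays with distance
```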

Kernels for Structured Data

Kernels allow SVMs to handle data that isn’t naturally represented as a fixed-length vector of numbers.

Sets

For two subsets $A_1, A_2 \subseteq D$ of a domain $D$:

  • Kernel: $K(A_1, A_2) = 2^{|A_1 \cap A_2|}$, the number of subsets common to $A_1$ and $A_2$.
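Assuming the set kernel counts common subsets, $K(A_1, A_2) = 2^{|A_1 \cap A_2|}$, a brute-force check of the closed form against the feature-space inner product:

```python
from itertools import combinations

def all_subsets(s):
    """Every subset of s, including the empty set."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def set_kernel_bruteforce(A, B):
    """Inner product in the feature space indexed by all subsets:
    phi_U(A) = 1 iff U is a subset of A, so the product counts
    the subsets contained in both A and B."""
    return sum(1 for U in all_subsets(A) if U <= B)

A = {1, 2, 3}
B = {2, 3, 5}
print(set_kernel_bruteforce(A, B))  # 4: the subsets of {2, 3}
print(2 ** len(A & B))              # 4: closed form 2^{|A ∩ B|}
```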

Strings (All-Subsequence Kernel)

  • Approach: Compares strings based on the number of shared subsequences.
  • Complexity: Direct calculation is exponential in string length, but Dynamic Programming reduces it to $O(|s|\,|t|)$ for strings $s$ and $t$.
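One standard formulation of the quadratic-time dynamic programme (the recurrence is my reconstruction; the notes do not spell it out):

```python
def all_subsequence_kernel(s, t):
    """Counts pairs of matching (possibly non-contiguous) subsequence
    occurrences in s and t, including the empty subsequence.
    Runs in O(|s| * |t|) via dynamic programming."""
    n, m = len(s), len(t)
    # K[i][j] = kernel value for prefixes s[:i], t[:j];
    # an empty prefix matches only the empty subsequence, value 1.
    K = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Inclusion-exclusion over dropping s[i-1] and/or t[j-1] ...
            K[i][j] = K[i - 1][j] + K[i][j - 1] - K[i - 1][j - 1]
            # ... plus new matches pairing s[i-1] with t[j-1].
            if s[i - 1] == t[j - 1]:
                K[i][j] += K[i - 1][j - 1]
    return K[n][m]

print(all_subsequence_kernel("a", "a"))    # 2: "" and "a"
print(all_subsequence_kernel("aa", "a"))   # 3: "", plus two occurrences of "a"
print(all_subsequence_kernel("ab", "ab"))  # 4: "", "a", "b", "ab"
```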

Trees (All-Subtree Kernel)

  • Approach: Counts the number of common subtrees between two trees.
  • Efficiency: Computed in $O(n_1 n_2)$ time, for trees with $n_1$ and $n_2$ nodes, using Dynamic Programming.