Lecture 8 · Chapter 13 · 17 April

Support Vector Machines &
Kernel Methods

SVMs find the hyperplane that maximizes the gap between classes. This lecture derives that idea geometrically, formulates it as a constrained optimization problem, solves it via Lagrangian duality, and extends it to nonlinear boundaries through the kernel trick.

Authors: Seixas Junior, Mahmud & Koren
Date: 17 April

00 · Motivation

From Linear Classifiers to SVMs

A simple linear classifier $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ separates the space with a hyperplane. In $d$ dimensions, this boundary is:

1D

A threshold point: $\hat{y} = +1$ if $x \geq t$, else $-1$.

2D

A line: $w_1x_1 + w_2x_2 + b = 0$. One side is $+1$, the other $-1$.

$d$D

A hyperplane: $\mathbf{w}^T\mathbf{x} + b = 0$. $\mathbf{w}$ is its normal vector.

The Problem with Simple Separators

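Many hyperplanes may separate the training data equally well. Which one should we choose? A separator that passes close to some training points is sensitive to noise: a small perturbation of those points flips their predicted label, so it tends to generalize poorly. SVMs resolve the ambiguity by choosing the maximum-margin separator.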

01 · Geometry

Margin and Distance to the Hyperplane

The margin is the width of the band around the decision boundary that contains no training points: twice the perpendicular distance from the boundary to the nearest training point. A larger margin leaves more room for noise, so the classifier tends to generalize better.

The perpendicular distance from any point $\mathbf{A}$ to the hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ is:

Point-to-Hyperplane Distance $$d(\mathbf{A}, \mathcal{H}) = \frac{|\mathbf{w}^T\mathbf{A} + b|}{\|\mathbf{w}\|}$$

Why the Margin is $2/\|\mathbf{w}\|$

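Let $\mathbf{x}^+$ be on $\mathbf{w}^T\mathbf{x} + b = +1$ and $\mathbf{x}^-$ on $\mathbf{w}^T\mathbf{x} + b = -1$. Subtracting gives $\mathbf{w}^T(\mathbf{x}^+ - \mathbf{x}^-) = 2$. The perpendicular distance between the two margin hyperplanes is the projection of $\mathbf{x}^+ - \mathbf{x}^-$ onto the unit normal $\mathbf{w}/\|\mathbf{w}\|$, which is $\frac{\mathbf{w}^T(\mathbf{x}^+ - \mathbf{x}^-)}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$.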
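As a quick numeric check of the two formulas above, here is a minimal sketch in plain Python. The values of `w`, `b`, `x_plus` and `x_minus` are illustrative assumptions, not taken from the lecture.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def distance_to_hyperplane(point, w, b):
    """Perpendicular distance |w.A + b| / ||w|| from a point to w.x + b = 0."""
    return abs(dot(w, point) + b) / math.sqrt(dot(w, w))

def margin_width(w):
    """Distance 2 / ||w|| between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

# Illustrative values (assumed for this sketch).
w, b = (0.5, 0.5), 0.0
x_plus, x_minus = (1.0, 1.0), (-1.0, -1.0)  # lie on the +1 and -1 margin hyperplanes

d_plus = distance_to_hyperplane(x_plus, w, b)
d_minus = distance_to_hyperplane(x_minus, w, b)
width = margin_width(w)
```

Each point sits at distance $1/\|\mathbf{w}\|$ from the boundary, so `d_plus + d_minus` matches `width`.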

02 · Hard Margin SVM

Hard-Margin SVM — Primal Form

Assume the data is linearly separable. We want to find $(\mathbf{w}, b)$ that maximizes the margin while correctly classifying all training points.

Hard-Margin SVM — Primal $$\min_{\mathbf{w},\,b}\;\frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{subject to} \qquad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1,\;\forall i$$

Minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ is equivalent to maximizing the margin $\frac{2}{\|\mathbf{w}\|}$
The constraint $y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1$ ensures all points lie on or outside the correct margin boundary
The $\frac{1}{2}$ factor is for mathematical convenience when differentiating

Support Vectors

The training points that satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$ exactly — those that sit right on the margin boundary — are called support vectors. They are the only points that determine the decision boundary. Remove any non-support-vector and the solution doesn't change.
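To make the constraint and the definition of support vectors concrete, the sketch below checks a candidate $(\mathbf{w}, b)$ on a toy data set. The data and the solution are assumptions chosen for illustration; the solution is stated rather than solved for.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Toy separable data (illustrative). Labels are +1 / -1.
X = [(1.0, 1.0), (3.0, 3.0), (-1.0, -1.0), (-2.0, -3.0)]
y = [+1, +1, -1, -1]

# Maximum-margin solution for this toy set, stated here rather than solved for.
w, b = (0.5, 0.5), 0.0

# Functional margins y_i (w.x_i + b); the hard-margin constraint needs each >= 1.
margins = [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]
feasible = all(m >= 1.0 - 1e-12 for m in margins)

# Support vectors meet the constraint with equality.
support_idx = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-9]
```

Only the two points closest to the boundary end up in `support_idx`; the other two satisfy the constraint with room to spare.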

03 · Optimization

Lagrangian Construction & Dual Form

Introduce a Lagrange multiplier $\alpha_i \geq 0$ for each constraint:

Primal Lagrangian $$L(\mathbf{w},b,\boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N}\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i+b)-1\right]$$

KKT Conditions

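Setting the partial derivatives of $L$ with respect to the primal variables $\mathbf{w}$ and $b$ to zero expresses $\mathbf{w}$ in terms of the multipliers and adds one constraint on $\boldsymbol{\alpha}$: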

Stationarity Conditions $$\frac{\partial L}{\partial \mathbf{w}} = 0 \implies \mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i \qquad \frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^{N}\alpha_i y_i = 0$$

Substituting back into $L$ gives the dual objective — a function of $\boldsymbol{\alpha}$ only:

Dual Form (maximize over $\boldsymbol{\alpha}$) $$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$$ $$\text{subject to: } \alpha_i \geq 0,\quad \sum_{i=1}^{N}\alpha_i y_i = 0$$

Why the Dual Matters

The dual depends on training data only through dot products $\mathbf{x}_i \cdot \mathbf{x}_j$. This is the entry point for the kernel trick. Also, from the KKT complementarity condition: $\alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i+b)-1]=0$, so $\alpha_i > 0$ only for support vectors.
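The stationarity and complementarity conditions can be verified by hand on a two-point problem. The sketch below does so in plain Python; the data and the candidate multipliers `alpha` are illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

# Two-point toy problem (assumed for this sketch).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [+1, -1]
alpha = [0.25, 0.25]  # candidate multipliers
b = 0.0

# Stationarity: w = sum_i alpha_i y_i x_i and sum_i alpha_i y_i = 0.
w = tuple(sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2))
balance = sum(a * yi for a, yi in zip(alpha, y))

# Complementary slackness: alpha_i * [y_i (w.x_i + b) - 1] = 0 for every i.
slackness = [a * (yi * (dot(w, xi) + b) - 1.0) for a, yi, xi in zip(alpha, y, X)]

primal_value = 0.5 * dot(w, w)
dual_value = dual_objective(alpha, X, y)
```

Here the primal and dual objective values coincide, which is what strong duality predicts at the optimum of this convex problem.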

04 · Decision Function

Reconstructing the Decision Function

Recovering $\mathbf{w}$ and $b$, Then Classifying $$\mathbf{w}^* = \sum_{i=1}^{N}\alpha_i^* y_i \mathbf{x}_i \qquad b^* = y_k - (\mathbf{w}^*)^T\mathbf{x}_k \quad (\text{any support vector } k \text{ with } \alpha_k^* > 0)$$ $$\hat{y}(\mathbf{x}) = \text{sign}\!\left(\sum_{i=1}^{N}\alpha_i^* y_i (\mathbf{x}_i \cdot \mathbf{x}) + b^*\right)$$

Only support vectors (those with $\alpha_i^* > 0$) contribute to the decision function. All other training points can be discarded after training.
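A minimal sketch of the reconstruction, using dot products only. The helper names `fit_bias` and `predict`, the data, and the multipliers are assumptions for illustration; the multipliers are stated, not produced by a solver.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def fit_bias(alpha, X, y):
    """b* = y_k - sum_i alpha_i y_i (x_i . x_k) for the first support vector k."""
    k = next(i for i, a in enumerate(alpha) if a > 0)
    return y[k] - sum(a * yi * dot(xi, X[k]) for a, yi, xi in zip(alpha, y, X))

def predict(x, alpha, X, y, b):
    """sign(sum_i alpha_i y_i (x_i . x) + b), skipping terms with alpha_i = 0."""
    score = sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X) if a > 0) + b
    return 1 if score >= 0 else -1

# Toy data and multipliers (assumed); only points 0 and 2 are support vectors.
X = [(1.0, 1.0), (3.0, 3.0), (-1.0, -1.0), (-2.0, -3.0)]
y = [+1, +1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]

b = fit_bias(alpha, X, y)
train_labels = [predict(x, alpha, X, y, b) for x in X]
new_labels = [predict(x, alpha, X, y, b) for x in [(2.0, 0.5), (-1.0, 0.0)]]

# Dropping the non-support vectors leaves every prediction unchanged.
sv = [i for i, a in enumerate(alpha) if a > 0]
X_sv, y_sv, a_sv = [X[i] for i in sv], [y[i] for i in sv], [alpha[i] for i in sv]
train_labels_sv = [predict(x, a_sv, X_sv, y_sv, b) for x in X]
```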

05 · Realistic Setting

Soft-Margin SVM

Real data is rarely linearly separable. The soft-margin SVM allows some points to violate the margin constraints, penalizing violations proportionally.

Soft-Margin Primal — Introducing Slack Variables $\xi_i \geq 0$ $$\min_{\mathbf{w},b,\boldsymbol{\xi}}\;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i \qquad \text{s.t.}\quad y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1-\xi_i,\;\xi_i \geq 0$$

$\xi_i = 0$

Point correctly classified, on or beyond the margin. No penalty.

$0 < \xi_i < 1$

Inside the margin but on the correct side. Margin violation penalized by $C\xi_i$.

$\xi_i \geq 1$

On the decision boundary ($\xi_i = 1$) or misclassified ($\xi_i > 1$). Penalty $C\xi_i \geq C$.
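For a fixed $(\mathbf{w}, b)$, the smallest slack that satisfies the constraint is $\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$. The sketch below computes it for three illustrative points, one in each regime; `w`, `b` and the points are assumed values.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def slack(x, y, w, b):
    """Smallest xi >= 0 satisfying y (w.x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def regime(xi):
    """Name the three cases listed above."""
    if xi == 0.0:
        return "no violation"
    if xi < 1.0:
        return "inside the margin, correct side"
    return "on the boundary or misclassified"

w, b = (0.5, 0.5), 0.0  # assumed separator
points = [((3.0, 3.0), +1), ((0.5, 0.5), +1), ((-1.0, -1.0), +1)]

slacks = [slack(x, y, w, b) for x, y in points]
regimes = [regime(xi) for xi in slacks]
```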

06 · Hyperparameter

The Role of Parameter C

$C$ controls the trade-off between maximizing the margin and minimizing constraint violations.

⬆️

Large C

Heavily penalize misclassifications. Prioritize correctness over margin width. Narrower margin, harder boundary. Risk: overfitting to noise.

⬇️

Small C

Allow more margin violations. Prioritize a wider margin. Potentially better generalization on noisy data. Risk: underfitting.
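The trade-off can be seen by evaluating the soft-margin objective for two hand-picked candidate separators on 1-D data. This is only a comparison of two assumed candidates under different values of $C$, not an optimizer; the data, `narrow` and `wide` are illustrative.

```python
def soft_margin_objective(w, b, X, y, C):
    """1/2 w^2 + C * sum_i xi_i for 1-D inputs, with xi_i = max(0, 1 - y_i (w x_i + b))."""
    total_slack = sum(max(0.0, 1.0 - yi * (w * xi + b)) for xi, yi in zip(X, y))
    return 0.5 * w * w + C * total_slack

X = [-3.0, -2.0, 0.5, 1.0, 2.0, 3.0]
y = [-1, -1, -1, +1, +1, +1]  # the negative at x = 0.5 plays the role of a noisy point

narrow = (4.0, -3.0)            # separates every point; margin width 2/|w| = 0.5
wide = (2.0 / 3.0, 1.0 / 3.0)   # margin width 3.0, but x = 0.5 violates the margin

small_C, large_C = 0.1, 100.0
narrow_small = soft_margin_objective(*narrow, X, y, small_C)
wide_small = soft_margin_objective(*wide, X, y, small_C)
narrow_large = soft_margin_objective(*narrow, X, y, large_C)
wide_large = soft_margin_objective(*wide, X, y, large_C)
```

With the small $C$ the wide-margin candidate has the lower objective despite its violation; with the large $C$ the narrow, violation-free candidate wins.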

Tuning C

$C$ is a hyperparameter — tune via cross-validation, typically on a log scale: $C \in \{10^{-3}, 10^{-2}, \ldots, 10^3\}$.

07 · Dual

Soft-Margin Dual

The Lagrangian for the soft-margin adds multipliers $\mu_i \geq 0$ for the slack constraints. After applying KKT conditions, the dual form is almost identical to the hard-margin case, but with a box constraint on $\alpha_i$:

Soft-Margin Dual $$\max_{\boldsymbol{\alpha}}\;\sum_{i=1}^N\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j(\mathbf{x}_i\cdot\mathbf{x}_j) \qquad \text{s.t.}\quad 0 \leq \alpha_i \leq C,\;\sum_i \alpha_i y_i = 0$$

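The only difference from the hard-margin dual: $\alpha_i$ is now bounded above by $C$ (the box constraint). The decision function is identical. Support vectors are those with $0 < \alpha_i \leq C$: points with $0 < \alpha_i < C$ lie exactly on the margin ($\xi_i = 0$), while points with $\alpha_i = C$ are the ones that can have $\xi_i > 0$. Compute $b^*$ from a support vector with $0 < \alpha_k < C$.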

08 · Kernel Trick

The Kernel Trick

The SVM dual depends on data only through dot products $\mathbf{x}_i \cdot \mathbf{x}_j$. The kernel trick replaces these with a kernel function that implicitly computes dot products in a higher-dimensional space — without ever explicitly computing the feature map $\phi(\mathbf{x})$.

Kernel Function

$k(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{z})$ for some (possibly infinite-dimensional) feature map $\phi$. We can compute $k$ directly without knowing $\phi$.

Why This Is Powerful

A polynomial kernel of degree 2 in 2D implicitly works in a 6-dimensional feature space. An RBF kernel works in an infinite-dimensional space. But computing the kernel is just a single dot product operation — $O(d)$ — regardless of the implicit dimension.

Polynomial Kernel — Example

Degree-2 Polynomial Kernel ($d=2$) $$k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^2 = (x_1y_1 + x_2y_2 + 1)^2$$ $$\phi(\mathbf{x}) = [1,\,\sqrt{2}x_1,\,\sqrt{2}x_2,\,\sqrt{2}x_1x_2,\,x_1^2,\,x_2^2]^T$$

Computing $k(\mathbf{x},\mathbf{y})$ is just one scalar computation, but it equals $\phi(\mathbf{x}) \cdot \phi(\mathbf{y})$ — the dot product of 6-dimensional vectors. We get the expressiveness of 6D for the cost of 2D.
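The identity $k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{y})$ can be checked numerically. A minimal sketch, with the input vectors chosen arbitrarily for illustration:

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (x.z + 1)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1] + 1.0) ** 2

def phi(x):
    """Explicit 6-D feature map matching poly2_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], r2 * x[0] * x[1], x[0] ** 2, x[1] ** 2]

x, z = (1.0, 2.0), (3.0, -1.0)  # arbitrary test vectors

implicit = poly2_kernel(x, z)                         # one 2-D dot product, then a square
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # 6-D dot product
```

Both routes give the same number; only the second ever builds the 6-dimensional vectors.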

09 · Common Kernels

Kernel Functions in Practice

Polynomial

$k(\mathbf{x},\mathbf{z}) = (\mathbf{x}^T\mathbf{z} + 1)^q$

Degree $q$ controls boundary complexity; $q=1$ gives a linear SVM. Often used for text classification and image recognition.

RBF / Gaussian

$k(\mathbf{x},\mathbf{z}) = e^{-\gamma\|\mathbf{x}-\mathbf{z}\|^2}$

Infinite-dimensional feature space. Decays with distance, so it measures local similarity. $\gamma$ controls the width: a larger $\gamma$ gives a narrower kernel and a more flexible boundary. Most popular kernel in practice.

Kernel Engineering

Strings, graphs, images

Kernels can be defined for any structured data type — as long as they compute a valid similarity (positive semi-definite). Enables SVM for non-vectorial inputs.
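The two vector kernels above are short to write down. The sketch below implements both and builds a Gram matrix on three illustrative points to show how the RBF value decays with distance and with larger $\gamma$; the function names and data are assumptions.

```python
import math

def polynomial_kernel(x, z, q):
    """(x.z + 1)^q."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** q

def rbf_kernel(x, z, gamma):
    """exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(X, kernel):
    """Matrix of kernel values k(x_i, x_j) over all pairs of points."""
    return [[kernel(a, b) for b in X] for a in X]

X = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]  # illustrative points on a line

K_wide = gram_matrix(X, lambda a, b: rbf_kernel(a, b, gamma=0.5))
K_narrow = gram_matrix(X, lambda a, b: rbf_kernel(a, b, gamma=2.0))
linear_value = polynomial_kernel((1.0, 2.0), (3.0, -1.0), q=1)
```

The diagonal of either Gram matrix is 1, entries shrink as points move apart, and the larger $\gamma$ shrinks them faster.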

Combining Kernels

Valid Kernel Combinations $$k = c\,k_1 \quad\text{or}\quad k_1+k_2 \quad\text{or}\quad k_1 \cdot k_2 \quad\text{or}\quad \sum_i \eta_i k_i \quad\text{or}\quad \sum_i \eta_i(\mathbf{x}|\theta)k_i$$

Any scaling by a positive constant ($c > 0$), sum, product, or linear combination with nonnegative weights ($\eta_i \geq 0$) of valid kernels is itself a valid kernel. The adaptive kernel $\sum_i \eta_i(\mathbf{x}|\theta) k_i$ weights kernels as a function of the input, enabling localized flexibility.
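The first three combinations can be written as small higher-order functions. In this sketch the helper names `scale`, `add` and `multiply` are assumptions; the final check uses the fact that the product of two RBF kernels is an RBF kernel whose $\gamma$ is the sum of the two.

```python
import math

def rbf(gamma):
    """Return an RBF kernel exp(-gamma * ||x - z||^2) with fixed gamma."""
    def k(x, z):
        return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))
    return k

def scale(c, k1):
    """c * k1, valid for c > 0."""
    def k(x, z):
        return c * k1(x, z)
    return k

def add(k1, k2):
    """k1 + k2."""
    def k(x, z):
        return k1(x, z) + k2(x, z)
    return k

def multiply(k1, k2):
    """k1 * k2."""
    def k(x, z):
        return k1(x, z) * k2(x, z)
    return k

x, z = (1.0, 2.0), (0.0, 0.0)  # arbitrary test vectors

product_value = multiply(rbf(0.5), rbf(1.5))(x, z)
single_value = rbf(2.0)(x, z)
mixed_value = add(scale(3.0, rbf(0.5)), rbf(1.5))(x, z)
```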

10 · Kernelized SVM

Kernelized SVM Dual

Simply replace every dot product $\mathbf{x}_i \cdot \mathbf{x}_j$ in the soft-margin dual with $k(\mathbf{x}_i, \mathbf{x}_j)$:

Kernelized SVM Dual $$\max_{\boldsymbol{\alpha}}\;\sum_{i}\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, k(\mathbf{x}_i,\mathbf{x}_j), \quad 0 \leq \alpha_i \leq C,\;\sum_i\alpha_i y_i = 0$$ $$\hat{y}(\mathbf{x}) = \text{sign}\!\left(\sum_{i=1}^{N}\alpha_i^* y_i\, k(\mathbf{x}_i,\mathbf{x}) + b^*\right)$$

Nonlinear Boundary

In the original input space, the decision boundary is now a (possibly curved) hypersurface. In the implicit feature space defined by $\phi$, it remains a linear hyperplane: the SVM stays linear, only the space has changed. This is why SVMs with a universal kernel such as the RBF can approximate a very wide class of decision boundaries.
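The XOR problem is a standard illustration: no line separates it in 2D, but the degree-2 polynomial kernel does. In the sketch below the multipliers $\alpha_i = 1/8$ and $b = 0$ are stated rather than solved for (they assume $C \geq 1/8$); the code checks that every training point then sits exactly on the margin.

```python
def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (x.z + 1)^2."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2

# XOR-style data: the label is the sign of x1 * x2.
X = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
y = [+1, +1, -1, -1]

# Assumed dual solution for this data set; stated here, not solved for.
alpha = [0.125] * 4
b = 0.0

def decision_value(x):
    """sum_i alpha_i y_i k(x_i, x) + b."""
    return sum(a * yi * poly2_kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

def predict(x):
    return 1 if decision_value(x) >= 0 else -1

train_margins = [yi * decision_value(xi) for xi, yi in zip(X, y)]
new_values = [decision_value((0.5, 0.5)), decision_value((2.0, -3.0))]
```

For this particular solution the decision value simplifies to $x_1 x_2$, so the boundary in input space is the pair of coordinate axes, a curved (non-hyperplane) surface, while it stays a hyperplane in the 6-D feature space.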

11 · Extensions

Multiclass SVM & One-Class SVM

Multiclass Strategies

One-vs-Rest (OvR, also called One-vs-All): train $K$ binary SVMs, one per class against all others. Classify by the class with the highest confidence score.
One-vs-One (OvO): train $\frac{K(K-1)}{2}$ binary SVMs, one per pair of classes. Classify by majority vote.
Single multiclass optimization: minimize $\frac{1}{2}\sum_k \|\mathbf{w}_k\|^2 + C\sum_i\sum_k \xi_i^k$ directly. Harder to solve but theoretically cleaner.
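The first two strategies differ only in how binary scores are combined at prediction time. The sketch below shows both rules with hypothetical, already-trained linear scorers for $K = 3$ classes; the weight vectors are made up for illustration, and the pairwise scorers are built from them as a stand-in for separately trained pair classifiers.

```python
import math
from collections import Counter
from itertools import combinations

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical per-class weight vectors (bias 0 for brevity), 120 degrees apart.
angles = {0: 0.0, 1: 2 * math.pi / 3, 2: 4 * math.pi / 3}
W = {k: (math.cos(t), math.sin(t)) for k, t in angles.items()}

def predict_ovr(x):
    """One-vs-Rest: pick the class whose scorer is most confident."""
    return max(W, key=lambda k: dot(W[k], x))

# One-vs-One: one scorer per unordered pair of classes, K(K-1)/2 in total.
pairs = list(combinations(sorted(W), 2))

def predict_ovo(x):
    """One-vs-One: each pair scorer votes for one of its two classes."""
    votes = Counter()
    for a, b in pairs:
        w_ab = tuple(p - q for p, q in zip(W[a], W[b]))  # stand-in pair scorer
        votes[a if dot(w_ab, x) >= 0 else b] += 1
    return votes.most_common(1)[0][0]

queries = [(1.0, 0.1), (-1.0, 2.0)]
ovr_labels = [predict_ovr(x) for x in queries]
ovo_labels = [predict_ovo(x) for x in queries]
```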

One-Class SVM — Anomaly Detection

Train on only positive/normal examples. Learn the tightest hypersphere enclosing the data in feature space. Points outside the sphere are anomalies.

One-Class SVM Objective $$\min_{R,\omega,\boldsymbol{\xi}}\;R^2 + C\sum_i\xi_i \qquad \text{s.t.}\quad \|\phi(\mathbf{x}_i)-\omega\|^2 \leq R^2 + \xi_i,\;\xi_i \geq 0$$

$\omega$ is the center, $R$ is the radius in feature space. A test point $\mathbf{x}$ is normal if $\|\phi(\mathbf{x})-\omega\|^2 \leq R^2$, otherwise it is flagged as an anomaly.
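The test $\|\phi(\mathbf{x})-\omega\|^2 \leq R^2$ can be evaluated with kernel values only, under the assumption that the center is a weighted combination of mapped training points, $\omega = \sum_i \beta_i \phi(\mathbf{x}_i)$. Then $\|\phi(\mathbf{x})-\omega\|^2 = k(\mathbf{x},\mathbf{x}) - 2\sum_i \beta_i k(\mathbf{x}_i,\mathbf{x}) + \sum_{i,j}\beta_i\beta_j k(\mathbf{x}_i,\mathbf{x}_j)$. In the sketch below the weights `beta` and the radius `R` are hypothetical values, not the result of solving the objective, and a linear kernel is used so the answer can be checked by hand.

```python
def linear_kernel(x, z):
    """Plain dot product, so feature space equals input space."""
    return sum(a * b for a, b in zip(x, z))

def sq_distance_to_center(x, X, beta, kernel):
    """||phi(x) - omega||^2 for omega = sum_i beta_i phi(x_i), via kernel values only."""
    cross = sum(bi * kernel(xi, x) for bi, xi in zip(beta, X))
    center = sum(bi * bj * kernel(xi, xj)
                 for bi, xi in zip(beta, X) for bj, xj in zip(beta, X))
    return kernel(x, x) - 2.0 * cross + center

def is_anomaly(x, X, beta, R, kernel):
    """Flag x when it falls outside the sphere of radius R around omega."""
    return sq_distance_to_center(x, X, beta, kernel) > R * R

X = [(0.0, 0.0), (2.0, 0.0)]
beta = [0.5, 0.5]  # hypothetical weights; the center is (1, 0) in input space
R = 1.5            # hypothetical radius

d_far = sq_distance_to_center((1.0, 3.0), X, beta, linear_kernel)
d_near = sq_distance_to_center((1.0, 1.0), X, beta, linear_kernel)
flags = [is_anomaly(x, X, beta, R, linear_kernel) for x in [(1.0, 3.0), (1.0, 1.0)]]
```

Swapping `linear_kernel` for another valid kernel changes the shape of the enclosed region in input space without changing this code.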

Next Lecture — 28 April

Combining Methods (Ensembles). No single model is best for all problems (No Free Lunch Theorem). We learn how to combine many weak learners into a strong one — through voting, bagging, boosting, stacking, and mixture of experts.