Support Vector Machines & Kernel Methods
SVMs find the hyperplane that maximizes the gap between classes. We derive this geometrically, formulate it as a constrained optimization problem, solve it via Lagrangian duality, and extend it to nonlinear boundaries through the kernel trick.
From Linear Classifiers to SVMs
A simple linear classifier $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$ separates the space with a hyperplane. In $d$ dimensions, this boundary is:
1D
A threshold point: $\hat{y} = +1$ if $x \geq t$, else $-1$.
2D
A line: $w_1x_1 + w_2x_2 + b = 0$. One side is $+1$, the other $-1$.
$d$D
A hyperplane: $\mathbf{w}^T\mathbf{x} + b = 0$. $\mathbf{w}$ is its normal vector.
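Many hyperplanes may separate the training data perfectly. Which one should we choose? A separator picked arbitrarily may pass very close to some training points, so a small amount of noise can flip their predicted labels and the classifier tends to generalize poorly. SVMs address this by choosing the maximum-margin separator.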
Margin and Distance to the Hyperplane
The margin is the width of the empty band around the decision boundary: twice the perpendicular distance from the boundary to the nearest training point. A larger margin leaves more room for noise in the inputs, which typically improves generalization.
The perpendicular distance from any point $\mathbf{A}$ to the hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ is:
$$d(\mathbf{A}) = \frac{|\mathbf{w}^T\mathbf{A} + b|}{\|\mathbf{w}\|}$$
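The distance $|\mathbf{w}^T\mathbf{A} + b| / \|\mathbf{w}\|$ can be checked numerically. The following is a minimal sketch; the hyperplane $3x_1 + 4x_2 - 5 = 0$ and the two test points are made up for illustration, and the function returns the signed distance so that the sign indicates the side of the hyperplane.

```python
import math

def signed_distance(w, b, point):
    """Signed perpendicular distance from `point` to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm_w

# Hypothetical hyperplane and test points (not from the text)
w, b = [3.0, 4.0], -5.0
d_pos = signed_distance(w, b, [3.0, 4.0])  # (9 + 16 - 5) / 5
d_neg = signed_distance(w, b, [0.0, 0.0])  # (0 + 0 - 5) / 5
print(d_pos, abs(d_neg))
```

Taking the absolute value gives the unsigned distance used in the margin derivation.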
Why the Margin is $2/\|\mathbf{w}\|$
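Let $\mathbf{x}^+$ be on $\mathbf{w}^T\mathbf{x} + b = +1$ and $\mathbf{x}^-$ on $\mathbf{w}^T\mathbf{x} + b = -1$. Subtracting gives $\mathbf{w}^T(\mathbf{x}^+ - \mathbf{x}^-) = 2$. The distance between the two margin hyperplanes is the component of $\mathbf{x}^+ - \mathbf{x}^-$ along the unit normal $\mathbf{w}/\|\mathbf{w}\|$, so it equals $\frac{\mathbf{w}^T(\mathbf{x}^+ - \mathbf{x}^-)}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$.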
Hard-Margin SVM — Primal Form
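Assume the data is linearly separable. We want to find $(\mathbf{w}, b)$ that maximizes the margin while correctly classifying all $n$ training points:
$$\min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n$$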
- Minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ is equivalent to maximizing the margin $\frac{2}{\|\mathbf{w}\|}$
- The constraint $y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1$ ensures all points lie on or outside the correct margin boundary
- The $\frac{1}{2}$ factor is for mathematical convenience when differentiating
The training points that satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$ exactly — those that sit right on the margin boundary — are called support vectors. They are the only points that determine the decision boundary. Remove any non-support-vector and the solution doesn't change.
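To make the primal constraints concrete, the sketch below checks a candidate $(\mathbf{w}, b)$ against a four-point toy dataset, reports which points sit exactly on the margin, and computes the objective $\frac{1}{2}\|\mathbf{w}\|^2$ and the margin width $2/\|\mathbf{w}\|$. The dataset and the candidate solution are assumptions chosen by hand for illustration; this is a feasibility check, not a solver.

```python
import math

# Toy data and a hand-picked candidate solution (both hypothetical)
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [+1, +1, -1, -1]
w, b = (0.5, 0.5), -1.0

def functional_margin(w, b, x, label):
    """y_i * (w.x_i + b); the hard-margin constraint requires this to be >= 1."""
    return label * (sum(wi * xi for wi, xi in zip(w, x)) + b)

margins = [functional_margin(w, b, x, t) for x, t in zip(X, y)]
feasible = all(m >= 1 - 1e-9 for m in margins)
# Support vectors satisfy the constraint with equality
support = [i for i, m in enumerate(margins) if abs(m - 1) < 1e-9]
objective = 0.5 * sum(wi * wi for wi in w)
margin_width = 2 / math.sqrt(sum(wi * wi for wi in w))
print(feasible, support, objective, margin_width)
```

Here the points at indices 0 and 2 are the support vectors, and the margin width equals the distance between them.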
Lagrangian Construction & Dual Form
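Introduce a Lagrange multiplier $\alpha_i \geq 0$ for each constraint:
$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right]$$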
KKT Conditions
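Setting the partial derivatives of $L$ with respect to the primal variables $\mathbf{w}$ and $b$ to zero gives the stationarity conditions:
$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$$
Substituting back into $L$ gives the dual objective, a function of $\boldsymbol{\alpha}$ only:
$$\max_{\boldsymbol{\alpha}} \; \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j \quad \text{subject to} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0$$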
The dual depends on training data only through dot products $\mathbf{x}_i \cdot \mathbf{x}_j$. This is the entry point for the kernel trick. Also, from the KKT complementarity condition: $\alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i+b)-1]=0$, so $\alpha_i > 0$ only for support vectors.
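As a small worked example, consider the hard-margin dual $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j$ on a two-point dataset. The constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = a$, so the dual reduces to a one-dimensional problem that a grid scan can solve. The dataset and the grid are assumptions for illustration; a real solver would use quadratic programming or SMO.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, X, y):
    """Hard-margin SVM dual objective; depends on X only through dot products."""
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

# Hypothetical two-point dataset
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [+1, -1]

# sum_i alpha_i y_i = 0 implies alpha_1 = alpha_2 = a; scan a over [0, 1]
grid = [k / 1000 for k in range(1001)]
best_a = max(grid, key=lambda a: dual_objective([a, a], X, y))
alpha = [best_a, best_a]

# Recover the primal solution: w = sum_i alpha_i y_i x_i, b from a support vector
w = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(2)]
b = y[0] - dot(w, X[0])
print(best_a, w, b)
```

For this dataset the dual is $2a - 4a^2$, maximized at $a = 1/4$, which gives $\mathbf{w} = (0.5, 0.5)$ and $b = 0$.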
Reconstructing the Decision Function
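Given the optimal multipliers $\boldsymbol{\alpha}^*$, the classifier is:
$$f(\mathbf{x}) = \operatorname{sign}\left( \sum_i \alpha_i^* y_i \, \mathbf{x}_i \cdot \mathbf{x} + b^* \right), \qquad b^* = y_s - \sum_i \alpha_i^* y_i \, \mathbf{x}_i \cdot \mathbf{x}_s \;\text{ for any support vector } \mathbf{x}_s$$
Only support vectors (those with $\alpha_i^* > 0$) contribute to the decision function. All other training points can be discarded after training.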
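The decision function $f(\mathbf{x}) = \operatorname{sign}(\sum_i \alpha_i^* y_i\,\mathbf{x}_i\cdot\mathbf{x} + b^*)$ needs only the support vectors, their labels, their multipliers, and the bias. A minimal sketch follows; the two support vectors and the values $\alpha_i^* = 0.25$, $b^* = -1$ are assumed for illustration (they correspond to $\mathbf{w} = (0.5, 0.5)$).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decision_score(x, support_vectors, labels, alphas, b):
    """Real-valued score sum_i alpha_i y_i (x_i . x) + b."""
    return sum(a * t * dot(sv, x)
               for sv, t, a in zip(support_vectors, labels, alphas)) + b

def predict(x, support_vectors, labels, alphas, b):
    return 1 if decision_score(x, support_vectors, labels, alphas, b) >= 0 else -1

# Hypothetical trained model: two support vectors with their multipliers
support_vectors = [(2.0, 2.0), (0.0, 0.0)]
labels = [+1, -1]
alphas = [0.25, 0.25]
b = -1.0

score_a = decision_score((3.0, 3.0), support_vectors, labels, alphas, b)
label_b = predict((0.0, 1.0), support_vectors, labels, alphas, b)
print(score_a, label_b)
```

Prediction cost scales with the number of support vectors rather than with the size of the training set.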
Soft-Margin SVM
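Real data is rarely linearly separable. The soft-margin SVM introduces a slack variable $\xi_i \geq 0$ for each point, allowing it to violate the margin constraint at a cost proportional to the size of the violation:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$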
$\xi_i = 0$
Point correctly classified, on or beyond the margin. No penalty.
$0 < \xi_i < 1$
Inside the margin but on the correct side. Margin violation penalized by $C\xi_i$.
$\xi_i \geq 1$
On the decision boundary ($\xi_i = 1$) or misclassified ($\xi_i > 1$). Penalty $C\xi_i \geq C$.
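At the optimum each slack equals the hinge loss, $\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$, so the three cases can be read off directly from a fitted model. The sketch below assumes a hand-picked $(\mathbf{w}, b)$ and three made-up positive points, one per case.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slack(w, b, x, label):
    """Slack xi = max(0, 1 - y (w.x + b)), i.e. the hinge loss of the point."""
    return max(0.0, 1.0 - label * (dot(w, x) + b))

def describe(xi):
    if xi == 0.0:
        return "on or beyond the margin"
    if xi < 1.0:
        return "inside the margin, correct side"
    return "on the boundary or misclassified"

# Hypothetical model and points (all labeled +1)
w, b = (1.0, 0.0), 0.0
points = [(2.0, 0.0), (0.5, 0.0), (-0.5, 0.0)]
slacks = [slack(w, b, x, +1) for x in points]
cases = [describe(xi) for xi in slacks]
for x, xi, case in zip(points, slacks, cases):
    print(x, xi, case)
```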
The Role of Parameter C
$C$ controls the trade-off between maximizing the margin and minimizing constraint violations.
Large C
Heavily penalize misclassifications. Prioritize correctness over margin width. Narrower margin, harder boundary. Risk: overfitting to noise.
Small C
Allow more margin violations. Prioritize a wider margin. Potentially better generalization on noisy data. Risk: underfitting.
$C$ is a hyperparameter — tune via cross-validation, typically on a log scale: $C \in \{10^{-3}, 10^{-2}, \ldots, 10^3\}$.
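The effect of $C$ can be illustrated by evaluating the soft-margin objective $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0,\, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$ for two fixed candidates over the log-scale grid. This is only a sketch: the dataset and both candidates are hand-picked assumptions, and it compares two solutions rather than training or cross-validating anything.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses)."""
    reg = 0.5 * dot(w, w)
    hinge = sum(max(0.0, 1.0 - t * (dot(w, x) + b)) for x, t in zip(X, y))
    return reg + C * hinge

# Hypothetical data: the second positive point sits close to the negative one
X = [(2.0, 0.0), (0.5, 0.0), (-1.5, 0.0)]
y = [+1, +1, -1]

candidates = {
    "wide": ((0.5, 0.0), 0.0),    # margin width 4, total slack 1.0
    "narrow": ((1.0, 0.0), 0.5),  # margin width 2, no violations
}

preferred = []
for C in [10.0 ** k for k in range(-3, 4)]:
    best = min(candidates,
               key=lambda name: soft_margin_objective(*candidates[name], X, y, C))
    preferred.append(best)
    print(C, best)
```

Small values of $C$ favor the wide-margin candidate despite its violations, while large values favor the narrow candidate that violates nothing.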
Soft-Margin Dual
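The Lagrangian for the soft-margin problem adds multipliers $\mu_i \geq 0$ for the slack constraints $\xi_i \geq 0$. After applying the KKT conditions, the dual form is almost identical to the hard-margin case, but with a box constraint on $\alpha_i$:
$$\max_{\boldsymbol{\alpha}} \; \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0$$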
The only difference from the hard-margin dual: $\alpha_i$ is now bounded above by $C$ (the box constraint). The decision function is identical. Support vectors are those with $0 < \alpha_i \leq C$.
The Kernel Trick
The SVM dual depends on data only through dot products $\mathbf{x}_i \cdot \mathbf{x}_j$. The kernel trick replaces these with a kernel function that implicitly computes dot products in a higher-dimensional space — without ever explicitly computing the feature map $\phi(\mathbf{x})$.
$k(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{z})$ for some (possibly infinite-dimensional) feature map $\phi$. We can compute $k$ directly without knowing $\phi$.
A polynomial kernel of degree 2 in 2D implicitly works in a 6-dimensional feature space. An RBF kernel works in an infinite-dimensional space. Yet evaluating either kernel costs only $O(d)$ in the input dimension $d$ (one dot product or one squared distance), regardless of the implicit dimension.
Polynomial Kernel — Example
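For $\mathbf{x}, \mathbf{y} \in \mathbb{R}^2$ and degree 2:
$$k(\mathbf{x},\mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^2 = 1 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2 + x_1^2y_1^2 + x_2^2y_2^2 = \phi(\mathbf{x}) \cdot \phi(\mathbf{y})$$
$$\phi(\mathbf{x}) = \left(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; \sqrt{2}\,x_1x_2,\; x_1^2,\; x_2^2\right)^T$$
Computing $k(\mathbf{x},\mathbf{y})$ takes one 2D dot product, an addition, and a squaring, but it equals $\phi(\mathbf{x}) \cdot \phi(\mathbf{y})$, the dot product of 6-dimensional vectors. We get the expressiveness of 6D for the cost of 2D.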
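The identity $k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})\cdot\phi(\mathbf{y})$ for the degree-2 polynomial kernel $(\mathbf{x}^T\mathbf{y} + 1)^2$ can be verified numerically. The sketch below uses the explicit feature map $\phi(\mathbf{x}) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1x_2, x_1^2, x_2^2)$; the two input vectors are arbitrary.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (x.z + 1)^2, computed in the 2D input space."""
    return (x[0] * z[0] + x[1] * z[1] + 1.0) ** 2

def phi(x):
    """Explicit 6-dimensional feature map matching poly2_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], r2 * x[0] * x[1], x[0] ** 2, x[1] ** 2]

x, z = (1.0, 2.0), (3.0, -1.0)
k_direct = poly2_kernel(x, z)                            # (3 - 2 + 1)^2
k_explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # 6D dot product
print(k_direct, k_explicit)
```

Both values agree up to floating-point rounding, but only the second one ever builds the 6D vectors.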
Kernel Functions in Practice
Polynomial
$k(\mathbf{x},\mathbf{z}) = (\mathbf{x}^T\mathbf{z} + 1)^q$. Degree $q$ controls boundary complexity; $q=1$ gives a linear SVM. Commonly used for text classification and image recognition.
RBF / Gaussian
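$k(\mathbf{x},\mathbf{z}) = e^{-\gamma\|\mathbf{x}-\mathbf{z}\|^2}$. Infinite-dimensional feature space. The value decays with distance, so it measures local similarity. $\gamma$ controls the width: a larger $\gamma$ gives a narrower kernel and a more flexible boundary. Most popular kernel in practice.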
Kernel Engineering
Strings, graphs, images. Kernels can be defined for any structured data type, as long as they compute a valid similarity (positive semi-definite). This enables SVMs for non-vectorial inputs.
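A kernelized SVM only ever sees the Gram matrix $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. The following sketch builds it for the RBF kernel $e^{-\gamma\|\mathbf{x}-\mathbf{z}\|^2}$ at two values of $\gamma$ to show how the width parameter changes similarities; the three points and the $\gamma$ values are arbitrary choices for illustration.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, kernel):
    """Matrix of pairwise kernel values K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
K_wide = gram_matrix(points, lambda p, q: rbf_kernel(p, q, gamma=0.1))
K_narrow = gram_matrix(points, lambda p, q: rbf_kernel(p, q, gamma=10.0))

for row in K_wide:
    print([round(v, 4) for v in row])
```

With the larger $\gamma$, off-diagonal entries shrink toward zero, so each point is similar mostly to itself.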
Combining Kernels
Any sum of valid kernels with nonnegative weights, any product of valid kernels, and any positive scaling of a valid kernel is itself a valid kernel. The adaptive kernel $\sum_i \eta_i(\mathbf{x}|\theta) k_i$ weights kernels as a function of the input, enabling localized flexibility.
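These closure rules translate directly into small higher-order functions. The sketch below is one possible way to compose kernels; the helper names `weighted_sum` and `product` and the chosen weights are illustrative, not from the text.

```python
import math

def linear(x, z):
    return sum(a * b for a, b in zip(x, z))

def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def weighted_sum(k1, k2, w1, w2):
    """w1*k1 + w2*k2; validity requires nonnegative weights."""
    if w1 < 0 or w2 < 0:
        raise ValueError("kernel weights must be nonnegative")
    return lambda x, z: w1 * k1(x, z) + w2 * k2(x, z)

def product(k1, k2):
    """Pointwise product of two kernels."""
    return lambda x, z: k1(x, z) * k2(x, z)

# 0.5 * linear + 2.0 * (linear * rbf)
combined = weighted_sum(linear, product(linear, rbf), 0.5, 2.0)

same = combined((1.0, 0.0), (1.0, 0.0))
orthogonal = combined((1.0, 0.0), (0.0, 1.0))
print(same, orthogonal)
```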
Kernelized SVM Dual
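Simply replace every dot product $\mathbf{x}_i \cdot \mathbf{x}_j$ in the soft-margin dual with $k(\mathbf{x}_i, \mathbf{x}_j)$:
$$\max_{\boldsymbol{\alpha}} \; \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0$$
The decision function becomes $f(\mathbf{x}) = \operatorname{sign}\left(\sum_i \alpha_i^* y_i \, k(\mathbf{x}_i, \mathbf{x}) + b^*\right)$.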
In the original input space, the decision boundary is now a (possibly curved) hypersurface. In the implicit feature space $\phi$, it remains a linear hyperplane: the SVM stays linear, we just changed the space. With a universal kernel such as the RBF, this lets an SVM approximate any continuous decision function on a compact domain arbitrarily well, given suitable hyperparameters.
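The XOR problem is a standard case that no linear classifier in the input space can solve but a degree-2 polynomial kernel can. The sketch below evaluates the kernelized decision function $\sum_i \alpha_i y_i\, k(\mathbf{x}_i, \mathbf{x}) + b$ with $\alpha_i = 1/8$ and $b = 0$; these values are derived by hand for this symmetric four-point dataset (all four points end up as support vectors), not produced by a solver.

```python
def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (x.z + 1)^2."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2

# XOR-style data: label is the sign of x1 * x2
X = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
y = [+1, +1, -1, -1]
alphas = [0.125] * 4   # hand-derived for this toy problem (assumption)
b = 0.0

def decision_score(x):
    return sum(a * t * poly2_kernel(xi, x) for a, t, xi in zip(alphas, y, X)) + b

def predict(x):
    return 1 if decision_score(x) >= 0 else -1

train_scores = [decision_score(x) for x in X]
print(train_scores, decision_score((2.0, 3.0)), predict((0.5, -0.5)))
```

Every training point scores exactly $\pm 1$, so all four lie on the margin, and the score simplifies algebraically to $x_1 x_2$: a curved boundary in the input space, a hyperplane in the 6D feature space.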
Multiclass SVM & One-Class SVM
Multiclass Strategies
- One-vs-Rest (OvR), also called One-vs-All — train $K$ binary SVMs, one per class against all others. Classify by the class with highest confidence score.
- One-vs-One (OvO) — train $\frac{K(K-1)}{2}$ binary SVMs, one per pair of classes. Classify by majority vote.
- Single multiclass optimization — minimize $\frac{1}{2}\sum_k \|\mathbf{w}_k\|^2 + C\sum_i\sum_k \xi_i^k$ directly. Harder to solve but theoretically cleaner.
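The first two strategies differ only in how binary outputs are combined. The sketch below shows both combination rules on a 1D, three-class toy problem; the scoring functions and pairwise classifiers are hand-written stand-ins for trained binary SVMs, and the class names are made up.

```python
from collections import Counter

def ovr_predict(x, scorers):
    """One-vs-Rest: pick the class whose binary scorer is most confident."""
    return max(scorers, key=lambda c: scorers[c](x))

def ovo_predict(x, pair_classifiers):
    """One-vs-One: each pairwise classifier votes; the majority wins.
    Ties are broken by first-seen order in this sketch."""
    votes = Counter(clf(x) for clf in pair_classifiers.values())
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for K = 3 trained OvR decision functions
ovr_scorers = {
    "left": lambda x: -x - 1.0,
    "mid": lambda x: 0.5,
    "right": lambda x: x - 1.0,
}

# Hypothetical stand-ins for K(K-1)/2 = 3 trained pairwise classifiers
ovo_classifiers = {
    ("left", "mid"): lambda x: "left" if x < -1 else "mid",
    ("left", "right"): lambda x: "left" if x < 0 else "right",
    ("mid", "right"): lambda x: "mid" if x < 1 else "right",
}

for x in (-3.0, 0.0, 3.0):
    print(x, ovr_predict(x, ovr_scorers), ovo_predict(x, ovo_classifiers))
```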
One-Class SVM — Anomaly Detection
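Train on only positive/normal examples. Learn the tightest hypersphere enclosing the data in feature space, allowing slack $\xi_i \geq 0$ for outliers in the training set (this formulation is also known as support vector data description, SVDD). Points outside the sphere are anomalies.
$$\min_{R,\, \omega,\, \boldsymbol{\xi}} \; R^2 + C\sum_i \xi_i \quad \text{subject to} \quad \|\phi(\mathbf{x}_i) - \omega\|^2 \leq R^2 + \xi_i, \quad \xi_i \geq 0$$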
$\omega$ is the center, $R$ is the radius in feature space. A test point $\mathbf{x}$ is normal if $\|\phi(\mathbf{x})-\omega\|^2 \leq R^2$, otherwise it is flagged as an anomaly.
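The test $\|\phi(\mathbf{x})-\omega\|^2 \leq R^2$ never needs $\phi$ explicitly: if the center is a weighted combination $\omega = \sum_i \beta_i \phi(\mathbf{x}_i)$, the squared distance expands to $k(\mathbf{x},\mathbf{x}) - 2\sum_i \beta_i k(\mathbf{x}_i,\mathbf{x}) + \sum_{i,j}\beta_i\beta_j k(\mathbf{x}_i,\mathbf{x}_j)$. The sketch below makes two simplifying assumptions that are not part of the trained model: uniform weights $\beta_i = 1/n$ (the center is the feature-space mean) and $R^2$ set to the largest training distance. A linear kernel is used so the result can be checked against ordinary Euclidean geometry.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def sq_dist_to_center(x, train, beta, kernel):
    """||phi(x) - omega||^2 with omega = sum_i beta_i phi(x_i), via kernels only."""
    k_xx = kernel(x, x)
    cross = sum(b * kernel(xi, x) for b, xi in zip(beta, train))
    center_norm = sum(bi * bj * kernel(xi, xj)
                      for bi, xi in zip(beta, train)
                      for bj, xj in zip(beta, train))
    return k_xx - 2.0 * cross + center_norm

# Hypothetical "normal" training data: corners of a square centered at (1, 1)
train = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
beta = [1.0 / len(train)] * len(train)   # assumption: uniform weights
R2 = max(sq_dist_to_center(x, train, beta, linear_kernel) for x in train)

def is_normal(x):
    return sq_dist_to_center(x, train, beta, linear_kernel) <= R2

print(R2, is_normal((1.0, 1.0)), is_normal((4.0, 1.0)))
```

Swapping `linear_kernel` for an RBF or polynomial kernel gives a sphere in the corresponding feature space, which maps back to a non-spherical region in the input space.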