Lazy Learners & Density Estimation
Instead of building an explicit model during training, lazy learners defer all computation to prediction time — looking up the most similar training examples to answer each query. We examine the theory, algorithms, and scalability solutions that make this practical.
Lazy vs. Eager Learning
Lazy learning postpones generalization until a query is made: training only stores the examples, and prediction computes similarities on the fly. Eager learning, by contrast, builds its model during training.
| Property | Lazy (Instance-Based) | Eager (Model-Based) |
|---|---|---|
| When it generalizes | At prediction time | During training |
| Training cost | Near zero — just store data | High — build model |
| Prediction cost | High — search all stored data | Low — evaluate model |
| Memory | Entire training set | Model parameters only |
| Flexibility | Adapts to local patterns naturally | Fixed inductive bias |
| Examples | k-NN, kernel regression | Decision trees, SVM, neural nets |
Parametric vs. Non-Parametric
Parametric Methods
Assume a fixed functional form (e.g., Gaussian, linear). Summarize data in a fixed number of parameters. Analogy: following a recipe — predict house price using learned coefficients for square footage and bedrooms.
Non-Parametric Methods
No fixed form — model complexity grows with data. Analogy: comparing similar houses — find the $k$ most similar houses and average their prices. The data itself is the model.
Non-parametric methods are typically lazy learners — because without a compact model, you need the raw data at prediction time. Parametric methods are typically eager — once parameters are estimated, raw data can be discarded.
Density Estimation
Given i.i.d. samples $\mathcal{X} = \{x^t\}_{t=1}^N$ from an unknown distribution $p(x)$, estimate $p(x)$ without assuming a parametric form.
Parametric
Assume a specific form (e.g., Gaussian). Fit parameters via MLE. Fast prediction. Fails if assumption is wrong.
Non-Parametric
Minimal assumptions. Examples: histogram, KDE, k-NN estimator. Flexible but computationally intensive.
Semi-Parametric
Combine a known parametric structure with non-parametric adjustments. E.g., Gaussian mixture models with learned components.
Histogram, Naive & Kernel Estimators
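In all three estimators, $h$ sets the amount of smoothing: the bin width for the histogram, the window width for the naive estimator, and the bandwidth for the kernel estimator $\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$. Small $h$: spiky, high-variance estimate that overfits. Large $h$: over-smoothed, high-bias estimate that misses true peaks. Bandwidth selection (rule-of-thumb, cross-validation) is the central challenge of KDE.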
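The bandwidth trade-off can be seen numerically. The following is a minimal sketch, assuming a one-dimensional Gaussian kernel and a small made-up sample with two clusters; the names `gaussian_kernel`, `kde`, and `samples` are illustrative and not from the original notes.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, samples, h):
    """Kernel density estimate at x: (1 / (N h)) * sum_t K((x - x_t) / h)."""
    n = len(samples)
    return sum(gaussian_kernel((x - xt) / h) for xt in samples) / (n * h)

# Hypothetical 1-D sample with two clusters, one near 0 and one near 5.
samples = [-0.3, -0.1, 0.0, 0.2, 0.4, 4.6, 4.9, 5.0, 5.3]

# Evaluate at the two cluster centers and in the gap between them.
for h in (0.05, 0.5, 5.0):
    row = [kde(x, samples, h) for x in (0.0, 2.5, 5.0)]
    print(h, [round(v, 4) for v in row])
```

With the moderate bandwidth the gap at 2.5 gets almost no density, while the largest bandwidth smears the two clusters together so that the gap scores higher than either cluster center.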
Nearest Neighbors Density Estimator
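Instead of a fixed bandwidth $h$, the NN estimator adapts the window size to local data density: use the distance to the $k$-th nearest neighbor as the radius. In one dimension this gives $\hat{p}(x) = \frac{k}{2N\,d_k(x)}$, where $d_k(x)$ is the distance from $x$ to its $k$-th nearest sample.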
- Dense regions — $d_k(x)$ is small → narrow window → high estimated density
- Sparse regions — $d_k(x)$ is large → wide window → lower estimated density
- Advantage over KDE — automatically adapts to local density; no need to tune a global bandwidth
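The adaptive window can be sketched in a few lines. This assumes the one-dimensional form $k / (2N\,d_k(x))$ and a made-up sample; `knn_density` and `samples` are hypothetical names used only for illustration.

```python
def knn_density(x, samples, k):
    """1-D k-NN density estimate: k / (2 * N * d_k(x)).

    d_k(x) is the distance from x to its k-th nearest sample. The estimate
    is undefined when d_k(x) == 0 (x coincides with k or more samples).
    """
    dists = sorted(abs(x - xt) for xt in samples)
    d_k = dists[k - 1]
    return k / (2.0 * len(samples) * d_k)

# Hypothetical sample: a dense cluster near 0 and another near 5.
samples = [-0.3, -0.1, 0.0, 0.2, 0.4, 4.6, 4.9, 5.0, 5.3]

dense = knn_density(0.0, samples, k=3)    # small d_k -> narrow window
sparse = knn_density(2.5, samples, k=3)   # large d_k -> wide window
print(dense, sparse)
```

The query inside the cluster reaches its third neighbor quickly and gets a high estimate; the query in the gap has to widen its window and gets a much lower one.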
Multivariate KDE & the Curse of Dimensionality
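Extending KDE to $d$ dimensions requires a multivariate kernel: $\hat{p}(\mathbf{x}) = \frac{1}{N h^d}\sum_{t=1}^{N} K\!\left(\frac{\mathbf{x}-\mathbf{x}^t}{h}\right)$. The Gaussian kernel generalizes naturally: $K(\mathbf{u}) = (2\pi)^{-d/2}\exp\!\left(-\tfrac{1}{2}\|\mathbf{u}\|^2\right)$.
In high dimensions, points become nearly equidistant from one another, so distance loses its discriminating power. With 10 bins per dimension in an 8-dimensional space, there are $10^8 = 100\,000\,000$ bins, so even 1 million samples leave at least 99% of the bins empty. Density estimation (and k-NN) degrades rapidly as $d$ grows.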
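- Most of a hypersphere's volume lies near its surface, so uniformly spread points concentrate there rather than near the center
- Pairwise distances concentrate: under common distributional assumptions, the ratio of the farthest to the nearest distance approaches 1 as $d \to \infty$
- Neighborhoods must grow large to contain $k$ points, so the assumption that class probabilities are roughly constant within a neighborhood breaks down, especially for large $k$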
- Solutions: dimensionality reduction (PCA, autoencoders), feature selection, manifold learning
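Distance concentration is easy to observe empirically. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative contrast (farthest minus nearest, divided by nearest) from one random query; the function name, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random

def relative_contrast(d, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to
    n_points uniform samples in the unit hypercube [0, 1]^d."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(d)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 500):
    print(d, round(relative_contrast(d), 3))

# The bin-count arithmetic: 10 bins per axis in 8 dimensions.
bins = 10 ** 8
max_occupied_fraction = 10 ** 6 / bins   # at most 1% of bins can be non-empty
print(bins, max_occupied_fraction)
```

As $d$ grows, the contrast shrinks toward zero: the nearest and farthest neighbors end up at almost the same distance, which is what makes "nearest" less informative.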
k-Nearest Neighbors (k-NN)
To classify a new point $\mathbf{x}$: compute distances to all training points, identify the $k$ nearest, assign the majority class among them.
Choosing $k$
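Use odd $k$ for binary classification to avoid tied votes. Tune via cross-validation: plot validation error against $k$ and pick the value with the lowest error (or the elbow where the error stops improving). Small $k$: high variance, overfitting. Large $k$: high bias, oversmoothing.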
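One way to tune $k$ is leave-one-out cross-validation. The following is a rough sketch on synthetic data, assuming Euclidean distance and two made-up Gaussian blobs; `knn_predict`, `loo_error`, and the candidate list are illustrative, not part of the original material.

```python
import math
import random
from collections import Counter

def knn_predict(query, points, labels, k):
    """Majority vote among the k training points closest to query."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_error(points, labels, k):
    """Leave-one-out error rate: classify each point from all the others."""
    mistakes = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(points[i], rest_points, rest_labels, k) != labels[i]:
            mistakes += 1
    return mistakes / len(points)

# Hypothetical data: two overlapping 2-D Gaussian blobs, 30 points per class.
rng = random.Random(0)
points, labels = [], []
for label, center in (("A", 0.0), ("B", 2.5)):
    for _ in range(30):
        points.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
        labels.append(label)

candidates = (1, 3, 5, 9, 15, 59)   # all odd, so binary votes never tie
errors = {k: loo_error(points, labels, k) for k in candidates}
best_k = min(candidates, key=lambda k: errors[k])
print(errors, best_k)
```

The last candidate is deliberately extreme: with $k = 59$ every remaining point votes, the held-out point's own class is always outnumbered 29 to 30, and every prediction is wrong. It is the limiting case of oversmoothing.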
Distance metric
Euclidean ($\ell_2$) is default. Manhattan ($\ell_1$) for high-$d$. Cosine for text. Hamming for categorical. The choice critically affects the neighborhood shape.
Tie-breaking
When votes are tied: use the closest neighbor's class, use weighted voting (weight each vote by the inverse of its distance to the query), or use random selection.
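Non-parametric classification estimates the class-conditional densities directly. With a kernel estimator, $\hat{p}(\mathbf{x} \mid C_i) = \frac{1}{N_i h^d}\sum_t K\!\left(\frac{\mathbf{x}-\mathbf{x}^t}{h}\right) r_i^t$ where $r_i^t = 1$ iff $\mathbf{x}^t \in C_i$ (0 otherwise) and $N_i = \sum_t r_i^t$. With the prior estimated as $\hat{P}(C_i) = N_i / N$, the discriminant function is $g_i(\mathbf{x}) = \hat{p}(\mathbf{x} \mid C_i)\,\hat{P}(C_i)$. The k-NN counterpart is $\hat{p}(\mathbf{x} \mid C_i) = \frac{k_i}{N_i V^k(\mathbf{x})}$, where $k_i$ of the $k$ nearest neighbors belong to $C_i$ and $V^k(\mathbf{x})$ is the volume of the hypersphere reaching the $k$-th neighbor. This gives the posterior $\hat{P}(C_i \mid \mathbf{x}) = k_i / k$, which is exactly the majority vote.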
k-NN Example: Political Affiliation ($k = 3$)
Test instance: female, young, rich. Distance metric: count of mismatching attribute values (0 = same, 1 = different).
| # | Gender | Age | Wealth | Politics | Distance |
|---|---|---|---|---|---|
| 1 | male | middle-aged | rich | Right-wing | 2 (gender, age differ) |
| 2 | male | young | rich | Right-wing | 1 (gender differs) |
| 3 | female | young | poor | Left-wing | 1 (wealth differs) |
| 4 | female | middle-aged | poor | Left-wing | 2 (age, wealth differ) |
| 5 | male | young | poor | Right-wing | 2 (gender, wealth differ) |
| 6 | male | old | poor | Right-wing | 3 (all differ) |
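Instances 2 and 3 (distance 1) are the two nearest neighbors. The third neighbor is a three-way tie at distance 2 among instances 1, 4, and 5, so the result depends on tie-breaking: with instance 1 or 5 (both Right-wing) the majority vote is Right-wing; with instance 4 (Left-wing) it is Left-wing.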
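The table can be checked mechanically. This sketch encodes the six instances, computes the mismatch-count distance to the test instance, and shows the vote for each possible choice of third neighbor; computing the distances shows instances 1, 4, and 5 all at distance 2. The helper names are illustrative.

```python
from collections import Counter

# (gender, age, wealth) -> politics, in table order (instances 1..6).
train = [
    (("male", "middle-aged", "rich"), "Right-wing"),
    (("male", "young", "rich"), "Right-wing"),
    (("female", "young", "poor"), "Left-wing"),
    (("female", "middle-aged", "poor"), "Left-wing"),
    (("male", "young", "poor"), "Right-wing"),
    (("male", "old", "poor"), "Right-wing"),
]
query = ("female", "young", "rich")

def mismatch_distance(a, b):
    """Number of attributes on which a and b differ."""
    return sum(1 for u, v in zip(a, b) if u != v)

def vote(indices):
    """Majority class among the given training indices."""
    return Counter(train[i][1] for i in indices).most_common(1)[0][0]

distances = [mismatch_distance(x, query) for x, _ in train]
ranked = sorted(range(len(train)), key=lambda i: distances[i])
closest = [i for i in ranked if distances[i] == 1]   # instances 2 and 3
tied = [i for i in ranked if distances[i] == 2]      # candidates for third place

# Outcome of the k = 3 vote for each possible third neighbor (1-based ids).
outcomes = {i + 1: vote(closest + [i]) for i in tied}
print(distances)   # [2, 1, 1, 2, 2, 3]
print(outcomes)    # {1: 'Right-wing', 4: 'Left-wing', 5: 'Right-wing'}
```

Two of the three possible third neighbors yield Right-wing, so random tie-breaking would favor that label, but a deterministic rule has to be fixed in advance for the prediction to be reproducible.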
Adaptive Nearest Neighbor Methods
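Standard k-NN uses Euclidean distance, treating all dimensions equally. Adaptive NN instead uses a Mahalanobis-type metric that stretches and rotates the neighborhood based on local data covariance. In the discriminant adaptive nearest neighbor (DANN) form, which the symbols below appear to follow, the squared distance from a query $\mathbf{x}_0$ is $D(\mathbf{x}, \mathbf{x}_0) = (\mathbf{x}-\mathbf{x}_0)^\top \Sigma\,(\mathbf{x}-\mathbf{x}_0)$ with $\Sigma = W^{-1/2}\left[B^* + \varepsilon I\right] W^{-1/2}$ and $B^* = W^{-1/2} B\, W^{-1/2}$, where:
- $W$: pooled within-class covariance matrix, which rescales (spheres) the features
- $B^*$: local between-class covariance matrix $B$ after sphering by $W$
- $\varepsilon I$: small regularization term for numerical stability; it also keeps the neighborhood from stretching without bound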
Computing $W^{-1/2}$ and performing matrix operations is expensive in high dimensions. Local covariance estimates are unstable in sparse regions. The method's performance is highly sensitive to parameter tuning.
Prototyping
k-NN requires computing distances to every training point at prediction time — $O(Nd)$ per query. Prototyping replaces the full training set with a compact set of representative points.
k-Means Prototypes
Run k-Means within each class to find $K$ cluster centers per class and use these as prototypes. This is fast, but because other classes do not influence prototype placement, cluster centers may end up near class boundaries and hurt accuracy.
Gaussian Mixture Models
Fit a GMM per class using EM (soft clustering). Prototypes are the mixture component means. Soft clustering accounts for uncertainty near boundaries, placing prototypes more representatively.
More prototypes → usually better accuracy but slower prediction. Fewer prototypes → faster but less accurate. The number of prototypes is typically chosen by cross-validation. Prototyping is essentially a form of model compression for lazy learners.
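A minimal sketch of the k-Means variant follows, assuming plain Lloyd iterations with a deterministic initialization (the first $K$ points of each class) and a tiny made-up dataset. The function names and data are hypothetical, and a real implementation would use a better initialization.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm. Initial centers are the first k points,
    a simplification that keeps this sketch deterministic."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def build_prototypes(data_by_class, k):
    """Run k-means separately inside each class; label each center."""
    prototypes = []
    for label, pts in data_by_class.items():
        prototypes.extend((center, label) for center in kmeans(pts, k))
    return prototypes

def classify(x, prototypes):
    """1-NN against the prototypes instead of the full training set."""
    return min(prototypes, key=lambda pl: math.dist(x, pl[0]))[1]

# Hypothetical data: each class consists of two separate clumps.
data_by_class = {
    "A": [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)],
    "B": [(10, 0), (10, 1), (11, 0), (0, 10), (1, 10), (0, 11)],
}
prototypes = build_prototypes(data_by_class, k=2)
print(len(prototypes), classify((5.5, 5.0), prototypes), classify((9.0, 1.0), prototypes))
```

Here 12 training points are replaced by 4 prototypes, so each query costs 4 distance computations instead of 12. The same trade-off scales to realistic dataset sizes.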
Locality Sensitive Hashing (LSH)
Even with prototyping, exact nearest-neighbor search stays linear in the number of stored points ($O(Nd)$ for $N$ points in $d$ dimensions). LSH enables approximate nearest neighbor search in sublinear time by hashing similar points into the same bucket with high probability.
Standard Hashing
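Designed for fast lookup with uniform distribution. Similar inputs produce unrelated hash values by design: this spreads keys evenly across buckets and, for cryptographic hashes, is a security requirement. For example, the MD5 digests of "apple" and "apricot" have nothing in common, even though the strings share a prefix.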
Locality Sensitive Hashing
Hash functions tailored to a distance metric so that similar items land in the same bucket with high probability. Used for approximate nearest neighbor search. Balances false positives vs. false negatives via parameter tuning.
Random Projection Hashing (Euclidean)
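$h(\mathbf{x}) = \left\lfloor \frac{\mathbf{p} \cdot \mathbf{x} + b}{w} \right\rfloor$, where $\mathbf{p}$ is a random direction vector, $b$ is a random offset, and $w$ is the bucket width. Similar vectors have similar projections and thus land in the same bucket with high probability.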
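A single random-projection hash can be sketched as follows. The sketch assumes the common choice of Gaussian entries for $\mathbf{p}$ and an offset $b$ drawn uniformly from $[0, w]$; those sampling choices, the function names, and the test vectors are assumptions made for illustration.

```python
import math
import random

def make_hash(dim, w, rng):
    """Return h(x) = floor((p . x + b) / w) for a freshly drawn p and b."""
    p = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # random direction (assumed Gaussian)
    b = rng.uniform(0.0, w)                         # random offset (assumed uniform)
    def h(x):
        projection = sum(pi * xi for pi, xi in zip(p, x))
        return math.floor((projection + b) / w)
    return h

def collision_rate(x, y, hashes):
    """Fraction of hash functions that put x and y in the same bucket."""
    return sum(1 for h in hashes if h(x) == h(y)) / len(hashes)

rng = random.Random(0)
hashes = [make_hash(dim=2, w=4.0, rng=rng) for _ in range(500)]

near_rate = collision_rate((1.0, 1.0), (1.05, 0.95), hashes)
far_rate = collision_rate((1.0, 1.0), (8.0, -6.0), hashes)
print(near_rate, far_rate)
```

Averaged over many independently drawn hash functions, the nearby pair shares a bucket far more often than the distant pair, which is the locality-sensitive property the index relies on.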
LSH Algorithm
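- Preprocessing: insert every training point into each of $L$ hash tables, each keyed by its own set of hash functions
- For each of the $L$ hash tables: retrieve all points from the same bucket as the query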
- Compute exact distances to candidate points from all tables
- Return the $k$ nearest candidates found
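Instead of comparing the query to all $N$ training points, LSH compares it only to points in the same bucket(s). For well-chosen parameters the query cost is sublinear, $O(N^{\rho})$ with $\rho < 1$ depending on the approximation factor. The approximation quality is controlled by the number of hash tables $L$ and the number of hash functions per table (often also written $k$, not to be confused with the number of neighbors).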
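The full index-then-query procedure might look like the sketch below. It assumes Euclidean random-projection hashes, concatenates several hashes into one bucket key per table, and uses made-up data; the class `LSHIndex` and its parameter names (`num_tables` for $L$, `hashes_per_table` for the per-table hash count) are hypothetical.

```python
import math
import random
from collections import defaultdict

class LSHIndex:
    def __init__(self, dim, num_tables, hashes_per_table, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.points = []
        self.tables = []   # one (hash parameters, bucket dict) pair per table
        for _ in range(num_tables):
            funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
                     for _ in range(hashes_per_table)]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, x):
        """Bucket key: the tuple of all hash values for this table."""
        return tuple(
            math.floor((sum(pi * xi for pi, xi in zip(p, x)) + b) / self.w)
            for p, b in funcs
        )

    def add(self, x):
        idx = len(self.points)
        self.points.append(x)
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, x)].append(idx)

    def query(self, x, k):
        """Collect candidates from matching buckets, then rank them exactly."""
        candidates = set()
        for funcs, buckets in self.tables:
            candidates.update(buckets.get(self._key(funcs, x), []))
        ranked = sorted(candidates, key=lambda i: math.dist(x, self.points[i]))
        return ranked[:k]

# Hypothetical data: 500 uniform points in a 100 x 100 square.
rng = random.Random(1)
data = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(500)]
index = LSHIndex(dim=2, num_tables=5, hashes_per_table=3, w=10.0)
for point in data:
    index.add(point)

result = index.query(data[7], k=3)
print(result)
```

Querying with a stored point always finds that point first, because it hashes to its own bucket in every table. The remaining results are approximate: a true neighbor is missed if it shares no bucket with the query in any of the tables, which is why adding tables raises recall at the cost of more candidates.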
Next: Support Vector Machines & Kernel Methods. We move to the maximum-margin classifier, deriving the hard and soft margin SVM from first principles, formulating the Lagrangian dual, and extending to nonlinear boundaries via the kernel trick.