Lecture 7 · Chapter 8 · 14 April

Lazy Learners & Density Estimation

Instead of building an explicit model during training, lazy learners defer all computation to prediction time — looking up the most similar training examples to answer each query. We examine the theory, algorithms, and scalability solutions that make this practical.

Authors
Seixas Junior, Mahmud & Koren
Key algorithms
k-NN · KDE · LSH
Date
14 April
00 · Core Distinction

Lazy vs. Eager Learning

Lazy Learning

Postpones generalization until a query is made. Training stores examples; prediction computes similarities on the fly.

| Property | Lazy (Instance-Based) | Eager (Model-Based) |
|---|---|---|
| When it generalizes | At prediction time | During training |
| Training cost | Near zero: just store data | High: build model |
| Prediction cost | High: search all stored data | Low: evaluate model |
| Memory | Entire training set | Model parameters only |
| Flexibility | Adapts to local patterns naturally | Fixed inductive bias |
| Examples | k-NN, kernel regression | Decision trees, SVM, neural nets |
01 · Classification

Parametric vs. Non-Parametric

📐

Parametric Methods

Assume a fixed functional form (e.g., Gaussian, linear). Summarize data in a fixed number of parameters. Analogy: following a recipe — predict house price using learned coefficients for square footage and bedrooms.

🔎

Non-Parametric Methods

No fixed form — model complexity grows with data. Analogy: comparing similar houses — find the $k$ most similar houses and average their prices. The data itself is the model.

Key Connection

Non-parametric methods are typically lazy learners — because without a compact model, you need the raw data at prediction time. Parametric methods are typically eager — once parameters are estimated, raw data can be discarded.
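The house-price analogy can be made concrete with a minimal sketch. The data set, the single feature (floor area), and the function names `fit_line` and `knn_price` are all hypothetical and chosen only for illustration.

```python
# Hypothetical data: (floor area in square metres, price in thousands).
houses = [(50, 150), (70, 200), (90, 250), (110, 300), (130, 350)]

def fit_line(data):
    """Eager/parametric: compress the data into two numbers (slope, intercept)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    return slope, mean_y - slope * mean_x

def knn_price(data, area, k):
    """Lazy/non-parametric: average the prices of the k most similar houses."""
    nearest = sorted(data, key=lambda house: abs(house[0] - area))[:k]
    return sum(price for _, price in nearest) / k

# Parametric: after fitting, only slope and intercept are needed.
slope, intercept = fit_line(houses)
parametric_prediction = slope * 100 + intercept

# Non-parametric: the raw data must be available at query time.
lazy_prediction = knn_price(houses, 100, k=2)
print(parametric_prediction, lazy_prediction)
```

On this perfectly linear toy data the two approaches agree; they diverge once the true relationship departs from the assumed form.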

02 · Density Estimation

Density Estimation

Given i.i.d. samples $\mathcal{X} = \{x^t\}_{t=1}^N$ from an unknown distribution $p(x)$, estimate $p(x)$ without assuming a parametric form.

Parametric

Assume a specific form (e.g., Gaussian). Fit parameters via MLE. Fast prediction. Fails if assumption is wrong.

Non-Parametric

Minimal assumptions. Examples: histogram, KDE, k-NN estimator. Flexible but computationally intensive.

Semi-Parametric

Combine a known parametric structure with non-parametric adjustments. E.g., Gaussian mixture models with learned components.

Empirical CDF and Density $$\hat{F}(x) = \frac{\#\{x^t \leq x\}}{N} \qquad \hat{p}(x) = \frac{1}{h}\left[\frac{\#\{x^t \leq x+h\} - \#\{x^t \leq x\}}{N}\right]$$
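As a rough illustration of the two formulas, the sketch below computes the empirical CDF by counting and the density as a finite difference of that CDF. The sample values and the function names are assumptions made for the example.

```python
def ecdf(sample, x):
    """Fraction of sample points less than or equal to x."""
    return sum(1 for xt in sample if xt <= x) / len(sample)

def density_from_cdf(sample, x, h):
    """Finite-difference density estimate: (F(x + h) - F(x)) / h."""
    return (ecdf(sample, x + h) - ecdf(sample, x)) / h

sample = [1.0, 2.0, 2.5, 3.0, 4.0]   # hypothetical data
cdf_value = ecdf(sample, 2.5)         # 3 of the 5 points are <= 2.5
density_value = density_from_cdf(sample, 2.0, 1.0)
print(cdf_value, density_value)
```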
03 · Estimators

Histogram, Naive & Kernel Estimators

Histogram
Divide the range into fixed bins of width $h$. Count examples per bin, normalize by $Nh$. Simple but discontinuous: the density is constant within each bin and jumps at bin edges. $\hat{p}(x) = \frac{\#\{x^t \text{ in same bin as } x\}}{Nh}$
Naive Estimator
A sliding window of width $h$ centered at the query point $x$: $\hat{p}(x) = \frac{\#\{x^t \in (x-h/2, x+h/2]\}}{Nh}$. Equivalent to histogram with a moving center. Still discontinuous.
Kernel Estimator (KDE)
Replace the hard window with a smooth weight function (kernel). Each data point contributes a bell curve: $\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} K\!\left(\frac{x-x^t}{h}\right)$. With Gaussian kernel: $K(u) = \frac{1}{\sqrt{2\pi}}e^{-u^2/2}$. Produces a smooth density estimate.
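The three estimators differ only in how they weight the sample points around $x$, which a side-by-side sketch may make clearer. The `origin` parameter for the histogram bins and the sample values are assumptions introduced here.

```python
import math

def histogram_estimate(sample, x, h, origin=0.0):
    """Count points in the fixed bin containing x, normalized by N*h."""
    bin_index = math.floor((x - origin) / h)
    count = sum(1 for xt in sample
                if math.floor((xt - origin) / h) == bin_index)
    return count / (len(sample) * h)

def naive_estimate(sample, x, h):
    """Count points in the window (x - h/2, x + h/2], normalized by N*h."""
    count = sum(1 for xt in sample if x - h / 2 < xt <= x + h / 2)
    return count / (len(sample) * h)

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kernel_estimate(sample, x, h):
    """Sum of Gaussian bumps centered on the sample points."""
    return sum(gaussian_kernel((x - xt) / h) for xt in sample) / (len(sample) * h)

sample = [1.0, 2.0, 2.5, 3.0, 4.0]   # hypothetical data
for estimator in (histogram_estimate, naive_estimate, kernel_estimate):
    print(estimator.__name__, estimator(sample, 2.2, 1.0))
```

Only the kernel estimate changes smoothly as the query point moves; the other two jump whenever a sample point enters or leaves the bin or window.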
Bandwidth $h$ — The Critical Hyperparameter

Small $h$: spiky, high-variance estimate that overfits. Large $h$: over-smoothed, high-bias estimate that misses true peaks. Bandwidth selection (rule-of-thumb, cross-validation) is the central challenge of KDE.
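Both selection strategies can be sketched in a few lines. The rule of thumb used here is the normal reference rule $h = 1.06\,\hat{\sigma} N^{-1/5}$, and the cross-validation criterion is leave-one-out log-likelihood; both are common choices assumed for this example, not necessarily the ones used in the lecture. The sample and the candidate grid are hypothetical.

```python
import math
import statistics

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def reference_bandwidth(sample):
    """Normal reference rule of thumb: 1.06 * sigma * N^(-1/5)."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-1 / 5)

def loo_log_likelihood(sample, h):
    """Score h by how well each point is predicted from all the others."""
    n = len(sample)
    total = 0.0
    for i, xi in enumerate(sample):
        p = sum(gaussian_kernel((xi - xt) / h)
                for j, xt in enumerate(sample) if j != i) / ((n - 1) * h)
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total

sample = [1.0, 1.2, 1.9, 2.1, 2.4, 3.0, 3.3, 4.1, 4.8, 5.0]  # hypothetical
candidates = [0.05, 0.2, 0.5, 1.0, 2.0, 5.0]
best_h = max(candidates, key=lambda h: loo_log_likelihood(sample, h))
print("rule of thumb:", reference_bandwidth(sample))
print("cross-validated:", best_h)
```

The leave-one-out score penalizes both extremes: a tiny $h$ assigns almost no density to held-out points, while a huge $h$ spreads the mass too thinly.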

04 · Adaptive Width

Nearest Neighbors Density Estimator

Instead of a fixed bandwidth $h$, the NN estimator adapts the window size to local data density: use the distance to the $k$-th nearest neighbor as the radius.

NN Density Estimator $$\hat{p}(x) = \frac{k}{2N\,d_k(x)} \qquad \text{(smooth variant: } \hat{p}(x) = \frac{1}{N\,d_k(x)}\sum_{t=1}^{N} K\!\left(\frac{x-x^t}{d_k(x)}\right)\text{)}$$
  • Dense regions: $d_k(x)$ is small → narrow window → high estimated density
  • Sparse regions: $d_k(x)$ is large → wide window → lower estimated density
  • Advantage over KDE: the window adapts automatically to local density. The global bandwidth $h$ is replaced by the neighbor count $k$, which still has to be chosen
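A minimal one-dimensional sketch of the estimator $\hat{p}(x) = k / (2N\,d_k(x))$ follows. The sample is hypothetical, and the code assumes $d_k(x) > 0$, that is, the query does not coincide with $k$ or more sample points.

```python
def knn_density(sample, x, k):
    """k-NN density estimate in 1D: k / (2 * N * d_k(x))."""
    distances = sorted(abs(x - xt) for xt in sample)
    d_k = distances[k - 1]          # distance to the k-th nearest neighbor
    return k / (2 * len(sample) * d_k)

sample = [1.0, 1.1, 1.2, 1.3, 5.0]  # four clustered points and one outlier
dense = knn_density(sample, 1.15, k=2)   # inside the cluster
sparse = knn_density(sample, 4.0, k=2)   # far from most of the data
print(dense, sparse)
```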
05 · High Dimensions

Multivariate KDE & the Curse of Dimensionality

Extending KDE to $d$ dimensions requires a multivariate kernel. The Gaussian kernel generalizes naturally:

Multivariate KDE $$\hat{p}(\mathbf{x}) = \frac{1}{Nh^d}\sum_{t=1}^{N} K\!\left(\frac{\mathbf{x}-\mathbf{x}^t}{h}\right), \qquad K(\mathbf{u}) = \left(\frac{1}{\sqrt{2\pi}}\right)^d e^{-\|\mathbf{u}\|^2/2}$$
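The multivariate formula translates almost directly into code. This is a sketch with a single shared bandwidth $h$ for all dimensions, as in the formula; the function name and test point are assumptions.

```python
import math

def multivariate_kde(sample, x, h):
    """Gaussian KDE in d dimensions with one bandwidth h for every axis."""
    d = len(x)
    norm = (2 * math.pi) ** (-d / 2)        # (1 / sqrt(2*pi))^d
    total = 0.0
    for xt in sample:
        squared_norm = sum(((a - b) / h) ** 2 for a, b in zip(x, xt))
        total += norm * math.exp(-0.5 * squared_norm)
    return total / (len(sample) * h ** d)

# With one sample point at the origin, the estimate at the origin is the
# peak of a standard 2D Gaussian, 1 / (2*pi).
peak = multivariate_kde([(0.0, 0.0)], (0.0, 0.0), h=1.0)
print(peak)
```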
The Curse of Dimensionality

In high dimensions, distances concentrate: the nearest and the farthest neighbor of a query end up at nearly the same distance, so "nearest" carries little information. With 10 bins per dimension in an 8-dimensional space, there are $10^8 = 100\,000\,000$ bins, so even 1 million samples leave at least 99% of the bins empty. Density estimation (and k-NN) degrades rapidly as $d$ grows.

  • For data spread uniformly in a hypersphere, most of the volume (and hence most points) lies near the surface, not the center
  • The relative gap between the largest and smallest pairwise distances shrinks toward zero as $d \to \infty$
  • k-NN assumes class probabilities are roughly constant within the neighborhood; this breaks down when the neighborhood must grow large to contain $k$ points
  • Solutions: dimensionality reduction (PCA, autoencoders), feature selection, manifold learning
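Distance concentration is easy to observe empirically. The sketch below draws points uniformly from the unit hypercube (an assumption made for the demonstration) and reports the relative contrast $(d_{\max} - d_{\min}) / d_{\min}$ between a query and the data, along with the bin-count arithmetic from the example above.

```python
import math
import random

def relative_contrast(n_points, dim, rng):
    """(farthest - nearest) / nearest distance from a random query."""
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)
contrast_low = relative_contrast(200, 2, rng)
contrast_high = relative_contrast(200, 1000, rng)
print("d = 2:   ", contrast_low)
print("d = 1000:", contrast_high)

# Bin arithmetic: 10 bins per axis in 8 dimensions, 1 million samples.
n_bins = 10 ** 8
max_occupied_fraction = 10 ** 6 / n_bins   # at most 1% of bins are non-empty
print(n_bins, max_occupied_fraction)
```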
06 · Classification

k-Nearest Neighbors (k-NN)

Algorithm

To classify a new point $\mathbf{x}$: compute distances to all training points, identify the $k$ nearest, assign the majority class among them.

k-NN Neighborhood $$d_1(\mathbf{x}) \leq d_2(\mathbf{x}) \leq \cdots \leq d_N(\mathbf{x}), \qquad h = 2\,d_k(\mathbf{x})$$
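The algorithm can be sketched as a brute-force search followed by a majority vote. Euclidean distance and the toy data are assumptions for the example; ties in the vote are resolved by whichever class `Counter` encountered first, which is an arbitrary choice.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train is a list of (feature_vector, label) pairs."""
    # Sort all training points by distance to the query and keep the k nearest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B"),
]
print(knn_classify(train, (2, 2), k=3))
print(knn_classify(train, (5, 4), k=3))
```

Sorting the whole training set costs $O(N \log N)$ per query; a partial selection such as `heapq.nsmallest` would avoid the full sort but leave the $O(Nd)$ distance computation unchanged.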
🔢

Choosing $k$

An odd $k$ avoids tied votes in binary classification. Tune via cross-validation: plot validation error vs. $k$ and pick the value with the lowest error. Small $k$: overfitting (high variance). Large $k$: oversmoothing (high bias).

📏

Distance metric

Euclidean ($\ell_2$) is default. Manhattan ($\ell_1$) for high-$d$. Cosine for text. Hamming for categorical. The choice critically affects the neighborhood shape.

🏆

Tie-breaking

When votes are tied: use the closest neighbor's class, use distance-weighted voting (weight each vote by the inverse of its distance to the query), or pick randomly among the tied classes.
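Inverse-distance weighting can be sketched as follows. The small constant `eps`, added to avoid division by zero when the query coincides with a training point, is an assumption of this example, as is the toy data.

```python
import math
from collections import defaultdict

def weighted_knn_classify(train, query, k, eps=1e-12):
    """Each of the k nearest neighbors votes with weight 1 / distance."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    scores = defaultdict(float)
    for features, label in neighbors:
        scores[label] += 1.0 / (math.dist(features, query) + eps)
    return max(scores, key=scores.get)

# With k = 4 the plain vote is tied 2-2, but class A owns the closest point.
train = [((0.5,), "A"), ((3.0,), "A"), ((1.0,), "B"), ((2.0,), "B")]
winner = weighted_knn_classify(train, (0.0,), k=4)
print(winner)
```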

Non-Parametric Classification

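The kernel estimator yields class-conditional densities: $\hat{p}(\mathbf{x} \mid C_i) = \frac{1}{N_i h^d}\sum_t K\!\left(\frac{\mathbf{x}-\mathbf{x}^t}{h}\right) r_i^t$, where $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and $0$ otherwise, and $N_i = \sum_t r_i^t$. With the prior estimated as $\hat{P}(C_i) = N_i/N$, the discriminant function is $g_i(\mathbf{x}) = \hat{p}(\mathbf{x} \mid C_i)\,\hat{P}(C_i)$. k-NN classification follows the same recipe with the window set by the distance to the $k$-th neighbor instead of a fixed $h$.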
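For the k-NN case, a short derivation shows why the rule reduces to majority voting. The symbols $k_i$ (the number of the $k$ nearest neighbors that belong to $C_i$) and $V^k(\mathbf{x})$ (the volume of the smallest hypersphere centered at $\mathbf{x}$ that contains those $k$ neighbors) are notation assumed for this sketch:

$$\hat{p}(\mathbf{x} \mid C_i) = \frac{k_i}{N_i\,V^k(\mathbf{x})}, \qquad \hat{P}(C_i) = \frac{N_i}{N}, \qquad \hat{p}(\mathbf{x}) = \frac{k}{N\,V^k(\mathbf{x})}$$

$$\hat{P}(C_i \mid \mathbf{x}) = \frac{\hat{p}(\mathbf{x} \mid C_i)\,\hat{P}(C_i)}{\hat{p}(\mathbf{x})} = \frac{k_i}{N_i\,V^k(\mathbf{x})} \cdot \frac{N_i}{N} \cdot \frac{N\,V^k(\mathbf{x})}{k} = \frac{k_i}{k}$$

The volume and sample-size terms cancel, so choosing the class with the largest posterior is the same as choosing the class with the most votes among the $k$ neighbors.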

07 · Worked Example

k-NN Example: Political Affiliation ($k = 3$)

Test instance: female, young, rich. Distance metric: count of mismatching attribute values (0 = same, 1 = different).

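| # | Gender | Age | Wealth | Politics | Distance |
|---|---|---|---|---|---|
| 1 | male | middle-aged | rich | Right-wing | 2 (gender, age differ) |
| 2 | male | young | rich | Right-wing | 1 (gender differs) |
| 3 | female | young | poor | Left-wing | 1 (wealth differs) |
| 4 | female | middle-aged | poor | Left-wing | 2 (age, wealth differ) |
| 5 | male | young | poor | Right-wing | 2 (gender, wealth differ) |
| 6 | male | old | poor | Right-wing | 3 (all differ) |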

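3 nearest neighbors: instances 2 and 3 (distance 1), plus one of instances 1, 4, or 5, which are tied at distance 2. The prediction depends on how that tie is broken: adding instance 1 or 5 gives two Right-wing votes against one Left-wing vote (Right-wing wins), while adding instance 4 gives two Left-wing votes (Left-wing wins). Counting all five instances within distance 2 gives 3 Right-wing vs. 2 Left-wing, so the prediction is Right-wing.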
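The distances in the table can be recomputed with a few lines of Python. Because several instances share the distance of the third neighbor, this sketch includes every instance at or within that distance; that tie rule is an assumption chosen for illustration, not a rule stated in the lecture.

```python
from collections import Counter

training = [
    (("male", "middle-aged", "rich"), "Right-wing"),
    (("male", "young", "rich"), "Right-wing"),
    (("female", "young", "poor"), "Left-wing"),
    (("female", "middle-aged", "poor"), "Left-wing"),
    (("male", "young", "poor"), "Right-wing"),
    (("male", "old", "poor"), "Right-wing"),
]
query = ("female", "young", "rich")
k = 3

def mismatch_distance(a, b):
    """Number of attributes on which two instances differ."""
    return sum(1 for u, v in zip(a, b) if u != v)

distances = [mismatch_distance(features, query) for features, _ in training]

# Distance of the k-th nearest neighbor; keep everything at or within it.
cutoff = sorted(distances)[k - 1]
votes = Counter(label for (_, label), dist in zip(training, distances)
                if dist <= cutoff)
prediction = votes.most_common(1)[0][0]
print(distances, dict(votes), prediction)
```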

08 · Improvements

Adaptive Nearest Neighbor Methods

Standard k-NN uses Euclidean distance, treating all dimensions equally. Adaptive NN learns a Mahalanobis distance metric that stretches and rotates the neighborhood based on local data covariance.

Adaptive (Mahalanobis) Distance $$D(\mathbf{x}, \mathbf{x}_0) = (\mathbf{x} - \mathbf{x}_0)^T \Sigma (\mathbf{x} - \mathbf{x}_0), \qquad \Sigma = W^{-1/2}[B^* + \varepsilon I]W^{-1/2}$$
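  • $W$: pooled within-class covariance matrix, used to rescale (sphere) the features
  • $B^*$: between-class covariance matrix $B$ after sphering, $B^* = W^{-1/2} B W^{-1/2}$; both $W$ and $B$ are estimated from points near $\mathbf{x}_0$
  • $\varepsilon I$: small regularization term for numerical stability; it also keeps the neighborhood bounded in directions where $B^*$ has zero eigenvalues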
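The effect of the matrix $\Sigma$ on the neighborhood shape can be illustrated with a plain quadratic form. In this sketch $\Sigma$ is written by hand rather than computed from $W$ and $B^*$, so it shows only how a given metric reshapes distances, not how the metric is learned.

```python
def quadratic_form_distance(x, x0, sigma):
    """D(x, x0) = (x - x0)^T Sigma (x - x0), a squared distance."""
    diff = [a - b for a, b in zip(x, x0)]
    n = len(diff)
    return sum(diff[i] * sigma[i][j] * diff[j]
               for i in range(n) for j in range(n))

identity = [[1.0, 0.0], [0.0, 1.0]]        # plain squared Euclidean distance
stretched = [[1.0, 0.0], [0.0, 0.01]]      # second axis barely counts

x0 = (0.0, 0.0)
a = (1.0, 0.0)
b = (0.0, 3.0)

# Under the identity metric a is closer to x0; under the stretched metric
# the neighborhood extends along the second axis and b becomes closer.
print(quadratic_form_distance(a, x0, identity),
      quadratic_form_distance(b, x0, identity))
print(quadratic_form_distance(a, x0, stretched),
      quadratic_form_distance(b, x0, stretched))
```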
Limitations

Computing $W^{-1/2}$ and performing matrix operations is expensive in high dimensions. Local covariance estimates are unstable in sparse regions. The method's performance is highly sensitive to parameter tuning.

09 · Compression

Prototyping

k-NN requires computing distances to every training point at prediction time — $O(Nd)$ per query. Prototyping replaces the full training set with a compact set of representative points.

📍

k-Means Prototypes

Run k-Means within each class to find $K$ cluster centers per class. Use these as prototypes. Fast — but cluster centers may end up near class boundaries, hurting accuracy. Other classes don't influence prototype placement.

🌊

Gaussian Mixture Models

Fit a GMM per class using EM (soft clustering). Prototypes are the mixture component means. Soft clustering accounts for uncertainty near boundaries, placing prototypes more representatively.

Trade-off

More prototypes → better accuracy but slower prediction. Fewer prototypes → faster but less accurate. The optimal number is found by cross-validation. Prototyping is essentially a form of model compression for lazy learners.
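A minimal sketch of the k-Means variant: cluster each class separately, keep the centers as labeled prototypes, and classify by the nearest prototype. The synthetic two-class data, the fixed iteration count, and the function names are assumptions for the example.

```python
import math
import random

def kmeans(points, n_clusters, rng, n_iter=20):
    """Plain Lloyd iterations; returns the cluster centers."""
    centers = rng.sample(points, n_clusters)
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers

def build_prototypes(train, per_class, rng):
    """Run k-means inside each class; other classes are ignored."""
    by_class = {}
    for features, label in train:
        by_class.setdefault(label, []).append(features)
    return [(center, label)
            for label, pts in by_class.items()
            for center in kmeans(pts, per_class, rng)]

def classify(prototypes, query):
    """1-NN over the prototypes instead of the full training set."""
    return min(prototypes, key=lambda item: math.dist(item[0], query))[1]

rng = random.Random(0)
train = ([((rng.gauss(0, 1), rng.gauss(0, 1)), "A") for _ in range(50)]
         + [((rng.gauss(5, 1), rng.gauss(5, 1)), "B") for _ in range(50)])
prototypes = build_prototypes(train, per_class=2, rng=rng)
print(len(train), "training points ->", len(prototypes), "prototypes")
print(classify(prototypes, (0.0, 0.0)), classify(prototypes, (5.0, 5.0)))
```

Each query now costs four distance computations instead of one hundred, at the price of a coarser decision boundary.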

10 · Scalability

Locality Sensitive Hashing (LSH)

Even with prototyping, exact k-NN search remains linear in the number of stored points ($O(Nd)$ for $N$ points in $d$ dimensions). LSH enables approximate nearest neighbor search in sublinear time by hashing similar points into the same bucket with high probability.

Standard Hashing

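Designed for fast lookup with a uniform spread of keys across buckets. Similar inputs produce unrelated hash values; for cryptographic hashes such as MD5 this is by design. MD5("apple") and MD5("apricot") have nothing in common even though both inputs are fruits, so the hash says nothing about similarity.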

Locality Sensitive Hashing

Hash functions tailored to a distance metric so that similar items land in the same bucket with high probability. Used for approximate nearest neighbor search. Balances false positives vs. false negatives via parameter tuning.

Random Projection Hashing (Euclidean)

LSH Hash Function — Random Projection $$h_{p,b}(\mathbf{x}) = \left\lfloor \frac{\mathbf{p} \cdot \mathbf{x} + b}{w} \right\rfloor$$

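where $\mathbf{p}$ is a random direction vector (entries typically drawn i.i.d. from $\mathcal{N}(0, 1)$), $b$ is a random offset drawn uniformly from $[0, w)$, and $w$ is the bucket width. Nearby vectors have nearby projections and thus land in the same bucket with high probability.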
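The locality-sensitive property can be checked empirically with a short sketch. It assumes Gaussian entries for $\mathbf{p}$ and a uniform offset $b$; the bucket width, the test points, and the helper `collision_rate` are hypothetical choices for the demonstration.

```python
import math
import random

def make_hash(dim, w, rng):
    """Draw one random-projection hash h(x) = floor((p . x + b) / w)."""
    p = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    def h(x):
        projection = sum(pi * xi for pi, xi in zip(p, x))
        return math.floor((projection + b) / w)
    return h

def collision_rate(x, y, dim, w, trials, rng):
    """Fraction of freshly drawn hash functions that put x and y together."""
    hits = 0
    for _ in range(trials):
        h = make_hash(dim, w, rng)
        hits += h(x) == h(y)
    return hits / trials

rng = random.Random(0)
near = collision_rate((0.0, 0.0), (0.1, 0.0), dim=2, w=4.0, trials=2000, rng=rng)
far = collision_rate((0.0, 0.0), (10.0, 0.0), dim=2, w=4.0, trials=2000, rng=rng)
print("near pair:", near, "far pair:", far)
```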

LSH Algorithm

  • Preprocessing: insert every training point into each of the $L$ hash tables
  • For each of $L$ hash tables: retrieve all points from the same bucket as the query
  • Compute exact distances to candidate points from all tables
  • Return the $k$ nearest candidates found
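The query procedure can be sketched as a small index class. This is one possible layout, not a reference implementation: the class name `LSHIndex`, the parameters `n_tables` and `n_hashes`, and the random test data are all assumptions. Each table concatenates several random-projection hashes into one bucket key.

```python
import math
import random
from collections import defaultdict

class LSHIndex:
    def __init__(self, dim, n_tables, n_hashes, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.points = []
        self.tables = []
        for _ in range(n_tables):
            # Each table has its own (p, b) pairs and its own bucket map.
            projections = [([rng.gauss(0, 1) for _ in range(dim)],
                            rng.uniform(0, w)) for _ in range(n_hashes)]
            self.tables.append((projections, defaultdict(list)))

    def _key(self, projections, x):
        return tuple(
            math.floor((sum(pi * xi for pi, xi in zip(p, x)) + b) / self.w)
            for p, b in projections)

    def add(self, x):
        index = len(self.points)
        self.points.append(x)
        for projections, buckets in self.tables:
            buckets[self._key(projections, x)].append(index)

    def query(self, x, k):
        # Gather candidates from the query's bucket in every table.
        candidates = set()
        for projections, buckets in self.tables:
            candidates.update(buckets.get(self._key(projections, x), []))
        # Rank only the candidates by exact distance.
        ranked = sorted(candidates, key=lambda i: math.dist(self.points[i], x))
        return ranked[:k]

rng = random.Random(1)
data = [tuple(rng.uniform(0, 10) for _ in range(5)) for _ in range(200)]
index = LSHIndex(dim=5, n_tables=4, n_hashes=3, w=4.0)
for point in data:
    index.add(point)
result = index.query(data[17], k=3)
print(result)
```

A query that equals an indexed point always finds it, because identical vectors hash to identical keys; for other queries the true nearest neighbor may be missed, which is the approximation being traded for speed.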
The Win

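Instead of comparing the query to all $N$ training points, LSH compares it only to points in the same bucket(s). With suitable parameters, query time is sublinear in $N$ (of order $N^{\rho}$ with $\rho < 1$, where $\rho$ depends on the approximation factor). The approximation quality is controlled by the number of hash tables $L$ and the number of hash functions concatenated per table (written $k$ here, not to be confused with the number of neighbors).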


Next Lecture — 17 April

Support Vector Machines & Kernel Methods. We move to the maximum-margin classifier — deriving the hard and soft margin SVM from first principles, formulating the Lagrangian dual, and extending to nonlinear boundaries via the kernel trick.