Lecture 7 · Chapter 8 · 14 April

Lazy Learners & Density Estimation

Instead of building an explicit model during training, lazy learners defer all computation to prediction time — looking up the most similar training examples to answer each query. We examine the theory, algorithms, and scalability solutions that make this practical.

Authors: Seixas Junior, Mahmud & Koren
Key algorithms: k-NN · KDE · LSH
Date: 14 April

00 · Core Distinction

Lazy vs. Eager Learning

Lazy Learning

Postpones generalization until a query is made. Training stores examples; prediction computes similarities on the fly.

Property	Lazy (Instance-Based)	Eager (Model-Based)
When it generalizes	At prediction time	During training
Training cost	Near zero — just store data	High — build model
Prediction cost	High — search all stored data	Low — evaluate model
Memory	Entire training set	Model parameters only
Flexibility	Adapts to local patterns naturally	Fixed inductive bias
Examples	k-NN, kernel regression	Decision trees, SVM, neural nets

01 · Classification

Parametric vs. Non-Parametric

📐

Parametric Methods

Assume a fixed functional form (e.g., Gaussian, linear). Summarize data in a fixed number of parameters. Analogy: following a recipe — predict house price using learned coefficients for square footage and bedrooms.

🔎

Non-Parametric Methods

No fixed form — model complexity grows with data. Analogy: comparing similar houses — find the $k$ most similar houses and average their prices. The data itself is the model.

Key Connection

Non-parametric methods are typically lazy learners — because without a compact model, you need the raw data at prediction time. Parametric methods are typically eager — once parameters are estimated, raw data can be discarded.

02 · Density Estimation

Density Estimation

Given i.i.d. samples $\mathcal{X} = \{x^t\}_{t=1}^N$ from an unknown distribution $p(x)$, estimate $p(x)$ without assuming a parametric form.

Parametric

Assume a specific form (e.g., Gaussian). Fit parameters via MLE. Fast prediction. Fails if assumption is wrong.

Non-Parametric

Minimal assumptions. Examples: histogram, KDE, k-NN estimator. Flexible but computationally intensive.

Semi-Parametric

Combine a known parametric structure with non-parametric adjustments. E.g., Gaussian mixture models with learned components.

Empirical CDF and Density $$\hat{F}(x) = \frac{\#\{x^t \leq x\}}{N} \qquad \hat{p}(x) = \frac{1}{h}\left[\frac{\#\{x^t \leq x+h\} - \#\{x^t \leq x\}}{N}\right]$$
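As a minimal sketch of the two formulas above (the function names and the sample values are illustrative, not from the lecture), the empirical CDF can be computed by counting, and the density by differencing the CDF over a window of width $h$:

```python
from bisect import bisect_right

def empirical_cdf(sample, x):
    """Fraction of sample values that are <= x."""
    ordered = sorted(sample)
    return bisect_right(ordered, x) / len(ordered)

def empirical_density(sample, x, h):
    """Difference quotient of the empirical CDF over the window (x, x + h]."""
    return (empirical_cdf(sample, x + h) - empirical_cdf(sample, x)) / h

sample = [1.0, 2.0, 2.5, 3.0, 4.0]
print(empirical_cdf(sample, 2.5))           # 3 of the 5 values are <= 2.5
print(empirical_density(sample, 2.0, 1.0))  # 2 of the 5 values fall in (2, 3]
```

Sorting on every call keeps the sketch short; sorting once up front would be the obvious choice for repeated queries.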

03 · Estimators

Histogram, Naive & Kernel Estimators

Histogram

Divide the range into fixed bins of width $h$. Count the examples per bin and normalize by $Nh$: $\hat{p}(x) = \frac{\#\{x^t \text{ in same bin as } x\}}{Nh}$. Simple but discontinuous: the estimate is constant within each bin, jumps at the bin edges, and depends on where the bin origin is placed.

Naive Estimator

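A sliding window of width $h$ centered at the query point $x$: $\hat{p}(x) = \frac{\#\{x^t \in (x-h/2, x+h/2]\}}{Nh}$. Equivalent to a histogram whose bin is always centered on $x$, which removes the dependence on a bin origin. The estimate is still discontinuous: it jumps whenever a sample enters or leaves the window.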

Kernel Estimator (KDE)

Replace the hard window with a smooth weight function (kernel). Each data point contributes a bell curve: $\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} K\!\left(\frac{x-x^t}{h}\right)$. With Gaussian kernel: $K(u) = \frac{1}{\sqrt{2\pi}}e^{-u^2/2}$. Produces a smooth density estimate.
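The three estimators differ only in how they weight the samples around $x$. The sketch below implements each formula directly in one dimension; the `origin` argument of the histogram and the sample values are assumptions made for illustration.

```python
import math

def histogram_estimate(sample, x, h, origin=0.0):
    """Count samples in the same bin as x, normalized by N*h."""
    def bin_of(v):
        return math.floor((v - origin) / h)
    count = sum(1 for v in sample if bin_of(v) == bin_of(x))
    return count / (len(sample) * h)

def naive_estimate(sample, x, h):
    """Count samples in the window (x - h/2, x + h/2], normalized by N*h."""
    count = sum(1 for v in sample if x - h / 2 < v <= x + h / 2)
    return count / (len(sample) * h)

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde_estimate(sample, x, h):
    """Sum of Gaussian bumps centered on the samples, normalized by N*h."""
    return sum(gaussian_kernel((x - v) / h) for v in sample) / (len(sample) * h)

sample = [1.0, 1.2, 1.9, 2.4, 3.1, 3.3, 4.0, 5.2]
for estimator in (histogram_estimate, naive_estimate, kde_estimate):
    print(estimator.__name__, estimator(sample, 2.0, 1.0))
```

Evaluating each function on a fine grid of $x$ values shows the difference in smoothness: the first two produce step functions, the third a smooth curve.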

Bandwidth $h$ — The Critical Hyperparameter

Small $h$: spiky, high-variance estimate that overfits. Large $h$: over-smoothed, high-bias estimate that misses true peaks. Bandwidth selection (rule-of-thumb, cross-validation) is the central challenge of KDE.
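The lecture does not say which rule of thumb or cross-validation criterion it has in mind, so the following is only one possible sketch: the normal reference rule $h = 1.06\,\hat{\sigma}\,N^{-1/5}$, and a leave-one-out log-likelihood score for comparing candidate bandwidths of a Gaussian KDE. All function names are illustrative.

```python
import math
import statistics

def normal_reference_bandwidth(sample):
    """Rule of thumb h = 1.06 * sigma * N^(-1/5), suited to roughly Gaussian data."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-1 / 5)

def loo_log_likelihood(sample, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    n = len(sample)
    total = 0.0
    for i, x in enumerate(sample):
        bumps = sum(math.exp(-0.5 * ((x - v) / h) ** 2)
                    for j, v in enumerate(sample) if j != i)
        density = bumps / ((n - 1) * h * math.sqrt(2 * math.pi))
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total

def select_bandwidth(sample, candidates):
    """Pick the candidate bandwidth with the highest leave-one-out score."""
    return max(candidates, key=lambda h: loo_log_likelihood(sample, h))

sample = [1.0, 1.2, 1.9, 2.4, 3.1, 3.3, 4.0, 5.2]
print(normal_reference_bandwidth(sample))
print(select_bandwidth(sample, [0.001, 0.25, 0.5, 1.0, 2.0]))
```

Leaving each point out before scoring it is what keeps a very small $h$ from winning: without it, every sample would sit under its own spike and the likelihood would grow without bound as $h \to 0$.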

04 · Adaptive Width

Nearest Neighbors Density Estimator

Instead of a fixed bandwidth $h$, the NN estimator adapts the window size to local data density: use the distance to the $k$-th nearest neighbor as the radius.

NN Density Estimator $$\hat{p}(x) = \frac{k}{2N\,d_k(x)} \qquad \text{(smooth variant: } \hat{p}(x) = \frac{1}{N\,d_k(x)}\sum_{t=1}^{N} K\!\left(\frac{x-x^t}{d_k(x)}\right)\text{)}$$

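Dense regions: $d_k(x)$ is small → narrow window → high estimated density
Sparse regions: $d_k(x)$ is large → wide window → lower estimated density
Advantage over KDE: the window adapts automatically to local density, so no global bandwidth $h$ has to be tuned; the smoothing parameter is $k$ instead. Caveat: the basic estimate is not continuous and does not integrate to 1, so it is not a proper density.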
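A minimal one-dimensional sketch of the basic estimator $k / (2N\,d_k(x))$ follows; the function name and the sample are illustrative assumptions.

```python
def knn_density(sample, x, k):
    """1-D k-NN density estimate k / (2 * N * d_k(x))."""
    distances = sorted(abs(x - v) for v in sample)
    d_k = distances[k - 1]  # distance to the k-th nearest sample
    if d_k == 0.0:
        raise ValueError("k-th neighbor coincides with x; the estimate is unbounded")
    return k / (2 * len(sample) * d_k)

sample = [0.0, 0.1, 0.2, 0.3, 5.0]
print(knn_density(sample, 0.15, k=2))  # inside the dense cluster
print(knn_density(sample, 4.0, k=2))   # in the sparse region
```

The same $k$ yields a window of half-width about 0.05 inside the cluster and 3.7 in the sparse region, which is the adaptive behavior described above.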

05 · High Dimensions

Multivariate KDE & the Curse of Dimensionality

Extending KDE to $d$ dimensions requires a multivariate kernel. The Gaussian kernel generalizes naturally:

Multivariate KDE $$\hat{p}(\mathbf{x}) = \frac{1}{Nh^d}\sum_{t=1}^{N} K\!\left(\frac{\mathbf{x}-\mathbf{x}^t}{h}\right), \qquad K(\mathbf{u}) = \left(\frac{1}{\sqrt{2\pi}}\right)^d e^{-\|\mathbf{u}\|^2/2}$$
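The multivariate formula translates almost line by line into code. This sketch assumes a single bandwidth $h$ shared by all dimensions, as in the formula above; the function name is illustrative.

```python
import math

def multivariate_kde(points, x, h):
    """Gaussian KDE in d dimensions with one shared bandwidth h."""
    d = len(x)
    norm = (2 * math.pi) ** (-d / 2)  # (1 / sqrt(2*pi))^d
    total = 0.0
    for p in points:
        sq_norm = sum(((xi - pi) / h) ** 2 for xi, pi in zip(x, p))
        total += norm * math.exp(-0.5 * sq_norm)
    return total / (len(points) * h ** d)

points = [(0.0, 0.0), (1.0, 0.5), (0.8, 1.2), (3.0, 3.0)]
print(multivariate_kde(points, (1.0, 1.0), h=0.5))
```

Because one $h$ is used for every axis, features with very different scales would normally be standardized first.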

The Curse of Dimensionality

In high dimensions, pairwise distances concentrate: the nearest and farthest neighbors of a query end up at nearly the same distance, so "nearest" carries little information. Volume also grows exponentially. With 10 bins per dimension in an 8-dimensional space, there are $10^8 = 100\,000\,000$ bins, so even 1 million samples leave at least 99% of the bins empty. Density estimation (and k-NN) degrades rapidly as $d$ grows.

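For data spread uniformly inside a hypersphere, most points lie near the surface, not the center
The relative gap between the smallest and largest pairwise distance shrinks toward zero as $d \to \infty$
k-NN assumes that class probabilities are roughly constant within the neighborhood; in high dimensions a neighborhood that contains $k$ points is so large that this assumption breaks down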
Solutions: dimensionality reduction (PCA, autoencoders), feature selection, manifold learning
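Both effects are easy to check numerically. The sketch below, assuming points drawn uniformly from the unit hypercube (an illustrative choice), measures how much farther the farthest point is than the nearest one, and repeats the bin-counting arithmetic from the text.

```python
import math
import random

def relative_contrast(d, n_points=200, seed=0):
    """(max - min) / min distance from a random query to uniform points in [0, 1]^d."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(d)]
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))

bins = 10 ** 8      # 10 bins per dimension, 8 dimensions
samples = 10 ** 6
print(samples / bins)  # upper bound on the fraction of non-empty bins
```

The contrast shrinks steadily as $d$ grows, which is the sense in which distances "lose meaning".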

06 · Classification

k-Nearest Neighbors (k-NN)

Algorithm

To classify a new point $\mathbf{x}$: compute distances to all training points, identify the $k$ nearest, assign the majority class among them.

k-NN Neighborhood $$d_1(\mathbf{x}) \leq d_2(\mathbf{x}) \leq \cdots \leq d_N(\mathbf{x}), \qquad h = 2\,d_k(\mathbf{x})$$
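The algorithm can be sketched as a brute-force search. The data layout (a list of `(features, label)` pairs) and the function names are assumptions for illustration, and Euclidean distance is used.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def knn_classify(train, query, k):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(train, key=lambda item: euclidean(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_classify(train, (0.5, 0.5), k=3))
print(knn_classify(train, (5.5, 5.5), k=3))
```

Each query sorts all $N$ distances, so this costs $O(Nd + N \log N)$. Ties in distance or in votes are resolved here by whatever order the training list happens to have, which is rarely what one wants in practice.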

🔢

Choosing $k$

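An odd $k$ avoids tied votes in binary classification. Tune via cross-validation: plot validation error against $k$ and pick the value with the lowest error, or the elbow of the curve if a smoother boundary is preferred. Small $k$: noisy boundaries that overfit. Large $k$: oversmoothing.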

📏

Distance metric

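Euclidean ($\ell_2$) is the usual default. Manhattan ($\ell_1$) is sometimes preferred in high dimensions, cosine distance for text, and Hamming distance for categorical attributes. The choice determines the shape of the neighborhood, and features should be on comparable scales before distances are computed.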

🏆

Tie-breaking

When votes are tied: use the closest neighbor's class, use weighted voting (weight each vote by the inverse of its distance to the query), or use random selection.
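Inverse-distance weighting can be sketched as follows. The `eps` term, which keeps a zero distance from dividing by zero, and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-12):
    """neighbors: list of (distance, label). Each vote counts 1 / distance."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + eps)
    return max(scores, key=scores.get)

# Two votes each, so a plain majority vote is tied; the closer pair wins.
neighbors = [(0.5, "A"), (2.0, "B"), (2.0, "B"), (0.6, "A")]
print(weighted_vote(neighbors))
```

With weighting, exact ties become rare, since they require the summed inverse distances to match exactly.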

Non-Parametric Classification

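Any non-parametric density estimate can serve as a class-conditional density. With the kernel estimator: $\hat{p}(\mathbf{x} \mid C_i) = \frac{1}{N_i h^d}\sum_t K\!\left(\frac{\mathbf{x}-\mathbf{x}^t}{h}\right) r_i^t$ where $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise, and $N_i = \sum_t r_i^t$. With the prior estimated as $\hat{P}(C_i) = N_i/N$, the discriminant function is $g_i(\mathbf{x}) = \hat{p}(\mathbf{x} \mid C_i)\,\hat{P}(C_i) = \frac{1}{N h^d}\sum_t K\!\left(\frac{\mathbf{x}-\mathbf{x}^t}{h}\right) r_i^t$. k-NN classification is the same construction with the k-NN density estimator in place of the kernel estimator.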
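For the k-NN estimator itself, a short derivation shows why majority voting follows from Bayes' rule. The symbols $V^k(\mathbf{x})$, the volume of the smallest hypersphere around $\mathbf{x}$ that contains its $k$ nearest neighbors, and $k_i$, the number of those neighbors belonging to $C_i$, are introduced here for the derivation:

$$\hat{p}(\mathbf{x} \mid C_i) = \frac{k_i}{N_i\,V^k(\mathbf{x})}, \qquad \hat{p}(\mathbf{x}) = \frac{k}{N\,V^k(\mathbf{x})}, \qquad \hat{P}(C_i) = \frac{N_i}{N} \quad\Longrightarrow\quad \hat{P}(C_i \mid \mathbf{x}) = \frac{\hat{p}(\mathbf{x} \mid C_i)\,\hat{P}(C_i)}{\hat{p}(\mathbf{x})} = \frac{k_i}{k}$$

The volume and the class sizes cancel, so choosing the class with the largest posterior is the same as choosing the class with the most votes among the $k$ neighbors.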

07 · Worked Example

k-NN Example: Political Affiliation ($k = 3$)

Test instance: female, young, rich. Distance metric: count of mismatching attribute values (0 = same, 1 = different).

#	Gender	Age	Wealth	Politics	Distance
1	male	middle-aged	rich	Right-wing	2 (gender, age differ)
2	male	young	rich	Right-wing	1 (gender differs)
3	female	young	poor	Left-wing	1 (wealth differs)
4	female	middle-aged	poor	Left-wing	2 (age, wealth differ)
5	male	young	poor	Right-wing	2 (gender, wealth differ)
6	male	old	poor	Right-wing	3 (all differ)

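3 nearest neighbors: instances 2 and 3 (distance 1) plus one of instances 1, 4, or 5, which are tied at distance 2. The result depends on how the tie is broken: picking 1 or 5 gives Right-wing by 2 votes to 1, while picking 4 gives Left-wing by 2 to 1. Letting all three tied instances vote gives Right-wing by 3 votes to 2.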
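The sketch below recomputes the distances from the table. As one possible tie-handling choice (an assumption, not necessarily the lecture's), it lets every instance tied at the $k$-th distance take part in the vote.

```python
from collections import Counter

train = [
    (("male", "middle-aged", "rich"), "Right-wing"),
    (("male", "young", "rich"), "Right-wing"),
    (("female", "young", "poor"), "Left-wing"),
    (("female", "middle-aged", "poor"), "Left-wing"),
    (("male", "young", "poor"), "Right-wing"),
    (("male", "old", "poor"), "Right-wing"),
]
query = ("female", "young", "rich")
k = 3

def mismatch_distance(a, b):
    """Number of attributes on which two instances differ."""
    return sum(1 for u, v in zip(a, b) if u != v)

distances = [mismatch_distance(features, query) for features, _ in train]
cutoff = sorted(distances)[k - 1]  # distance of the k-th nearest neighbor

# Keep every instance at or below the cutoff, so ties at the cutoff all vote.
neighbors = [(i + 1, label)
             for i, ((_, label), dist) in enumerate(zip(train, distances))
             if dist <= cutoff]
votes = Counter(label for _, label in neighbors)

print(distances)             # one distance per table row
print(neighbors)             # (instance number, label) pairs that vote
print(votes.most_common(1))  # winning class and its vote count
```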

08 · Improvements

Adaptive Nearest Neighbor Methods

Standard k-NN uses Euclidean distance, treating all dimensions equally. Adaptive NN learns a Mahalanobis distance metric that stretches and rotates the neighborhood based on local data covariance.

Adaptive (Mahalanobis) Distance $$D(\mathbf{x}, \mathbf{x}_0) = (\mathbf{x} - \mathbf{x}_0)^T \Sigma (\mathbf{x} - \mathbf{x}_0), \qquad \Sigma = W^{-1/2}[B^* + \varepsilon I]W^{-1/2}$$

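$W$: pooled within-class covariance matrix of the points near $\mathbf{x}_0$, used to sphere the data
$B^*$: between-class covariance matrix of the same points after sphering, $B^* = W^{-1/2} B W^{-1/2}$
$\varepsilon I$: small regularization term that keeps the neighborhood from stretching without bound along directions with no between-class variation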
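The metric can be computed from a labeled neighborhood in a few lines of NumPy. This is a sketch under stated assumptions: $W$ is taken to be the pooled within-class covariance, $B$ the between-class covariance with $B^* = W^{-1/2} B W^{-1/2}$, the arrays passed in are the points near $\mathbf{x}_0$, $W$ is invertible, and the default `eps=1.0` is an arbitrary choice.

```python
import numpy as np

def adaptive_metric(X, y, eps=1.0):
    """Sigma = W^{-1/2} [B* + eps I] W^{-1/2}, with B* = W^{-1/2} B W^{-1/2}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    overall_mean = X.mean(axis=0)
    W = np.zeros((d, d))
    B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        centered = Xc - Xc.mean(axis=0)
        W += centered.T @ centered / n           # pooled within-class covariance
        diff = (Xc.mean(axis=0) - overall_mean)[:, None]
        B += (len(Xc) / n) * (diff @ diff.T)     # between-class covariance
    vals, vecs = np.linalg.eigh(W)               # W is symmetric
    W_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    B_star = W_inv_sqrt @ B @ W_inv_sqrt
    return W_inv_sqrt @ (B_star + eps * np.eye(d)) @ W_inv_sqrt

def adaptive_distance(x, x0, Sigma):
    diff = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    return float(diff @ Sigma @ diff)

# Two classes separated along the first axis only.
X = [(-2, -1), (-2, 1), (-1, -1), (-1, 1), (1, -1), (1, 1), (2, -1), (2, 1)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
Sigma = adaptive_metric(X, y)
print(Sigma)
print(adaptive_distance((1, 0), (0, 0), Sigma), adaptive_distance((0, 1), (0, 0), Sigma))
```

In this toy neighborhood a unit step across the class boundary costs far more than a unit step along it, so the effective neighborhood stretches parallel to the boundary.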

Limitations

Computing $W^{-1/2}$ and performing matrix operations is expensive in high dimensions. Local covariance estimates are unstable in sparse regions. The method's performance is highly sensitive to parameter tuning.

09 · Compression

Prototyping

k-NN requires computing distances to every training point at prediction time — $O(Nd)$ per query. Prototyping replaces the full training set with a compact set of representative points.

📍

k-Means Prototypes

Run k-Means separately within each class to find a fixed number of cluster centers per class, and use these centers as prototypes. Fast, but because other classes do not influence prototype placement, some centers can end up near class boundaries, where they cause misclassifications.

🌊

Gaussian Mixture Models

Fit a GMM per class using EM (soft clustering). Prototypes are the mixture component means. Soft clustering accounts for uncertainty near boundaries, placing prototypes more representatively.

Trade-off

More prototypes → better accuracy but slower prediction. Fewer prototypes → faster but less accurate. The optimal number is found by cross-validation. Prototyping is essentially a form of model compression for lazy learners.
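A minimal sketch of per-class k-Means prototypes with nearest-prototype classification follows. It uses plain Lloyd iterations with a deterministic start at the first points of each class; the names, the fixed iteration count, and the initialization are assumptions chosen to keep the example short.

```python
def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(points, n_clusters, n_iter=20):
    """Lloyd iterations; centers start at the first n_clusters points."""
    centers = [list(p) for p in points[:n_clusters]]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def class_prototypes(train, per_class):
    """Run k-Means inside each class and label the centers with that class."""
    by_class = {}
    for features, label in train:
        by_class.setdefault(label, []).append(features)
    return [(tuple(center), label)
            for label, pts in by_class.items()
            for center in kmeans(pts, per_class)]

def classify_by_prototype(prototypes, query):
    return min(prototypes, key=lambda item: sq_dist(item[0], query))[1]

train = [((0, 0), "a"), ((10, 0), "a"), ((0, 1), "a"), ((10, 1), "a"),
         ((5, 5), "b"), ((5, 6), "b")]
prototypes = class_prototypes(train, per_class=2)
print(prototypes)
print(classify_by_prototype(prototypes, (9, 0)))
```

Prediction now costs one distance per prototype instead of one per training point. Real implementations usually add random restarts, since Lloyd iterations can settle in a poor local optimum.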

10 · Scalability

Locality Sensitive Hashing (LSH)

Even with prototyping, exact k-NN search remains linear in the number of stored points: $O(Nd)$ for $N$ points in $d$ dimensions. LSH enables approximate nearest neighbor search in sublinear time by hashing similar points into the same bucket with high probability.

Standard Hashing

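Designed for fast exact lookup, spreading keys uniformly over the buckets. Similar inputs produce unrelated hash values; for cryptographic hashes such as MD5 this is deliberate. MD5("apple") and MD5("apricot") have nothing in common even though the strings share a prefix and both name fruits, so sharing a bucket says nothing about similarity.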

Locality Sensitive Hashing

Hash functions tailored to a distance metric so that similar items land in the same bucket with high probability. Used for approximate nearest neighbor search. Balances false positives vs. false negatives via parameter tuning.

Random Projection Hashing (Euclidean)

LSH Hash Function — Random Projection $$h_{p,b}(\mathbf{x}) = \left\lfloor \frac{\mathbf{p} \cdot \mathbf{x} + b}{w} \right\rfloor$$

where $\mathbf{p}$ is a random direction vector (typically with i.i.d. Gaussian entries), $b$ is a random offset drawn uniformly from $[0, w)$, and $w$ is the bucket width. Nearby vectors have similar projections and therefore land in the same bucket with high probability.

LSH Algorithm

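Preprocessing: insert every training point into each of $L$ hash tables, where each table is keyed by the concatenation of several random hash functions
For each of the $L$ hash tables: retrieve all points from the same bucket as the query
Compute exact distances to the candidate points collected from all tables
Return the $k$ nearest candidates found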
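The hash function and the query steps above can be sketched as a small index class. The class name, parameter names, and default values are assumptions made for illustration; each table concatenates `n_hashes` random-projection hashes into one bucket key.

```python
import math
import random
from collections import defaultdict

class RandomProjectionLSH:
    def __init__(self, dim, n_tables=8, n_hashes=4, width=1.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.points = []
        self.tables = []  # one (hash functions, buckets) pair per table
        for _ in range(n_tables):
            funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)],  # direction p
                      rng.uniform(0.0, width))                    # offset b
                     for _ in range(n_hashes)]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, x):
        """Concatenate floor((p . x + b) / w) over the table's hash functions."""
        return tuple(
            math.floor((sum(pi * xi for pi, xi in zip(p, x)) + b) / self.width)
            for p, b in funcs)

    def add(self, x):
        idx = len(self.points)
        self.points.append(x)
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, x)].append(idx)

    def query(self, x, k):
        """Indices of up to k nearest points among the bucket candidates."""
        candidates = set()
        for funcs, buckets in self.tables:
            candidates.update(buckets.get(self._key(funcs, x), []))
        def dist(i):
            return math.sqrt(sum((u - v) ** 2 for u, v in zip(x, self.points[i])))
        return sorted(candidates, key=dist)[:k]

index = RandomProjectionLSH(dim=2, seed=1)
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 1.0)]
for p in points:
    index.add(p)
print(index.query((5.0, 5.0), k=2))
```

Only points sharing a bucket with the query in at least one table are ever compared exactly, so a true neighbor can be missed. A stored point queried with its own coordinates is always found, because it hashes to the same key in every table.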

The Win

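Instead of comparing the query to all $N$ training points, LSH compares it only to the points that share a bucket with it in at least one table. This is typically a small fraction of the data, which gives sublinear query time for suitable parameters. The approximation quality is controlled by the number of hash tables $L$ and the number of hash functions concatenated per table (often written $k$, not to be confused with the number of neighbors): more tables raise the chance of finding true neighbors, while more functions per table shrink the buckets.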

Next Lecture — 17 April

Support Vector Machines & Kernel Methods. We move to the maximum-margin classifier — deriving the hard and soft margin SVM from first principles, formulating the Lagrangian dual, and extending to nonlinear boundaries via the kernel trick.
