Machine Learning & Supervised Learning
A foundational survey — from the question of why learning is needed, through the statistical machinery of supervised learning, to the theoretical guarantees that bound what a learner can achieve.
Course Schedule
The first half of the semester builds theoretical foundations — from classical supervised learning through probabilistic models for sequential data.
| Date | Topic |
|---|---|
| 13 Feb | Plans · Supervised Learning refresh (Ch. 1–2) (today) |
| 20 Feb | Bayesian Decision Theory (Ch. 3) |
| 27 Feb | Parametric Methods (Ch. 4) |
| 06 Mar | Probabilistic Graphical Models (Ch. 14) |
| 13 Mar | Hidden Markov Models (Ch. 15) |
| 20 Mar | Consultation / Review |
Two partial tests — mid-semester and end-of-semester — covering lecture and lab material. One opportunity to improve a partial grade. A written exam is available for those who don't pass via tests.
Why "Learn"?
Machine learning is programming computers to optimize a performance criterion using example data or past experience.
Not every problem requires learning. Calculating payroll follows deterministic rules — a formula suffices. Learning becomes necessary when the task is too complex, dynamic, or subjective for hand-crafted rules.
When Learning Is Necessary
No human expertise
Navigating Mars — no human has direct experience. The agent must develop its own policy from sensor data.
Inexplicable expertise
Speech recognition — humans decode speech effortlessly but cannot articulate the phoneme-to-meaning rules.
Changing solutions
Network routing — optimal paths shift constantly. A learned model adapts as the environment changes.
Personalization
User biometrics — every person is distinct. The system must adapt its model to individual-specific patterns.
What Is Machine Learning?
ML optimizes a performance criterion using example data or past experience. It sits at the intersection of two disciplines:
📊 Statistics
Inference from a sample. Provides the theoretical basis for generalizing from observations to population-level truths — handling uncertainty, distributions, and confidence.
💻 Computer Science
Efficient algorithms. Solving the optimization problem at scale, and representing and evaluating the learned model so that inference is fast.
The Data–Knowledge Trade-off
ML generalizes from particular examples to general models. Data is cheap and abundant; domain knowledge is expensive and scarce. ML therefore inverts the traditional software paradigm: instead of a programmer writing the rules, the rules are inferred from data.
Raw transaction logs → inferred consumer behavior: "People who bought Blink also bought Outliers." No human wrote that rule. The goal: build a model that is a good and useful approximation to the data.
Learning Paradigms
Association Learning
Discovering co-occurrence patterns without labels. Classic: market basket analysis.
Supervised Learning
Learning from labeled pairs $(\mathbf{x}, r)$. Covers classification (discrete) and regression (continuous).
Unsupervised Learning
Finding hidden structure in unlabeled data — clustering, density estimation, dimensionality reduction.
Reinforcement Learning
Optimal behavior from reward signals via environment interaction. Sequential decision-making.
Learning a Class from Examples
The canonical setup: identify which cars belong to the class "family car." This toy problem illustrates every core concept in supervised classification.
Input features
Each car $\mathbf{x}$ has two attributes:
$x_1$ = price · $x_2$ = engine power
Output labels
$r = 1$ → family car (positive)
$r = 0$ → not a family car (negative)
The Training Set
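In the notation used later in this section ($\mathbf{x}^t$, $r^t$, $\mathcal{X}$, $N$), a training set of $N$ labeled cars can be written as follows. This is a reconstruction from that notation, offered as an assumption rather than a quotation:

$$
\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}, \qquad \mathbf{x}^t = \begin{bmatrix} x_1^t \\ x_2^t \end{bmatrix}, \qquad r^t \in \{0, 1\}
$$

Here $t$ indexes the examples, so $\mathbf{x}^t$ is the (price, engine power) pair of the $t$-th car and $r^t$ is its label.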
Two Goals of Classification
- Prediction — Given a new car $\mathbf{x}$, is it a family car?
- Knowledge extraction — What features characterize a family car? (interpretability)
Hypothesis Space & Version Space
Concept Class $\mathcal{C}$
The set of possible true labeling functions, that is, the ground truth the learner tries to approximate.
Hypothesis Class $\mathcal{H}$
The functions the learner can actually output, constrained by architecture. A hypothesis $h \in \mathcal{H}$ assigns a predicted label $h(\mathbf{x}) \in \{0, 1\}$ to every input $\mathbf{x}$.
Empirical Error
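The empirical error of a hypothesis $h$ on a training set $\mathcal{X}$ of $N$ examples is usually taken to be the number of training examples it mislabels, $E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\big(h(\mathbf{x}^t) \neq r^t\big)$. The sketch below assumes that $\mathcal{H}$ is the set of axis-aligned rectangles in the (price, engine power) plane, the example this section uses for $\mathcal{H}$. The names `Rectangle` and `empirical_error` and all data values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# ((price, engine_power), label r in {0, 1})
Example = Tuple[Tuple[float, float], int]


@dataclass(frozen=True)
class Rectangle:
    """Axis-aligned rectangle hypothesis: predicts 1 inside, 0 outside."""
    p1: float  # lower price bound
    p2: float  # upper price bound
    e1: float  # lower engine-power bound
    e2: float  # upper engine-power bound

    def predict(self, x: Tuple[float, float]) -> int:
        price, power = x
        return int(self.p1 <= price <= self.p2 and self.e1 <= power <= self.e2)


def empirical_error(h: Rectangle, data: List[Example]) -> int:
    """Count training examples whose predicted label differs from r."""
    return sum(1 for x, r in data if h.predict(x) != r)


# Made-up training set: price in thousands, engine power in kW.
train = [((18, 90), 1), ((22, 110), 1), ((25, 100), 1),
         ((8, 50), 0), ((60, 250), 0), ((30, 200), 0)]

h_good = Rectangle(15, 28, 80, 120)   # encloses only the positives
h_wide = Rectangle(5, 35, 40, 210)    # also swallows two negatives

print(empirical_error(h_good, train))  # 0
print(empirical_error(h_wide, train))  # 2
```

Dividing the count by $N$ gives an error rate instead of an error count; which of the two is meant is a convention, not something this sketch settles.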
S, G, and the Version Space
Most Specific (S)
Tightest boundary still consistent with all positive training examples.
Most General (G)
Broadest consistent boundary: the largest hypothesis that still excludes every negative training example.
The version space is the set of all hypotheses $h \in \mathcal{H}$ that lie between S and G. Each has zero training error, so any $h$ in the version space is consistent with all available evidence (Mitchell, 1997).
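For axis-aligned rectangles, the most specific hypothesis S can be sketched as the bounding box of the positive examples. The code below is a minimal illustration under that assumption; the function names and the data are hypothetical, and computing G is left out to keep the sketch short.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x1_min, x1_max, x2_min, x2_max)


def most_specific(positives: List[Point]) -> Rect:
    """Tightest axis-aligned rectangle around the positive examples."""
    xs = [p[0] for p in positives]
    ys = [p[1] for p in positives]
    return (min(xs), max(xs), min(ys), max(ys))


def contains(rect: Rect, p: Point) -> bool:
    x_lo, x_hi, y_lo, y_hi = rect
    return x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi


def is_consistent(rect: Rect, positives: List[Point], negatives: List[Point]) -> bool:
    """Zero training error: every positive inside, every negative outside."""
    return (all(contains(rect, p) for p in positives)
            and not any(contains(rect, n) for n in negatives))


positives = [(18, 90), (22, 110), (25, 100)]   # made-up family cars
negatives = [(8, 50), (60, 250), (30, 200)]    # made-up non-family cars

S = most_specific(positives)
print(S)                                        # (18, 25, 90, 110)
print(is_consistent(S, positives, negatives))   # True

# A slightly larger rectangle is also consistent, so it belongs
# to the version space as well.
print(is_consistent((15, 28, 80, 120), positives, negatives))  # True
```

Any rectangle for which `is_consistent` returns `True` is a member of the version space for this toy data.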
Margin
When many consistent hypotheses exist, choose the one whose decision boundary lies as far as possible from the nearest training points: the maximum-margin hypothesis. A larger margin means less sensitivity to small perturbations of the inputs, which tends to improve generalization. This is the intuition behind Support Vector Machines.
VC Dimension
The VC dimension measures the capacity of a hypothesis class — how expressive is $\mathcal{H}$, and can it memorize any labeling of $N$ points?
Shattering
$\mathcal{H}$ shatters a set of $N$ points if, for every binary labeling of those points ($2^N$ possibilities), some $h \in \mathcal{H}$ achieves zero error.
$\text{VC}(\mathcal{H})$ = the largest $N$ such that $\mathcal{H}$ can shatter some set of $N$ points.
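The class of axis-aligned rectangles in $\mathbb{R}^2$ can shatter some sets of 4 points (for example, four points arranged in a diamond), though not every 4-point set. No 5-point configuration can be shattered: the rectangle enclosing the leftmost, rightmost, topmost, and bottommost points must also contain the fifth. Therefore $\text{VC}(\text{axis-aligned rectangles}) = 4$.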
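The shattering claim can be checked by brute force on specific point sets. The sketch below relies on one observation: a labeling is realizable by an axis-aligned rectangle exactly when the bounding box of the positive points contains no negative point. The function names and point sets are hypothetical, and checking a few sets is an illustration, not a proof that no 5-point set can be shattered.

```python
from itertools import product
from typing import List, Tuple

Point = Tuple[float, float]


def rectangle_realizes(points: List[Point], labels: Tuple[int, ...]) -> bool:
    """True if some axis-aligned rectangle labels `points` exactly as `labels`."""
    pos = [p for p, r in zip(points, labels) if r == 1]
    if not pos:
        return True  # a rectangle placed away from all points labels everything 0
    x_lo, x_hi = min(p[0] for p in pos), max(p[0] for p in pos)
    y_lo, y_hi = min(p[1] for p in pos), max(p[1] for p in pos)
    # Any rectangle containing all positives contains their bounding box,
    # so the labeling is realizable iff that box holds no negative point.
    return not any(
        x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi
        for p, r in zip(points, labels) if r == 0
    )


def shatters(points: List[Point]) -> bool:
    """Check all 2^N labelings of the given points."""
    return all(rectangle_realizes(points, labels)
               for labels in product((0, 1), repeat=len(points)))


diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(shatters(diamond))                            # True
print(shatters(diamond + [(0, 0)]))                 # False
print(shatters([(0, 0), (1, 0), (2, 0), (3, 0)]))   # False (collinear points)
```

The last call shows why the definition asks only for some shatterable set of $N$ points: four collinear points cannot be shattered, yet the VC dimension is still 4.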
For binary classification, $\mathcal{H}$ is PAC-learnable if and only if its VC dimension is finite, which makes the VC dimension the key measure of learnability.
PAC Learning
Probably Approximately Correct (PAC) learning asks: how many training examples do we need to learn reliably?
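$\mathcal{H}$ is PAC-learnable if, for any $\varepsilon, \delta \in (0, 1)$ and any data distribution, a learner given a number of samples polynomial in $1/\varepsilon$ and $1/\delta$ outputs a hypothesis with error $\leq \varepsilon$ with probability $\geq 1 - \delta$.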
Reading the Bound
- Higher VC dimension $d$ → more expressive class → more samples needed
- Smaller $\varepsilon$ (tighter error tolerance) → more samples needed
- Smaller $\delta$ (higher confidence required) → more samples needed
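For reference, one standard form of the realizable-case sample complexity bound in terms of the VC dimension $d$ is given below. It is taken from the general learning-theory literature as an assumption about which bound is meant here, with constants hidden in the $O(\cdot)$:

$$
N = O\!\left(\frac{1}{\varepsilon}\left(d \log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right)
$$

Under this form, $N$ grows linearly in $d$, roughly as $1/\varepsilon$ in the error tolerance, and only logarithmically in $1/\delta$, so demanding higher confidence is much cheaper than demanding lower error.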
Concept Class vs. Hypothesis Class
Concept Class $\mathcal{C}$
The possible true labeling functions — what actually governs the world. (e.g., real consumer definition of "family car")
Hypothesis Class $\mathcal{H}$
Functions the learner can output — constrained by model architecture. (e.g., axis-aligned rectangles)
Realizable vs. Agnostic
Realizable PAC
$\mathcal{C} \subseteq \mathcal{H}$. The true concept lies inside the hypothesis class. Zero training error is achievable in principle.
Agnostic PAC
No assumption about $\mathcal{C}$ vs. $\mathcal{H}$. Goal: find $h \in \mathcal{H}$ whose error is within $\varepsilon$ of the best hypothesis in $\mathcal{H}$.
Geometric Intuition (Rectangle Case)
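Suppose the learner outputs the tightest rectangle around the positive examples. It always lies inside the true rectangle, so its errors are confined to four strips between the two boundaries, one per side. For the learned hypothesis to be $\varepsilon$-accurate, it suffices that the sample hits each strip:
- Define each strip along a side of the true rectangle to have probability mass exactly $\varepsilon / 4$; if every strip contains at least one training example, the total error is at most $\varepsilon$
- Probability that all $N$ independent examples miss one given strip: $(1 - \varepsilon/4)^N$
- Union bound over all 4 strips: the failure probability is at most $4(1 - \varepsilon/4)^N$, which we require to be $\leq \delta$
- Using $(1-x) \leq e^{-x}$: it suffices that $N \geq \tfrac{4}{\varepsilon}\ln\tfrac{4}{\delta}$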
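To get a feel for the numbers, the rectangle bound can be evaluated directly. This is a small sketch with hypothetical function names; it treats the logarithm as the natural logarithm, which is what the $e^{-x}$ step implies.

```python
import math


def rectangle_sample_bound(eps: float, delta: float) -> int:
    """Smallest integer N with N >= (4 / eps) * ln(4 / delta)."""
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie in (0, 1)")
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))


def failure_probability_bound(eps: float, n: int) -> float:
    """Union bound 4 * (1 - eps/4)^n on the chance that some strip is missed."""
    return 4.0 * (1.0 - eps / 4.0) ** n


n = rectangle_sample_bound(eps=0.1, delta=0.05)
print(n)                                          # 176
print(failure_probability_bound(0.1, n) <= 0.05)  # True
```

So about 176 examples are enough for error at most 10% with confidence at least 95% in this idealized setting. The bound is sufficient, not necessary: fewer examples may work in practice.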
Agnostic Learning
In practice, the realizable assumption ($\mathcal{C} \subseteq \mathcal{H}$) almost never holds. ML is almost always operating in the agnostic setting.
Three Reasons ML Is Agnostic
Unknown true function
We don't know the true labeling function, the true distribution, or whether labels are noise-free.
Messy real data
Label noise, measurement errors, missing features, unobserved confounders, non-stationarity, adversarial samples.
Imperfect model classes
Linear models can't fit nonlinear truth. Decision trees can't represent smooth boundaries. There is always approximation error.
Circle vs. rectangles
True concept is a circle; $\mathcal{H}$ = axis-aligned rectangles. $\mathcal{C} \not\subseteq \mathcal{H}$ — realizable assumption fails.
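When the realizable assumption fails, zero error on the true distribution is unachievable (and zero training error is no longer guaranteed), realizable PAC bounds don't apply, and the learner converges to the best approximation within $\mathcal{H}$ rather than to the true concept. The gap that remains even for the best hypothesis in $\mathcal{H}$ is called approximation error; more data cannot reduce it.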
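The circle-versus-rectangles case can be made concrete with a rough numerical sketch. It assumes the true concept is the unit disk, inputs are uniform on $[-1.5, 1.5]^2$, and, as a simplification, only squares centered at the origin are tried (a small subset of $\mathcal{H}$). The function name, grid resolution, and candidate list are all hypothetical choices.

```python
def disagreement(a: float, n: int = 300, half_width: float = 1.5) -> float:
    """Fraction of a uniform n-by-n grid on [-1.5, 1.5]^2 where the square
    |x1| <= a, |x2| <= a and the unit disk assign different labels."""
    step = 2 * half_width / n
    wrong = 0
    for i in range(n):
        x = -half_width + (i + 0.5) * step      # cell midpoints
        for j in range(n):
            y = -half_width + (j + 0.5) * step
            in_disk = x * x + y * y <= 1.0      # true concept
            in_square = abs(x) <= a and abs(y) <= a   # hypothesis
            wrong += in_disk != in_square
    return wrong / (n * n)


candidates = [0.6, 0.7, 0.8, 0.9, 1.0, 1.1]     # half-widths of the square
errors = {a: disagreement(a) for a in candidates}
best_a = min(errors, key=errors.get)
print(best_a, round(errors[best_a], 3))
```

Every candidate has strictly positive error, and the smallest value among them is the approximation error of this restricted family. Collecting more training data would help the learner find the best square, but it would not push the error below that floor.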
Noise and Model Complexity
Sources of Noise
- Labeling errors — annotators make mistakes; crowdsourced labels are inconsistent.
- Measurement errors — sensors are imprecise; recorded values deviate from true values.
- Latent factors — unmeasured variables influence the output; the model is always misspecified.
Occam's Razor — Prefer Simpler Models
When multiple consistent hypotheses fit the data equally well, prefer the simpler one:
Lower computational cost
Faster inference. Deployable on constrained devices. Cheaper to serve at scale.
Easier to train
Fewer parameters → faster convergence, lower space complexity, less sensitivity to initialization.
More interpretable
Easier to inspect, explain, and audit — critical in high-stakes applications.
Better generalization
Lower variance — less sensitive to the specific random training sample. Occam's Razor in action.
Multiple Classes & Regression
Multiple Classes
Binary classification extends to $K$ classes $\{C_1, \ldots, C_K\}$ using one-hot labels. Train one hypothesis $h_i$ per class (one-vs-rest), each separating $C_i$ from all the other classes.
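In one-vs-rest, the target for hypothesis $h_i$ is 1 when an example belongs to $C_i$ and 0 otherwise, which is just the $i$-th component of the one-hot label. The sketch below builds those targets; the function names and the sample class indices are hypothetical, and classes are numbered from 0 in the code rather than from 1 as in the text.

```python
from typing import List


def one_hot(class_index: int, num_classes: int) -> List[int]:
    """Label vector r with r[i] = 1 for the true class and 0 elsewhere."""
    return [1 if i == class_index else 0 for i in range(num_classes)]


def one_vs_rest_targets(class_indices: List[int], num_classes: int) -> List[List[int]]:
    """targets[i][t] = 1 if example t belongs to class i, else 0.
    Row i is the binary training signal for hypothesis h_i."""
    return [[1 if c == i else 0 for c in class_indices]
            for i in range(num_classes)]


classes = [0, 2, 1, 2]   # made-up class indices for four examples, K = 3
print(one_hot(2, 3))                      # [0, 0, 1]
print(one_vs_rest_targets(classes, 3))    # [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 1]]
```

Each row can then be handed to any binary learner, so $K$ classes cost $K$ binary training runs.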
Regression
When the output $r^t \in \mathbb{R}$ is a real number rather than a discrete label, the problem is regression. The data are assumed to be generated by an unknown function plus noise.
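A common way to write this data model uses $f$ for the unknown function and $\eta$ for zero-mean noise. Both symbols are introduced here as assumptions; $\eta$ is chosen to avoid a clash with the PAC tolerance $\varepsilon$:

$$
r^t = f(\mathbf{x}^t) + \eta
$$

The learner approximates $f$ with a model $g(\mathbf{x})$. A usual choice of empirical error, again assumed rather than taken from this section, is the mean squared error $E(g \mid \mathcal{X}) = \frac{1}{N}\sum_{t=1}^{N}\big(r^t - g(\mathbf{x}^t)\big)^2$.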
Linear: $g(x) = w_1 x + w_0$
Closed-form solution. A straight line for one input; a hyperplane when there are several inputs.
Polynomial: $g(x) = w_2 x^2 + w_1 x + w_0$
More flexible, but risks overfitting with limited data.
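For the linear model, the closed-form least-squares solution can be sketched in a few lines. This assumes squared error as the loss and a single input; the function names and the data are hypothetical. The polynomial case needs the normal equations for more coefficients and is not shown.

```python
from typing import List, Tuple


def fit_line(xs: List[float], rs: List[float]) -> Tuple[float, float]:
    """Closed-form least-squares estimates (w1, w0) for g(x) = w1 * x + w0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
    w1 = sxr / sxx                 # slope: covariance over variance
    w0 = mean_r - w1 * mean_x      # intercept: line passes through the means
    return w1, w0


def mse(xs: List[float], rs: List[float], w1: float, w0: float) -> float:
    """Mean squared error of the fitted line on the given data."""
    return sum((r - (w1 * x + w0)) ** 2 for x, r in zip(xs, rs)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
rs = [1.1, 2.9, 5.2, 6.8, 9.0]    # made-up, roughly r = 2x + 1 plus noise
w1, w0 = fit_line(xs, rs)
print(round(w1, 3), round(w0, 3))  # 1.97 1.06
print(mse(xs, rs, w1, w0))
```

The fitted slope and intercept land close to the values used to make up the data, with the residual error reflecting the added noise.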
Model Selection & Generalization
Data alone is insufficient to identify a unique solution — many consistent hypotheses may fit equally well. Without additional assumptions, learning is impossible.
Inductive Bias
The set of assumptions that allow the learner to select one hypothesis among many consistent ones. Every algorithm encodes inductive bias:
- Restricting $\mathcal{H}$ to axis-aligned rectangles
- Choosing the maximum-margin hypothesis (SVM bias)
- Assuming linearity (linear regression bias)
- Minimizing MSE as the loss function
Overfitting vs. Underfitting
Overfitting
$\mathcal{H}$ more complex than needed. Memorizes training noise. Low training error, high test error.
Underfitting
$\mathcal{H}$ too simple. Fails to capture real patterns. High training error and high test error.
The Triple Trade-Off — Dietterich, 2003
An unavoidable interplay between three factors. You cannot optimize all three simultaneously:
Model Complexity
$c(\mathcal{H})$ — expressiveness of the hypothesis class
Training Set Size
$N$ — labeled examples available
Generalization Error
$E$ — performance on unseen data. The ultimate measure.
As $N \uparrow$, error $E \downarrow$. As complexity $\uparrow$, error first $\downarrow$ (escaping underfitting) then $\uparrow$ (overfitting). There is an optimal complexity — the sweet spot where the model is expressive enough without memorizing noise.
Cross-Validation
To estimate generalization honestly, evaluate on data the model has never seen during training. Training error is an optimistically biased estimate: a model evaluated only on the data it was fit to tends to look better than it really is.
Training Set — 50%
Fits model parameters. The algorithm sees this data directly during optimization.
Validation Set — 25%
Model selection — choosing hyperparameters and architecture. Touched iteratively.
Test Set — 25%
Touched exactly once at the end. Reports honest final performance. Never used for decisions.
k-Fold CV
When data is scarce: split the data into $k$ folds, then rotate, training on $k{-}1$ folds and validating on the remaining one. Every data point is used for validation exactly once.
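The fold rotation can be sketched as an index generator. This is a minimal version with a hypothetical function name; it uses contiguous folds and assumes the data has already been shuffled if its order is not random.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


for train_idx, val_idx in k_fold_indices(n=10, k=3):
    print(val_idx, len(train_idx))
# [0, 1, 2, 3] 6
# [4, 5, 6] 7
# [7, 8, 9] 7
```

Averaging the validation error over the $k$ rounds gives the cross-validation estimate. A separate test set is still needed if the estimate was used to choose hyperparameters.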
Dimensions of a Supervised Learner
Every supervised learning algorithm — from linear regression to large neural networks — is fully characterized by three design choices:
1 · The Model
The parametric family $g(\mathbf{x} \mid \theta)$. What shapes of functions can the learner represent?
2 · The Loss Function
How is error measured?
$E(\theta|\mathcal{X}) = \sum_t L(r^t, g(\mathbf{x}^t|\theta))$
3 · The Optimizer
How is the best $\theta$ found?
$\theta^* = \arg\min_\theta\; E(\theta|\mathcal{X})$
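The three choices can be seen side by side in a small sketch. It assumes a linear model, squared-error loss, and plain gradient descent as the optimizer; these are illustrative picks, and the function names, learning rate, step count, and data are hypothetical.

```python
from typing import List, Tuple

Theta = Tuple[float, float]


def g(x: float, theta: Theta) -> float:
    """1. Model: g(x | theta) = w1 * x + w0."""
    w1, w0 = theta
    return w1 * x + w0


def loss(r: float, y: float) -> float:
    """2. Loss: squared error L(r, y) = (r - y)^2."""
    return (r - y) ** 2


def total_error(theta: Theta, xs: List[float], rs: List[float]) -> float:
    """E(theta | X) = sum over t of L(r^t, g(x^t | theta))."""
    return sum(loss(r, g(x, theta)) for x, r in zip(xs, rs))


def gradient_descent(xs: List[float], rs: List[float],
                     lr: float = 0.01, steps: int = 2000) -> Theta:
    """3. Optimizer: gradient descent on E, starting from theta = (0, 0)."""
    w1, w0 = 0.0, 0.0
    for _ in range(steps):
        residuals = [g(x, (w1, w0)) - r for x, r in zip(xs, rs)]
        grad_w1 = sum(2 * e * x for e, x in zip(residuals, xs))
        grad_w0 = sum(2 * e for e in residuals)
        w1, w0 = w1 - lr * grad_w1, w0 - lr * grad_w0
    return w1, w0


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
rs = [1.0, 3.0, 5.0, 7.0, 9.0]   # made-up data lying exactly on r = 2x + 1
theta_star = gradient_descent(xs, rs)
print(theta_star, total_error(theta_star, xs, rs))
```

Swapping any one of the three pieces (a polynomial for `g`, absolute error for `loss`, a closed-form solve for the optimizer) yields a different algorithm while the other two stay unchanged.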
The Universal Template
When encountering any new ML algorithm, ask: what is the model, the loss, and the optimization procedure?
Next lecture: Bayesian Decision Theory (Ch. 3). We move from empirical error minimization to a probabilistic framework: given a prior over classes and a likelihood model, what is the optimal decision boundary?