Lecture 1–2 · Machine Learning & Supervised Learning

00 · Overview

Course Schedule

The first half of the semester builds theoretical foundations — from classical supervised learning through probabilistic models for sequential data.

Date	Topic
13 Feb	Plans · Supervised Learning refresh (Ch. 1–2) (today)
20 Feb	Bayesian Decision Theory (Ch. 3)
27 Feb	Parametric Methods (Ch. 4)
06 Mar	Probabilistic Graphical Models (Ch. 14)
13 Mar	Hidden Markov Models (Ch. 15)
20 Mar	Consultation / Review

Assessment

Two partial tests — mid-semester and end-of-semester — covering lecture and lab material. One opportunity to improve a partial grade. A written exam is available for those who don't pass via tests.

01 · Motivation

Why "Learn"?

Core Definition

Machine learning is programming computers to optimize a performance criterion using example data or past experience.

Not every problem requires learning. Calculating payroll follows deterministic rules — a formula suffices. Learning becomes necessary when the task is too complex, dynamic, or subjective for hand-crafted rules.

When Learning Is Necessary

🚀

No human expertise

Navigating Mars — no human has direct experience. The agent must develop its own policy from sensor data.

🗣️

Inexplicable expertise

Speech recognition — humans decode speech effortlessly but cannot articulate the phoneme-to-meaning rules.

🔄

Changing solutions

Network routing — optimal paths shift constantly. A learned model adapts as the environment changes.

🧬

Personalization

User biometrics — every person is distinct. The system must adapt its model to individual-specific patterns.

02 · Definition

What Is Machine Learning?

ML optimizes a performance criterion using example data or past experience. It sits at the intersection of two disciplines:

📊 Statistics

Inference from a sample. Provides the theoretical basis for generalizing from observations to population-level truths — handling uncertainty, distributions, and confidence.

💻 Computer Science

Efficient algorithms for solving the optimization problem at scale, and for representing and evaluating the model so that inference is fast.

The Data–Knowledge Trade-off

ML generalizes from particular examples to general models. Data is cheap and abundant; domain knowledge is expensive and scarce. ML therefore inverts the traditional software paradigm: instead of hand-coding rules, it infers them from data.

The Amazon Example

Raw transaction logs → inferred consumer behavior: "People who bought Blink also bought Outliers." No human wrote that rule. The goal: build a model that is a good and useful approximation to the data.

03 · Taxonomy

Learning Paradigms

🔗

Association Learning

Discovering co-occurrence patterns without labels. Classic: market basket analysis.
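
As a rough illustration of co-occurrence discovery, the sketch below counts how often each pair of items appears in the same basket. The baskets are invented for the example (only the titles "Blink" and "Outliers" come from the lecture), and `pair_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# Made-up transaction log: each basket is the set of items in one purchase.
baskets = [
    {"Blink", "Outliers"},
    {"Blink", "Outliers", "Other Book"},
    {"Blink"},
    {"Other Book"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
print(counts[("Blink", "Outliers")])  # 2
```

No labels are involved: the pattern comes purely from which items occur together.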

🎯

Supervised Learning

Learning from labeled pairs $(\mathbf{x}, r)$. Covers classification (discrete) and regression (continuous).

🔍

Unsupervised Learning

Finding hidden structure in unlabeled data — clustering, density estimation, dimensionality reduction.

🎮

Reinforcement Learning

Optimal behavior from reward signals via environment interaction. Sequential decision-making.

04 · Supervised Learning

Learning a Class from Examples

The canonical setup: identify which cars belong to the class "family car." This toy problem illustrates every core concept in supervised classification.

Input features

Each car $\mathbf{x}$ has two attributes:
$x_1$ = price · $x_2$ = engine power

Output labels

$r = 1$ → family car (positive)
$r = 0$ → not a family car (negative)

The Training Set

Training Set $$\mathcal{X} = \{\mathbf{x}^t,\, r^t\}_{t=1}^{N}$$

Two Goals of Classification

Prediction — Given a new car $\mathbf{x}$, is it a family car?
Knowledge extraction — What features characterize a family car? (interpretability)

05 · Theory

Hypothesis Space & Version Space

Concept Class $\mathcal{C}$

The set of possible true labeling functions — the ground truth the learner approximates:

True Concept — Family Car $$C: (p_1 \leq \text{price} \leq p_2) \;\wedge\; (e_1 \leq \text{engine power} \leq e_2)$$

Hypothesis Class $\mathcal{H}$

The functions the learner can actually output — constrained by architecture. A hypothesis $h \in \mathcal{H}$ assigns predicted labels:

Hypothesis $$h(\mathbf{x}) = \begin{cases} 1 & h \text{ predicts positive} \\ 0 & h \text{ predicts negative} \end{cases}$$

Empirical Error

Empirical Error on Training Set $$E(h \mid \mathcal{X}) = \frac{1}{N}\sum_{t=1}^{N} \mathbf{1}\bigl[h(\mathbf{x}^t) \neq r^t\bigr]$$
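
A minimal sketch of the empirical error for an axis-aligned rectangle hypothesis in the family-car setup. The function names, the rectangle bounds, and the four toy cars are illustrative assumptions, not data from the lecture.

```python
from typing import Callable, List, Tuple

Car = Tuple[float, float]  # (price, engine power)


def make_rectangle(p1: float, p2: float, e1: float, e2: float) -> Callable[[Car], int]:
    """Return h(x): 1 if the car lies inside the rectangle, else 0."""
    def h(x: Car) -> int:
        price, power = x
        return int(p1 <= price <= p2 and e1 <= power <= e2)
    return h


def empirical_error(h: Callable[[Car], int], X: List[Car], r: List[int]) -> float:
    """Fraction of training examples on which h disagrees with the label."""
    mistakes = sum(1 for x, label in zip(X, r) if h(x) != label)
    return mistakes / len(X)


# Hypothetical toy data: (price, engine power) with labels 1 = family car.
X = [(20.0, 110.0), (25.0, 130.0), (60.0, 300.0), (8.0, 60.0)]
r = [1, 1, 0, 0]

h = make_rectangle(p1=15.0, p2=30.0, e1=100.0, e2=150.0)
print(empirical_error(h, X, r))  # 0.0
```

A rectangle that is too wide, for example one covering all four cars, would misclassify the two negatives and score 0.5 on the same data.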

S, G, and the Version Space

Most Specific (S)

Tightest boundary still consistent with all positive training examples.

Most General (G)

Broadest boundary still consistent with the training set: the largest rectangle that contains every positive example and no negative example.

Version Space

All hypotheses $h \in \mathcal{H}$ between S and G with zero training error. Any $h$ in the version space is consistent with all available evidence. (Mitchell, 1997)
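
The most specific hypothesis S can be sketched as the bounding box of the positive examples. The helper names and toy data below are assumptions for illustration. Note that S is only consistent when no negative example happens to fall inside that box, which the second function checks.

```python
from typing import List, Tuple

Point = Tuple[float, float]               # (price, engine power)
Rect = Tuple[float, float, float, float]  # (p1, p2, e1, e2)


def most_specific(X: List[Point], r: List[int]) -> Rect:
    """Tightest axis-aligned rectangle containing every positive example."""
    positives = [x for x, label in zip(X, r) if label == 1]
    prices = [p for p, _ in positives]
    powers = [e for _, e in positives]
    return (min(prices), max(prices), min(powers), max(powers))


def is_consistent(rect: Rect, X: List[Point], r: List[int]) -> bool:
    """True if the rectangle labels every training example correctly."""
    p1, p2, e1, e2 = rect
    return all(
        int(p1 <= p <= p2 and e1 <= e <= e2) == label
        for (p, e), label in zip(X, r)
    )


# Hypothetical training set: three family cars and two others.
X = [(20.0, 110.0), (25.0, 130.0), (22.0, 120.0), (60.0, 300.0), (8.0, 60.0)]
r = [1, 1, 1, 0, 0]

S = most_specific(X, r)
print(S)                       # (20.0, 25.0, 110.0, 130.0)
print(is_consistent(S, X, r))  # True
```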

Margin

When many consistent hypotheses exist, choose the one whose decision boundary is farthest from every training point — the maximum-margin hypothesis. A larger margin means less sensitivity to small perturbations, leading to better generalization. This is the intuition behind Support Vector Machines.
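
To make the margin idea concrete, here is a deliberately simplified one-dimensional sketch (an assumption, not the rectangle setting used above): with a single feature and separable classes, the maximum-margin threshold is the midpoint between the largest negative and the smallest positive value. The function name and data are hypothetical.

```python
from typing import List, Tuple


def max_margin_threshold(negatives: List[float], positives: List[float]) -> Tuple[float, float]:
    """Return (threshold, margin) for 1-D data where negatives lie below positives."""
    lo = max(negatives)  # negative example closest to the boundary
    hi = min(positives)  # positive example closest to the boundary
    if lo >= hi:
        raise ValueError("classes are not separable by a single threshold")
    threshold = (lo + hi) / 2  # equidistant from both closest points
    margin = (hi - lo) / 2     # distance from threshold to the nearest example
    return threshold, margin


threshold, margin = max_margin_threshold([1.0, 2.0, 3.0], [7.0, 8.0, 9.5])
print(threshold, margin)  # 5.0 2.0
```

Any threshold between 3.0 and 7.0 is consistent with this data; the midpoint is the one that tolerates the largest perturbation of either closest point.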

06 · Learning Theory

VC Dimension

The VC dimension measures the capacity of a hypothesis class — how expressive is $\mathcal{H}$, and can it memorize any labeling of $N$ points?

Shattering

$\mathcal{H}$ shatters a set of $N$ points if, for every binary labeling of those points ($2^N$ possibilities), some $h \in \mathcal{H}$ achieves zero error.

Definition

$\text{VC}(\mathcal{H})$ = the largest $N$ such that $\mathcal{H}$ can shatter some set of $N$ points.

Classic Example

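The class of axis-aligned rectangles in $\mathbb{R}^2$ can shatter some sets of 4 points (for example, four points arranged in a diamond), though not every set of 4. No set of 5 points can be shattered: label the leftmost, rightmost, topmost, and bottommost points positive and a remaining point negative, and any rectangle containing the four extremes must also contain that fifth point. Therefore $\text{VC}(\text{axis-aligned rectangles}) = 4$.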
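
The shattering definition can be checked by brute force on small point sets. The sketch below relies on one observation: a labeling is achievable by an axis-aligned rectangle exactly when the bounding box of the positive points contains no negative point. The function names and point sets are illustrative assumptions, and testing one 5-point set does not by itself prove that no 5-point set can be shattered.

```python
from itertools import product
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def realizable(points: Sequence[Point], labels: Sequence[int]) -> bool:
    """Can some axis-aligned rectangle produce exactly this labeling?"""
    pos = [p for p, l in zip(points, labels) if l == 1]
    if not pos:
        return True  # a rectangle placed away from all points labels everything 0
    x_lo, x_hi = min(x for x, _ in pos), max(x for x, _ in pos)
    y_lo, y_hi = min(y for _, y in pos), max(y for _, y in pos)
    # The bounding box of the positives is the smallest candidate rectangle;
    # if it already swallows a negative point, every larger rectangle does too.
    return not any(
        x_lo <= x <= x_hi and y_lo <= y <= y_hi
        for (x, y), l in zip(points, labels)
        if l == 0
    )


def shatters(points: Sequence[Point]) -> bool:
    """True if all 2^N labelings of the points are realizable."""
    return all(realizable(points, labels) for labels in product([0, 1], repeat=len(points)))


diamond: List[Point] = [(0, 1), (1, 0), (0, -1), (-1, 0)]
corners: List[Point] = [(0, 0), (0, 1), (1, 0), (1, 1)]

print(shatters(diamond))             # True
print(shatters(corners))             # False
print(shatters(diamond + [(0, 0)]))  # False
```

The `corners` case shows why the definition says "some set": two opposite corners cannot be boxed without also boxing the other two.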

Fundamental Theorem of Statistical Learning

For binary classification, $\mathcal{H}$ is PAC-learnable if and only if its VC dimension is finite. This makes VC dimension the key measure of learnability.

07 · Learning Theory

PAC Learning

Probably Approximately Correct (PAC) learning asks: how many training examples do we need to learn reliably?

PAC-Learnability

$\mathcal{H}$ is PAC-learnable if a learner can, using a number of samples polynomial in $1/\varepsilon$ and $1/\delta$, output a hypothesis with error $\leq \varepsilon$ with probability $\geq 1 - \delta$, whatever the underlying data distribution.

PAC Sample Complexity — Blumer et al. (1989) $$m(\varepsilon, \delta) = O\!\left(\frac{1}{\varepsilon}\left(d\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right)$$

Reading the Bound

Higher VC dimension $d$ → more expressive class → more samples needed
Smaller $\varepsilon$ (tighter error tolerance) → more samples needed
Smaller $\delta$ (higher confidence required) → more samples needed

Concept Class vs. Hypothesis Class

Concept Class $\mathcal{C}$

The possible true labeling functions — what actually governs the world. (e.g., real consumer definition of "family car")

Hypothesis Class $\mathcal{H}$

Functions the learner can output — constrained by model architecture. (e.g., axis-aligned rectangles)

Realizable vs. Agnostic

Realizable PAC

$\mathcal{C} \subseteq \mathcal{H}$. The true concept lies inside the hypothesis class. Zero training error is achievable in principle.

Agnostic PAC

No assumption about $\mathcal{C}$ vs. $\mathcal{H}$. Goal: find $h \in \mathcal{H}$ competitive with the best possible hypothesis.

Geometric Intuition (Rectangle Case)

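Take the learned hypothesis to be the tightest rectangle $S$ around the positive examples. It lies inside the true rectangle $C$, so all of its errors fall in four strips between $S$ and $C$, one along each side. For the learned hypothesis to be $\varepsilon$-accurate, it suffices that:

Each error strip has probability mass at most $\varepsilon / 4$, so the total error is at most $\varepsilon$
Probability that all $N$ independent examples miss a given strip of mass $\varepsilon/4$: $(1 - \varepsilon/4)^N$
Union bound over all 4 strips: the failure probability is at most $4(1 - \varepsilon/4)^N$, which we require to be $\leq \delta$
Using $(1-x) \leq e^{-x}$: it suffices that $4e^{-\varepsilon N/4} \leq \delta$, i.e. $N \geq \tfrac{4}{\varepsilon}\log\tfrac{4}{\delta}$ (natural logarithm)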
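
The rectangle bound $N \geq \tfrac{4}{\varepsilon}\log\tfrac{4}{\delta}$ is easy to evaluate numerically. The sketch below uses the natural logarithm, rounds up to a whole number of examples, and checks the result against the original condition $4(1-\varepsilon/4)^N \leq \delta$. The function name and the chosen $\varepsilon$, $\delta$ values are illustrative.

```python
import math


def rectangle_sample_bound(epsilon: float, delta: float) -> int:
    """Smallest integer N with N >= (4 / epsilon) * ln(4 / delta)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((4 / epsilon) * math.log(4 / delta))


epsilon, delta = 0.1, 0.05
n = rectangle_sample_bound(epsilon, delta)
print(n)  # 176

# Sanity check against the condition the bound was derived from.
failure_probability = 4 * (1 - epsilon / 4) ** n
print(failure_probability <= delta)  # True
```

Halving $\varepsilon$ roughly doubles the required sample size, while tightening $\delta$ only adds a logarithmic amount.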

08 · Learning Theory

Agnostic Learning

Key Insight

In practice, the realizable assumption ($\mathcal{C} \subseteq \mathcal{H}$) almost never holds. ML is almost always operating in the agnostic setting.

Three Reasons ML Is Agnostic

❓

Unknown true function

We don't know the true labeling function, the true distribution, or whether labels are noise-free.

🌊

Messy real data

Label noise, measurement errors, missing features, unobserved confounders, non-stationarity, adversarial samples.

📏

Imperfect model classes

Linear models can't fit a nonlinear truth. Axis-aligned decision trees can only approximate smooth boundaries with staircase-like steps. Some approximation error almost always remains.

⭕

Example: circle vs. rectangles

True concept is a circle; $\mathcal{H}$ = axis-aligned rectangles. $\mathcal{C} \not\subseteq \mathcal{H}$ — realizable assumption fails.

Consequence

When the realizable assumption fails: zero training error may be unachievable, realizable PAC bounds don't apply, and the learner converges to the best approximation within $\mathcal{H}$ rather than to the true concept. The remaining gap between that best hypothesis and the true concept is called approximation error.

09 · Practical

Noise and Model Complexity

Sources of Noise

Labeling errors — annotators make mistakes; crowdsourced labels are inconsistent.
Measurement errors — sensors are imprecise; recorded values deviate from true values.
Latent factors — unmeasured variables influence the output; the model is always misspecified.

Occam's Razor — Prefer Simpler Models

When multiple consistent hypotheses fit the data equally well, prefer the simpler one:

⚡

Lower computational cost

Faster inference. Deployable on constrained devices. Cheaper to serve at scale.

🏋️

Easier to train

Fewer parameters → faster convergence, lower space complexity, less sensitivity to initialization.

🔍

More interpretable

Easier to inspect, explain, and audit — critical in high-stakes applications.

📈

Better generalization

Lower variance — less sensitive to the specific random training sample. Occam's Razor in action.

10 · Extensions

Multiple Classes & Regression

Multiple Classes

Binary classification extends to $K$ classes $\{C_1, \ldots, C_K\}$ using one-hot labels. Train one hypothesis per class (one-vs-rest):

Multi-class Setup $$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^{N},\quad r_i^t = \begin{cases} 1 & \mathbf{x}^t \in C_i \\ 0 & \mathbf{x}^t \in C_j,\; j \neq i \end{cases}$$
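
A small sketch of the one-hot labels in the multi-class setup. It assumes classes are given as integer indices `0 .. K-1` (the lecture writes them as $C_1, \ldots, C_K$); the function names are hypothetical.

```python
from typing import List


def one_hot(class_index: int, K: int) -> List[int]:
    """Label vector r with r_i = 1 for the true class and 0 elsewhere."""
    if not 0 <= class_index < K:
        raise ValueError("class_index must be in range(K)")
    return [1 if i == class_index else 0 for i in range(K)]


def one_vs_rest_targets(R: List[List[int]], i: int) -> List[int]:
    """Binary targets for the hypothesis h_i: column i of the one-hot matrix."""
    return [row[i] for row in R]


class_indices = [0, 2, 1, 2]  # made-up classes for four examples
K = 3
R = [one_hot(c, K) for c in class_indices]

print(R)                          # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
print(one_vs_rest_targets(R, 2))  # [0, 1, 0, 1]
```

Each column of `R` is an ordinary binary labeling, which is what lets one-vs-rest reuse a two-class learner $K$ times.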

Regression

When the output $r^t \in \mathbb{R}$ rather than a discrete label, the problem is regression. Assumed data model:

Regression Model $$r^t = f(\mathbf{x}^t) + \varepsilon,\qquad r^t \in \mathbb{R}$$

MSE Loss $$E(g \mid \mathcal{X}) = \frac{1}{N}\sum_{t=1}^{N}(r^t - g(x^t))^2$$

Linear: $g(x) = w_1 x + w_0$

Closed-form least-squares solution. A straight line for a single input, a hyperplane when there are several.

Polynomial: $g(x) = w_2 x^2 + w_1 x + w_0$

More flexible, but risks overfitting with limited data.
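
For the linear case with one input, the closed-form least-squares solution can be sketched in plain Python: $w_1$ is the covariance of $x$ and $r$ divided by the variance of $x$, and $w_0$ makes the line pass through the means. The data points are invented and the function names are assumptions.

```python
from typing import Callable, List, Tuple


def fit_line(xs: List[float], rs: List[float]) -> Tuple[float, float]:
    """Least-squares fit of g(x) = w1 * x + w0 for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
    w1 = sxr / sxx
    w0 = mean_r - w1 * mean_x
    return w1, w0


def mse(g: Callable[[float], float], xs: List[float], rs: List[float]) -> float:
    """Mean squared error E(g | X) of predictor g on the data."""
    return sum((r - g(x)) ** 2 for x, r in zip(xs, rs)) / len(xs)


# Made-up noisy samples of roughly r = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
rs = [1.1, 2.9, 5.2, 6.8]

w1, w0 = fit_line(xs, rs)
print(round(w1, 2), round(w0, 2))  # 1.94 1.09
print(mse(lambda x: w1 * x + w0, xs, rs))
```

By construction, no other straight line achieves a lower MSE on these four points, including the "true" line $2x + 1$ that the noise was added to.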

11 · Generalization

Model Selection & Generalization

Learning Is Ill-Posed

Data alone is insufficient to identify a unique solution — many consistent hypotheses may fit equally well. Without additional assumptions, learning is impossible.

Inductive Bias

The set of assumptions that allow the learner to select one hypothesis among many consistent ones. Every algorithm encodes inductive bias:

Restricting $\mathcal{H}$ to axis-aligned rectangles
Choosing the maximum-margin hypothesis (SVM bias)
Assuming linearity (linear regression bias)
Minimizing MSE as the loss function

Overfitting vs. Underfitting

Overfitting

$\mathcal{H}$ more complex than needed. Memorizes training noise. Low training error, high test error.

Underfitting

$\mathcal{H}$ too simple. Fails to capture real patterns. High training error and high test error.

The Triple Trade-Off — Dietterich, 2003

An unavoidable interplay between three factors. You cannot optimize all three simultaneously:

1 · Model Complexity

$c(\mathcal{H})$: expressiveness of the hypothesis class

2 · Training Set Size

$N$: labeled examples available

3 · Generalization Error

$E$: performance on unseen data. The ultimate measure.

The Rules

As $N \uparrow$, error $E \downarrow$. As complexity $\uparrow$, error first $\downarrow$ (escaping underfitting) then $\uparrow$ (overfitting). There is an optimal complexity — the sweet spot where the model is expressive enough without memorizing noise.

12 · Evaluation

Cross-Validation

To estimate generalization honestly, evaluate on data the model has never seen during training. A model evaluated only on its training data will usually look better than it really is, because it has already been fit to those exact examples.

📚

Training Set — 50%

Fits model parameters. The algorithm sees this data directly during optimization.

🔧

Validation Set — 25%

Model selection — choosing hyperparameters and architecture. Touched iteratively.

📋

Test Set — 25%

Touched exactly once at the end. Reports honest final performance. Never used for decisions.

🔄

k-Fold CV

When data is scarce: split the data into $k$ folds, then in each of $k$ rounds train on $k{-}1$ folds and validate on the remaining one. Every data point is used for validation exactly once.
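
A minimal sketch of how k-fold index splits can be generated. It assumes the examples have already been shuffled and that folds are contiguous index ranges; `k_fold_indices` is a hypothetical helper, not a library function.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, validation
        start += size


folds = list(k_fold_indices(n=10, k=3))
for train, validation in folds:
    print(len(train), validation)
# 6 [0, 1, 2, 3]
# 7 [4, 5, 6]
# 7 [7, 8, 9]
```

The generalization estimate is then the average validation error over the $k$ rounds.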

13 · Summary

Dimensions of a Supervised Learner

Every supervised learning algorithm — from linear regression to large neural networks — is fully characterized by three design choices:

📐

1 · The Model

The parametric family $g(\mathbf{x} \mid \theta)$. What shapes of functions can the learner represent?

📉

2 · The Loss Function

How is error measured?
$E(\theta|\mathcal{X}) = \sum_t L(r^t, g(\mathbf{x}^t|\theta))$

⚙️

3 · The Optimizer

How is the best $\theta$ found?
$\theta^* = \arg\min_\theta\; E(\theta|\mathcal{X})$

💡

The Universal Template

When encountering any new ML algorithm, ask: what is the model, the loss, and the optimization procedure?
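
The three design choices can be made explicit in a few lines. The sketch below picks one concrete instance as an assumption: a one-input linear model, squared loss, and plain batch gradient descent as the optimizer. The learning rate, step count, and data are illustrative values, not recommendations.

```python
from typing import List, Tuple

Theta = Tuple[float, float]  # (w1, w0)


def model(x: float, theta: Theta) -> float:
    """1. The model: g(x | theta) = w1 * x + w0."""
    w1, w0 = theta
    return w1 * x + w0


def loss(r: float, prediction: float) -> float:
    """2. The loss: squared difference between label and prediction."""
    return (r - prediction) ** 2


def total_error(theta: Theta, xs: List[float], rs: List[float]) -> float:
    """E(theta | X): sum of per-example losses."""
    return sum(loss(r, model(x, theta)) for x, r in zip(xs, rs))


def gradient_descent(xs: List[float], rs: List[float],
                     lr: float = 0.02, steps: int = 2000) -> Theta:
    """3. The optimizer: follow the negative gradient of E from theta = (0, 0)."""
    w1, w0 = 0.0, 0.0
    for _ in range(steps):
        residuals = [r - (w1 * x + w0) for x, r in zip(xs, rs)]
        grad_w1 = sum(-2 * e * x for e, x in zip(residuals, xs))
        grad_w0 = sum(-2 * e for e in residuals)
        w1 -= lr * grad_w1
        w0 -= lr * grad_w0
    return w1, w0


# Made-up noise-free data generated from r = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
rs = [1.0, 3.0, 5.0, 7.0]

theta_star = gradient_descent(xs, rs)
print(round(theta_star[0], 4), round(theta_star[1], 4))  # 2.0 1.0
```

Swapping any one of the three functions (a polynomial model, an absolute-error loss, a closed-form solver) yields a different learner without touching the other two.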

Next Lecture — 20 Feb

Bayesian Decision Theory (Ch. 3). Moving from empirical error minimization to a probabilistic framework — given a prior over classes and a likelihood model, what is the optimal decision boundary?
