Lecture 3 · Chapter 4 · 27 February

Parametric Methods

We know how to make decisions using probabilities. Now we learn how to estimate those probabilities from data — through Maximum Likelihood, Bayes' estimator, MAP, and their connections to the loss functions we minimize during training.

Builds on: Bayesian Decision Theory (Lecture 2)
Chapter: 4 (Alpaydin)
Date: 27 February

00 · Context

Parametric vs. Non-Parametric Methods

Before estimating, we choose a model family. This choice has major consequences for sample efficiency, flexibility, and computational cost.

Parametric

Assumes data follows a fixed-form distribution with a fixed number of parameters, regardless of dataset size. Compact, fast, but potentially underfitting.

Non-Parametric

No fixed-form assumption. Model complexity can grow with the data — more data, more complex model. Flexible, but data-hungry and slower.

| Property | Parametric | Non-Parametric |
|---|---|---|
| Model size | Fixed | Grows with data |
| Assumptions | Strong (e.g., Gaussian, linear) | Minimal (data speaks for itself) |
| Training speed | Fast | Slower |
| Data required | Works with less | Needs more |
| Risk | Underfitting if wrong form | Overfitting if insufficient data |
| Examples | Linear regression, Naïve Bayes, HMMs | k-NN, Decision Trees, SVMs (nonlinear) |

01 · Estimation

Maximum Likelihood Estimation

We assume training data $\mathcal{X} = \{x^t\}_{t=1}^N$ are drawn independently and identically distributed (i.i.d.) from some distribution $p(x \mid \theta)$. The goal: find the parameters $\theta$ that make the observed data most probable.

Likelihood

The likelihood of $\theta$ given the sample $\mathcal{X}$ is the probability of observing $\mathcal{X}$ under parameter $\theta$. Under i.i.d., this factorizes:

Likelihood and Log-Likelihood $$\ell(\theta \mid \mathcal{X}) = p(\mathcal{X} \mid \theta) = \prod_t p(x^t \mid \theta)$$ $$L(\theta \mid \mathcal{X}) = \log\, \ell(\theta \mid \mathcal{X}) = \sum_t \log\, p(x^t \mid \theta)$$

We maximize the log-likelihood (equivalent to maximizing likelihood, but computationally better — products become sums, avoiding underflow):

MLE $$\theta^* = \arg\max_\theta\; L(\theta \mid \mathcal{X})$$
Why Log-Likelihood?

Taking the log converts a product of many small probabilities into a sum, preventing numerical underflow. The maximizer is unchanged since log is monotone increasing. Derivatives of sums are far simpler than derivatives of products.
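
The underflow argument can be checked directly. The following is a minimal sketch, assuming 2000 i.i.d. observations that each have probability 0.5 under the model (illustrative values, not from the lecture):

```python
import math

# 2000 i.i.d. observations, each with probability 0.5 under the model (illustrative).
probs = [0.5] * 2000

# Direct product: underflows to exactly 0.0 in double precision.
likelihood = math.prod(probs)

# Sum of logs: stays finite and can be differentiated term by term.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)       # 0.0
print(log_likelihood)   # about -1386.29
```

The product carries no usable information once it hits zero, while the log-likelihood can still be compared across parameter values.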

02 · Examples

MLE for Common Distributions

Bernoulli (Binary Outcomes)

$x \in \{0, 1\}$, parameter $p$ (probability of success):

Bernoulli MLE $$P(x) = p^x (1-p)^{1-x} \qquad \Longrightarrow \qquad \hat{p}_{\text{MLE}} = \frac{\sum_t x^t}{N}$$

The MLE for the Bernoulli parameter is simply the empirical frequency — the proportion of successes in the sample.
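
As a sanity check on the closed form, a brute-force search over candidate values of $p$ should land on the same number. This is a small sketch on a made-up sample; the names `x`, `grid`, and `best_p` are illustrative only:

```python
import math

# Hypothetical sample of binary outcomes (5 successes out of 8).
x = [1, 0, 1, 1, 0, 1, 0, 1]
N = len(x)
successes = sum(x)

# Closed-form MLE: the empirical frequency of successes.
p_hat = successes / N

def log_likelihood(p):
    """Bernoulli log-likelihood: sum_t [x log p + (1 - x) log(1 - p)]."""
    return successes * math.log(p) + (N - successes) * math.log(1 - p)

# Brute-force check over a grid of candidate values in (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
best_p = max(grid, key=log_likelihood)

print(p_hat, best_p)  # both 0.625
```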

Multinomial (K Outcomes)

$K > 2$ mutually exclusive, exhaustive states. Each $x_i \in \{0,1\}$, $\sum_i x_i = 1$:

Multinomial MLE $$P(x_1,\ldots,x_K) = \prod_i p_i^{x_i} \qquad \Longrightarrow \qquad \hat{p}_i = \frac{\sum_t x_i^t}{N}$$

Again: the empirical class frequencies. The MLE is the obvious frequency estimate in both cases.

Gaussian (Normal) Distribution

$p(x) = \mathcal{N}(\mu, \sigma^2)$. Taking partial derivatives of $L$ with respect to $\mu$ and $\sigma^2$ and setting to zero yields:

Gaussian MLE $$\hat{\mu}_{\text{MLE}} = \frac{1}{N}\sum_{t=1}^N x^t \qquad \hat{\sigma}^2_{\text{MLE}} = \frac{1}{N}\sum_{t=1}^N (x^t - \hat{\mu})^2$$
Biased Variance Estimate

The MLE of the variance divides by $N$, which makes it biased: $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{N-1}{N}\sigma^2$, so it systematically underestimates the true variance. The unbiased estimator divides by $N-1$ instead (Bessel's correction). For large $N$, the difference is negligible.
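
Python's standard library happens to expose both conventions, which makes the difference easy to see. A short sketch on a made-up sample:

```python
import statistics

# Hypothetical sample.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(x)

mu_mle = statistics.fmean(x)            # sample mean
var_mle = statistics.pvariance(x)       # divides by N (the MLE)
var_unbiased = statistics.variance(x)   # divides by N - 1 (Bessel's correction)

print(mu_mle)        # 5.0
print(var_mle)       # 4.0
print(var_unbiased)  # about 4.571
```

The two variance estimates differ by the factor $(N-1)/N$, here $7/8$.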

03 · Estimator Quality

Bias and Variance of Estimators

An estimator $d(\mathcal{X})$ is itself a random variable — it varies across different training samples. We measure its quality through two components:

🎯

Bias

$b(d) = \mathbb{E}[d] - \theta$

How far is the expected estimate from the true value? A biased estimator is systematically wrong in one direction.

📊

Variance

$\mathbb{E}[(d - \mathbb{E}[d])^2]$

How spread out are the estimates across different samples? High variance = sensitive to the specific training set used.

Mean Squared Error Decomposition

The total estimation error decomposes cleanly into bias and variance:

MSE = Bias² + Variance $$r(d, \theta) = \mathbb{E}[(d - \theta)^2] = \underbrace{(\mathbb{E}[d] - \theta)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(d - \mathbb{E}[d])^2]}_{\text{Variance}}$$
Bias–Variance Trade-off

Simpler estimators tend to have high bias but low variance. Complex estimators have low bias but high variance. The ideal estimator minimizes total MSE — which requires balancing both terms.
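
The decomposition can be observed in simulation. The sketch below assumes Gaussian data with true variance 4 and applies the $\frac{1}{N}$ variance estimator to many small samples; the sample size, trial count, and seed are arbitrary choices for illustration:

```python
import random

random.seed(0)
true_var = 4.0         # theta: the quantity being estimated (sigma = 2)
N, trials = 5, 20000   # a small N makes the bias visible

# Draw many training samples and apply the 1/N variance estimator to each.
estimates = []
for _ in range(trials):
    sample = [random.gauss(0.0, 2.0) for _ in range(N)]
    m = sum(sample) / N
    estimates.append(sum((v - m) ** 2 for v in sample) / N)

mean_d = sum(estimates) / trials
bias = mean_d - true_var
variance = sum((d - mean_d) ** 2 for d in estimates) / trials
mse = sum((d - true_var) ** 2 for d in estimates) / trials

print(bias)                        # near -true_var / N = -0.8
print(mse, bias ** 2 + variance)   # equal up to rounding
```

The bias comes out negative, matching the claim that the $\frac{1}{N}$ estimator underestimates the variance.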

04 · Deep Connection

ERM = Maximum Likelihood Estimation

Here is a profound insight that unifies the statistical and optimization views of machine learning:

Key Result

Training a model by minimizing a loss function (Empirical Risk Minimization) is mathematically equivalent to estimating parameters via Maximum Likelihood — when the loss corresponds to a negative log-likelihood.

ERM Objective $$\hat{\theta} = \arg\min_\theta \sum_{i=1}^n \ell\bigl(f_\theta(x_i),\, y_i\bigr)$$

Squared Loss ↔ Gaussian Noise

Assume output noise is Gaussian: $y = f_\theta(x) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Then:

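Gaussian Likelihood → Squared Loss $$p(y \mid x, \theta) = \mathcal{N}(y \mid f_\theta(x), \sigma^2) \implies -\log p(y \mid x, \theta) = \frac{(y - f_\theta(x))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$$

The second term does not depend on $\theta$, and $\sigma^2$ only rescales the first, so maximizing the Gaussian likelihood over $\theta$ is the same as minimizing squared error. Linear regression with MSE loss implicitly assumes Gaussian noise.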
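
A quick numerical check of this equivalence, as a sketch: it assumes a one-parameter model $f_w(x) = wx$, a made-up dataset, and an arbitrary noise variance `sigma2`, then compares the minimizers of the two objectives over a grid.

```python
import math

# Hypothetical 1-D data and a one-parameter model f_w(x) = w * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
sigma2 = 0.5  # assumed noise variance; any positive value gives the same argmin

def sse(w):
    """Sum of squared errors."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def gaussian_nll(w):
    """Negative log-likelihood: -sum_t log N(y_t | w * x_t, sigma2)."""
    const = 0.5 * math.log(2 * math.pi * sigma2)
    return sum((y - w * x) ** 2 / (2 * sigma2) + const for x, y in zip(xs, ys))

grid = [k / 100 for k in range(100, 301)]   # candidate w in [1.00, 3.00]
w_sse = min(grid, key=sse)
w_nll = min(grid, key=gaussian_nll)

print(w_sse, w_nll)  # the same grid point
```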

Cross-Entropy Loss ↔ Bernoulli / Categorical Likelihood

For binary classification, assume $y \sim \text{Bernoulli}(f_\theta(x))$. Cross-entropy loss is the negative log-likelihood of this model:

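Bernoulli Likelihood → Cross-Entropy Loss $$-\log p(y_i \mid x_i, \theta) = -\bigl[y_i \log f_\theta(x_i) + (1 - y_i)\log\bigl(1 - f_\theta(x_i)\bigr)\bigr] = -\log Q(y_i) = \mathcal{L}_{\text{CE}}$$

where $Q$ is the model's predicted distribution over labels, so $Q(1) = f_\theta(x_i)$ and $Q(0) = 1 - f_\theta(x_i)$.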

A classifier that assigns high probability to the correct class achieves both high likelihood and low cross-entropy. Cross-entropy penalizes confident wrong predictions very heavily (due to the log).
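
To see how steeply the log penalizes confident mistakes, here is a small sketch. The function name `binary_cross_entropy` and the probabilities are illustrative choices:

```python
import math

def binary_cross_entropy(y, q):
    """Negative Bernoulli log-likelihood of label y in {0, 1} when the model predicts P(y = 1) = q."""
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

# The true label is 1 in all three cases; only the predicted probability changes.
print(binary_cross_entropy(1, 0.9))    # about 0.105 (confident and right)
print(binary_cross_entropy(1, 0.5))    # about 0.693 (uncertain)
print(binary_cross_entropy(1, 0.01))   # about 4.605 (confident and wrong)
```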

The Takeaway

A trained model's parameters are statistical estimates of the underlying data-generating mechanism. Every time you train with a specific loss, you implicitly assume a noise model for your data.

05 · Bayesian Estimation

Bayes' Estimator

MLE treats $\theta$ as a fixed unknown. Bayesian estimation treats $\theta$ as a random variable with its own probability distribution — encoding prior knowledge about plausible values.

Prior and Posterior

We specify a prior distribution $p(\theta)$ representing our beliefs about $\theta$ before seeing data. After observing $\mathcal{X}$, Bayes' rule gives us the posterior:

Posterior via Bayes' Rule $$p(\theta \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \theta)\, p(\theta)}{p(\mathcal{X})} = \frac{p(\mathcal{X} \mid \theta)\, p(\theta)}{\int p(\mathcal{X} \mid \theta')\, p(\theta')\, d\theta'}$$

Prediction for a New Point

Instead of committing to a single $\theta$, we integrate over all possible values, weighted by their posterior probability:

Bayesian Prediction (Marginalization) $$p(x \mid \mathcal{X}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{X})\, d\theta$$

For regression: $y = g(x \mid \mathcal{X}) = \int g(x \mid \theta)\, p(\theta \mid \mathcal{X})\, d\theta$. The prediction is an average over all models, each weighted by its posterior probability (how well it fits the data, combined with the prior).

Computational Challenge

Computing this integral is often intractable for complex posteriors. In practice, we either use conjugate priors (which yield closed-form posteriors), variational approximations, or MCMC sampling.
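
One case where the integral has a closed form is a Bernoulli likelihood with a Beta prior. The sketch below assumes a Beta(2, 2) prior and a made-up sample; the prior choice is purely illustrative and not from the lecture:

```python
# Beta(a, b) prior on the Bernoulli parameter p (assumed for illustration).
a, b = 2.0, 2.0

# Hypothetical sample: 5 successes, 3 failures.
x = [1, 0, 1, 1, 0, 1, 0, 1]
s = sum(x)
f = len(x) - s

# Conjugacy: the posterior is again a Beta, with the counts added to the prior parameters.
a_post, b_post = a + s, b + f

# Posterior predictive P(x_new = 1 | X): the integral of p * Beta(p | a_post, b_post),
# which equals the posterior mean of p.
p_next = a_post / (a_post + b_post)

print(a_post, b_post)  # 7.0 5.0
print(p_next)          # about 0.583
```

No numerical integration is needed here because the Beta family is conjugate to the Bernoulli likelihood.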

06 · Point Estimates

MAP Estimation

When the full posterior integral is too costly, we can collapse the posterior to a single point. Two natural choices:

🏔️

MAP — Maximum A Posteriori

The mode (peak) of the posterior: $\theta_{\text{MAP}} = \arg\max_\theta\; p(\theta \mid \mathcal{X})$

Fast to compute. Good when the posterior has a sharp, well-defined peak.

⚖️

Bayes' Estimator — Posterior Mean

The expected value: $\theta_{\text{Bayes}} = \mathbb{E}[\theta \mid \mathcal{X}] = \int \theta\, p(\theta \mid \mathcal{X})\, d\theta$

Minimizes expected squared error. Better when the posterior is asymmetric.

MAP as Regularized MLE $$\theta_{\text{MAP}} = \arg\max_\theta \underbrace{\log p(\mathcal{X} \mid \theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}}$$
MAP = MLE + Regularization

The log-prior acts as a regularizer. A Gaussian prior on $\theta$ leads to L2 (ridge) regularization. A Laplace prior leads to L1 (lasso) regularization. Regularization is Bayesian reasoning in disguise.

Special Case

When the prior is uniform (no preference over $\theta$), MAP = MLE. The prior adds no information, so data alone determines the estimate.
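
The relationship between MLE, MAP, and the posterior mean can be made concrete in the Bernoulli case. This sketch assumes a Beta(a, b) prior (an illustrative choice), for which all three have closed forms; the helper name `bernoulli_estimates` is hypothetical:

```python
def bernoulli_estimates(s, n, a, b):
    """MLE, MAP and posterior mean of p for s successes in n trials under a Beta(a, b) prior."""
    mle = s / n
    # Mode of the Beta(a + s, b + n - s) posterior, valid when both parameters exceed 1.
    map_ = (s + a - 1) / (n + a + b - 2)
    post_mean = (s + a) / (n + a + b)
    return mle, map_, post_mean

# Hypothetical data: 5 successes in 8 trials.
print(bernoulli_estimates(5, 8, a=1, b=1))  # uniform prior: MAP equals MLE (0.625), mean is 0.6
print(bernoulli_estimates(5, 8, a=5, b=5))  # prior centred on 0.5 pulls MAP and mean toward 0.5
```

With the uniform Beta(1, 1) prior the MAP estimate coincides with the MLE, and a stronger prior acts like a regularizer that shrinks the estimate toward the prior's centre.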

07 · Summary

Comparing the Three Estimators

| Estimator | Formula | When to use |
|---|---|---|
| ML | $\arg\max_\theta\; p(\mathcal{X} \mid \theta)$ | No prior knowledge. Large samples. Computationally simple. |
| MAP | $\arg\max_\theta\; p(\theta \mid \mathcal{X})$ | Have prior knowledge. Want regularization. Need a point estimate. |
| Bayes' (mean) | $\mathbb{E}[\theta \mid \mathcal{X}]$ | Need a point estimate that minimizes expected squared error. Posterior is asymmetric or broad. |
| Full Bayesian | $p(\theta \mid \mathcal{X})$ (entire distribution) | Full uncertainty quantification needed. Computationally expensive. |

Gaussian Posterior — MAP = Bayes' Mean

When the posterior is Gaussian (symmetric, unimodal), the mode and mean coincide: $\theta_{\text{MAP}} = \theta_{\text{Bayes}}$. This happens with Gaussian priors on Gaussian likelihoods — a key reason Gaussian models are analytically tractable.

08 · Classification

Parametric Classification

To classify using Bayes' rule, we need $P(C_i)$ and $p(\mathbf{x} \mid C_i)$. We model the class-conditional densities as Gaussians and estimate parameters from the training data.

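Discriminant Function (Log-Posterior) $$g_i(x) = \log p(x \mid C_i) + \log P(C_i) = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log\sigma_i^2 - \frac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i)$$

Here $x$ is one-dimensional. The term $-\frac{1}{2}\log 2\pi$ is the same for every class, so it can be dropped when comparing discriminants.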

Equal Variances → Linear Boundary

If $\sigma_1 = \sigma_2$ (same variance per class), the quadratic terms cancel and we get a single linear boundary at the midpoint between means (shifted by the prior ratio).

Nearest Mean Classifier

If all priors $P(C_i)$ are equal AND all variances are equal, the discriminant reduces to: assign $\mathbf{x}$ to the class with the nearest mean. This is the simplest possible parametric classifier.

Different Variances → Quadratic Boundary

If $\sigma_1 \neq \sigma_2$, the quadratic terms do not cancel, yielding a quadratic decision boundary with two intersection points between the two Gaussian curves.

Equal variances

One linear boundary. Halfway between means (adjusted by prior). Simpler, less flexible.

Different variances

Two boundaries (quadratic). The narrower Gaussian can dominate in its peak region. More flexible.
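
Both cases can be reproduced by solving $g_1(x) = g_2(x)$ for two one-dimensional Gaussian classes. This is a sketch with made-up means, variances, and priors; the function names are illustrative, and the equal-variance branch assumes distinct means:

```python
import math

def discriminant(x, mu, var, prior):
    """g_i(x) = -0.5 log(var) - (x - mu)^2 / (2 var) + log(prior), class-independent constant dropped."""
    return -0.5 * math.log(var) - (x - mu) ** 2 / (2 * var) + math.log(prior)

def boundaries(mu1, var1, p1, mu2, var2, p2):
    """Solve g_1(x) = g_2(x), i.e. a x^2 + b x + c = 0, for the decision boundary points."""
    a = -1 / (2 * var1) + 1 / (2 * var2)
    b = mu1 / var1 - mu2 / var2
    c = (-mu1 ** 2 / (2 * var1) + mu2 ** 2 / (2 * var2)
         - 0.5 * math.log(var1) + 0.5 * math.log(var2)
         + math.log(p1) - math.log(p2))
    if abs(a) < 1e-12:           # equal variances: the quadratic term cancels
        return [-c / b]          # assumes distinct means, so b != 0
    disc = b * b - 4 * a * c
    if disc < 0:
        return []                # one class wins everywhere
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

print(boundaries(0.0, 1.0, 0.5, 4.0, 1.0, 0.5))  # [2.0]: midpoint between the means
print(boundaries(0.0, 1.0, 0.5, 3.0, 4.0, 0.5))  # two boundary points
```

In the second call the narrower class (variance 1) wins only between the two boundary points, and the wider class wins in both tails.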

09 · Regression

Linear Regression via MLE

We model the relationship between input $x$ and continuous output $r$ as:

Regression Data Model $$r^t = g(x^t \mid \mathbf{w}) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Assuming Gaussian noise, the MLE for the weights $\mathbf{w}$ is obtained by maximizing the log-likelihood, which reduces to minimizing the sum of squared residuals:

Least Squares — From MLE $$\mathbf{w}^* = \arg\min_\mathbf{w} \sum_{t=1}^N (r^t - g(x^t \mid \mathbf{w}))^2$$

Closed-Form Solution

Taking partial derivatives and setting to zero, the optimal weights satisfy the normal equations:

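Normal Equations: $\mathbf{A}^T\mathbf{A}\,\mathbf{w} = \mathbf{A}^T\mathbf{y}$ $$\mathbf{w}^* = (\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T \mathbf{y}$$

where $\mathbf{A}$ is the design matrix (each row is an input vector, with a leading 1 for the intercept) and $\mathbf{y}$ is the vector of target values $r^t$. The inverse exists when the columns of $\mathbf{A}$ are linearly independent.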

Gaussian Noise → MSE is Optimal

The derivation shows that minimizing MSE follows directly from the assumption that residuals are normally distributed. If the noise is not Gaussian, MSE may no longer be the right loss function.
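
A minimal NumPy sketch of the closed-form solution, assuming noise-free made-up data from $r = 1 + 2x$ so the correct answer is known in advance. It solves the linear system rather than forming the inverse explicitly, which is the usual numerical practice:

```python
import numpy as np

# Hypothetical noise-free data from r = 1 + 2x, so the exact answer is known.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix: a column of ones (for the intercept w_0) next to the inputs.
A = np.column_stack([np.ones_like(x), x])

# Normal equations A^T A w = A^T y, solved without forming an explicit inverse.
w = np.linalg.solve(A.T @ A, A.T @ y)

# Cross-check with NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(w)  # approximately [1. 2.]
```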

10 · Regression

Polynomial Regression

We extend linear regression by adding polynomial features. Though the function is nonlinear in $x$, it is linear in the parameters — so the same least-squares machinery applies.

Polynomial Model (Still Linear in $\mathbf{w}$) $$g(x) = w_K x^K + w_{K-1} x^{K-1} + \cdots + w_1 x + w_0 = \mathbf{w}^T \boldsymbol{\phi}(x)$$

where $\boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^K]^T$ is the feature map. The design matrix $\mathbf{A}$ has rows $\boldsymbol{\phi}(x^t)^T$, and the solution is again $\mathbf{w}^* = (\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T \mathbf{y}$.

Coefficient Magnitudes Signal Overfitting

As polynomial degree increases, the MLE coefficients tend to grow in magnitude — the polynomial oscillates wildly to pass through every training point. This is a key symptom of overfitting. Regularization controls this.
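
The growth in coefficient size is easy to reproduce. The sketch below assumes ten points from $\sin(2\pi x)$ with small alternating perturbations standing in for noise (a deterministic, made-up dataset) and uses `np.linalg.lstsq` rather than the explicit inverse for numerical stability:

```python
import numpy as np

# Hypothetical data: 10 points from sin(2*pi*x) with small alternating perturbations
# standing in for noise (deterministic, so the run is reproducible).
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * (-1.0) ** np.arange(10)

max_abs_coef = {}
for K in (1, 3, 9):
    A = np.vander(x, K + 1, increasing=True)     # rows are [1, x, x^2, ..., x^K]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares fit
    max_abs_coef[K] = float(np.max(np.abs(w)))

for K, m in max_abs_coef.items():
    print(K, m)   # the largest coefficient grows sharply with the degree
```

With ten points, the degree-9 polynomial interpolates the data exactly, perturbations included, which is what drives the coefficients up.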

11 · Evaluation

Regression Error Metrics

Different error functions encode different assumptions about what kinds of mistakes matter most:

  • Squared Error $\sum_t (r^t - g(x^t))^2$ — penalizes large errors quadratically. Follows from Gaussian noise assumption. Used to compute $R^2$.
  • Absolute Error $\sum_t |r^t - g(x^t)|$ — robust to outliers. Follows from Laplace noise assumption. Penalizes all errors linearly.
  • Relative Squared Error $\frac{\sum_t (r^t - g(x^t))^2}{\sum_t (r^t - \bar{r})^2}$: squared error normalized by the total variation of $r$ around its mean. Close to 0: model explains the output. Close to 1: model adds no value over just predicting the mean.
  • $\varepsilon$-sensitive Error — ignore errors smaller than $\varepsilon$; only penalize errors that exceed the threshold. Used in Support Vector Regression.
Coefficient of Determination (R²) $$R^2 = 1 - E_{\text{RSE}} = 1 - \frac{\sum_t (r^t - g(x^t))^2}{\sum_t (r^t - \bar{r})^2}$$
Interpreting R²

$R^2 \approx 1$: excellent fit — the model explains almost all variance. $R^2 \approx 0$: the model is no better than predicting the mean. $R^2 < 0$: the model is worse than predicting the mean (possible with test data).
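
The three regimes of $R^2$ can be reproduced on a toy example. This sketch uses made-up targets and predictions; the $\varepsilon$-sensitive error is written in its linear form $\sum_t \max(0, |r^t - g(x^t)| - \varepsilon)$, which is an assumption, since the lecture does not spell out the formula:

```python
def r_squared(r, g):
    """R^2 = 1 - sum (r - g)^2 / sum (r - mean(r))^2."""
    r_bar = sum(r) / len(r)
    sse = sum((rt - gt) ** 2 for rt, gt in zip(r, g))
    sst = sum((rt - r_bar) ** 2 for rt in r)
    return 1.0 - sse / sst

def eps_insensitive(r, g, eps):
    """Sum of max(0, |r - g| - eps): errors inside the eps tube cost nothing."""
    return sum(max(0.0, abs(rt - gt) - eps) for rt, gt in zip(r, g))

r = [1.0, 2.0, 3.0, 4.0, 5.0]                     # hypothetical targets
print(r_squared(r, r))                            # 1.0: perfect fit
print(r_squared(r, [3.0] * 5))                    # 0.0: always predicting the mean
print(r_squared(r, [5.0, 4.0, 3.0, 2.0, 1.0]))    # -3.0: worse than the mean
print(eps_insensitive(r, [1.2, 2.0, 3.0, 4.0, 6.0], eps=0.5))  # 0.5
```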

12 · Generalization

Bias–Variance in Regression

The expected squared error of a regression estimator $g$ at an input $x$ decomposes into three components, of which only the noise term is irreducible:

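Bias–Variance Decomposition (Geman et al., 1992) $$\mathbb{E}[(r - g(x))^2 \mid x] = \underbrace{(\mathbb{E}[g(x)] - f(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(g(x) - \mathbb{E}[g(x)])^2]}_{\text{Variance}} + \underbrace{\text{Var}(r \mid x)}_{\text{Irreducible noise}}$$

where $f(x) = \mathbb{E}[r \mid x]$ is the true regression function and the expectations over $g$ are taken over training samples $\mathcal{X}$.
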
🎯

Bias²

Systematic error from wrong assumptions. A constant predictor such as $g(x) = 2$ ignores the data entirely: it has zero variance but, unless the true function happens to be that constant, a large bias.

📉

Variance

Sensitivity to the specific training sample. A high-degree polynomial interpolates training data but varies wildly with different samples.

🌊

Irreducible noise

The inherent randomness in $r$ that no model can explain. The noise variance $\text{Var}(r \mid x)$ does not depend on $g$ or $\mathcal{X}$.

The Dilemma

As model complexity increases: bias decreases (better fit to true function) but variance increases (more sensitive to training data). The optimal model minimizes total error = Bias² + Variance + Noise. This is the fundamental tension in statistical learning.

Estimating with Cross-Validation

We cannot compute bias directly (we don't know the true function $f$). But we can estimate total generalization error:

  • Split data into training and validation sets
  • Fit models of increasing complexity on training set
  • Training error does not increase as complexity grows (for nested model families)
  • Validation error first decreases, then increases; its minimum marks the optimal complexity
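
The steps above can be sketched with a single hold-out split (a simplification of full k-fold cross-validation). The data, noise level, seed, and range of polynomial degrees are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)   # hypothetical noisy data

# Hold-out split: even-indexed points for training, odd-indexed for validation.
x_tr, y_tr, x_va, y_va = x[0::2], y[0::2], x[1::2], y[1::2]

degrees = range(1, 7)
train_err, val_err = [], []
for K in degrees:
    A_tr = np.vander(x_tr, K + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    A_va = np.vander(x_va, K + 1, increasing=True)
    train_err.append(float(np.mean((A_tr @ w - y_tr) ** 2)))
    val_err.append(float(np.mean((A_va @ w - y_va) ** 2)))

best_degree = degrees[int(np.argmin(val_err))]
print(train_err)     # non-increasing in the degree
print(best_degree)   # degree with the lowest validation error
```

Which degree wins depends on the particular sample and split, which is why averaging over several folds is usually preferred to a single split.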
13 · Model Selection

Model Selection Strategies

How do we choose the right model complexity? Several principled strategies exist:

🔄

Cross-Validation

Measure generalization accuracy by testing on data unused during training. Computationally expensive but model-agnostic and reliable.

⚖️

Regularization

Penalize model complexity directly: $E' = \text{error on data} + \lambda \cdot \text{model complexity}$. AIC and BIC are well-known criteria of this form, each adding a penalty on the number of parameters to the negative log-likelihood.

📦

Minimum Description Length

Choose the model that gives the shortest total description of data + model. Grounded in Kolmogorov complexity (Occam's Razor formalized).

🏗️

Structural Risk Minimization

Minimize an upper bound on generalization error that includes both empirical error and a term based on VC dimension.

Regularization in Practice

As polynomial degree increases, coefficient magnitudes blow up. Adding a penalty on coefficient size controls this:

Regularized Objective (L2 / Ridge) $$E'(\mathbf{w}) = \sum_t (r^t - g(x^t \mid \mathbf{w}))^2 + \lambda \|\mathbf{w}\|^2$$
  • L2 regularization (Ridge): penalty $\lambda \|\mathbf{w}\|_2^2$. Shrinks coefficients toward zero. Equivalent to a zero-mean Gaussian prior (MAP). Has a unique solution for any $\lambda > 0$.
  • L1 regularization (Lasso): penalty $\lambda \|\mathbf{w}\|_1$. Drives many coefficients to exactly zero (sparse solutions). Equivalent to a Laplace prior.
  • $\lambda$ tuning: $\lambda = 0$ gives no regularization (pure MLE). As $\lambda \to \infty$, all coefficients are forced to zero. The optimal $\lambda$ is found via cross-validation.
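
For the L2 case the minimizer has the closed form $(\mathbf{A}^T\mathbf{A} + \lambda \mathbf{I})\,\mathbf{w} = \mathbf{A}^T\mathbf{y}$, which the sketch below uses to show the shrinkage. The data are made up, and for simplicity the intercept is penalized along with the other coefficients (implementations often exclude it):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 12)
y = np.sin(2 * np.pi * x) + 0.1 * (-1.0) ** np.arange(12)   # hypothetical data
A = np.vander(x, 6, increasing=True)                        # degree-5 polynomial features

def ridge(A, y, lam):
    """Minimizer of ||y - A w||^2 + lam * ||w||^2, from (A^T A + lam I) w = A^T y."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

lambdas = [0.0, 1e-3, 1e-1, 10.0]
norms = [float(np.linalg.norm(ridge(A, y, lam))) for lam in lambdas]
print(norms)   # shrinks as lambda grows
```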

Bayesian Model Selection

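In the Bayesian framework, place a prior over models $p(\text{model})$ and compute the posterior $p(\text{model} \mid \text{data}) \propto p(\text{data} \mid \text{model})\, p(\text{model})$. The marginal likelihood $p(\text{data} \mid \text{model})$ averages the fit over all parameter values, so a complex model that spreads its probability over many possible datasets is penalized automatically (Occam's Razor emerges from the math). Simpler models that fit the data well receive higher posterior probability.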

14 · Generalization Theory

Stability

Definition

An ML algorithm is stable if small perturbations to the training set — adding, removing, or changing one sample — cause only small changes in the learned model or its predictions.

High Stability

Simple, rigid models. Low sensitivity to data. Low variance. Higher bias. Risk: underfitting.

Low Stability

Complex, flexible models. High sensitivity to data. High variance. Lower bias. Risk: overfitting.

Stability → Generalization Bound

If an algorithm has uniform stability $\beta_n$ on training sets of size $n$, its expected generalization gap is at most $\beta_n$. The bound is useful when $\beta_n \to 0$ as $n \to \infty$:

Stability Generalization Bound $$\bigl|\mathbb{E}[\text{Test Error}] - \mathbb{E}[\text{Train Error}]\bigr| \leq \beta_n$$
More Stability → Better Generalization

This result formally connects stability to generalization: a stable algorithm (one that doesn't change much when the training set changes slightly) will have a small gap between training and test error — meaning it generalizes well.

Formal Definition

An algorithm $A$ is $\beta$-uniformly stable if, for any two training sets $S$ and $S'$ that differ in exactly one element, the losses of the learned models $A(S)$ and $A(S')$ differ by at most $\beta$ at every point. Regularization directly improves stability, and thus generalization.
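
The link between regularization and stability can be illustrated informally. The sketch below does not compute the uniform stability constant $\beta$ itself; it only measures how far the learned parameter of a one-dimensional ridge model (no intercept, made-up data) moves when one training target is replaced:

```python
def ridge_slope(xs, ys, lam):
    """1-D ridge without intercept: argmin_w sum (y - w x)^2 + lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]          # hypothetical training set S
ys_changed = [1.1, 1.9, 3.2, 8.0]  # S': the last target replaced by an outlier

changes = []
for lam in (0.0, 10.0, 100.0):
    delta = abs(ridge_slope(xs, ys, lam) - ridge_slope(xs, ys_changed, lam))
    changes.append(delta)
    print(lam, delta)   # the learned slope moves less as lambda grows
```

In this toy setting the change equals $16.4 / (30 + \lambda)$, so a larger penalty makes the fit less sensitive to the swapped sample.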


Next Lecture — 06 Mar

Probabilistic Graphical Models (Ch. 14). We extend parametric estimation to structured probability distributions over multiple interdependent variables — using graphs to represent conditional independence and enable efficient inference.