Parametric Methods
We know how to make decisions using probabilities. Now we learn how to estimate those probabilities from data — through Maximum Likelihood, Bayes' estimator, MAP, and their connections to the loss functions we minimize during training.
Parametric vs. Non-Parametric Methods
Before estimating, we choose a model family. This choice has major consequences for sample efficiency, flexibility, and computational cost.
Parametric
Assumes data follows a fixed-form distribution with a fixed number of parameters, regardless of dataset size. Compact, fast, but potentially underfitting.
Non-Parametric
No fixed-form assumption. Model complexity can grow with the data — more data, more complex model. Flexible, but data-hungry and slower.
| Property | Parametric | Non-Parametric |
|---|---|---|
| Model size | Fixed | Grows with data |
| Assumptions | Strong (e.g., Gaussian, linear) | Minimal — data speaks for itself |
| Training speed | Fast | Slower |
| Data required | Works with less | Needs more |
| Risk | Underfitting if wrong form | Overfitting if insufficient data |
| Examples | Linear regression, Naïve Bayes, HMMs | k-NN, Decision Trees, SVMs (nonlinear) |
Maximum Likelihood Estimation
We assume training data $\mathcal{X} = \{x^t\}_{t=1}^N$ are drawn independently and identically distributed (i.i.d.) from some distribution $p(x \mid \theta)$. The goal: find the parameters $\theta$ that make the observed data most probable.
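The likelihood of $\theta$ given the sample $\mathcal{X}$ is the probability of observing $\mathcal{X}$ under parameter $\theta$. Under i.i.d., this factorizes:
$$l(\theta \mid \mathcal{X}) \equiv p(\mathcal{X} \mid \theta) = \prod_{t=1}^N p(x^t \mid \theta)$$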
We maximize the log-likelihood, which has the same maximizer as the likelihood because the log is monotonically increasing:
$$L(\theta \mid \mathcal{X}) \equiv \log p(\mathcal{X} \mid \theta) = \sum_{t=1}^N \log p(x^t \mid \theta)$$
Taking the log converts a product of many small probabilities into a sum, which prevents numerical underflow, and derivatives of sums are far simpler than derivatives of products.
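The underflow argument can be checked directly. The following is a minimal sketch with made-up numbers: 2000 observations that each have probability 0.01 under the model, chosen only to force underflow in double precision.

```python
import math

# Hypothetical sample: 2000 i.i.d. observations, each with probability 0.01
# under the model (values chosen only to force underflow).
probs = [0.01] * 2000

likelihood = 1.0
for p in probs:
    likelihood *= p          # product of many small numbers

log_likelihood = sum(math.log(p) for p in probs)   # sum of logs

print(likelihood)        # the product has underflowed to zero
print(log_likelihood)    # the sum of logs stays finite
```

The raw product is unusable for comparing parameter values, while the sum of logs remains a well-behaved objective.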
MLE for Common Distributions
Bernoulli (Binary Outcomes)
$x \in \{0, 1\}$, parameter $p$ (probability of success):
$$P(x) = p^x (1 - p)^{1 - x}, \qquad \hat{p} = \frac{1}{N}\sum_{t=1}^N x^t$$
The MLE for the Bernoulli parameter is simply the empirical frequency — the proportion of successes in the sample.
Multinomial (K Outcomes)
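$K > 2$ mutually exclusive, exhaustive states. Each $x_i \in \{0,1\}$, $\sum_i x_i = 1$:
$$P(x_1, \ldots, x_K) = \prod_{i=1}^K p_i^{x_i}, \qquad \hat{p}_i = \frac{1}{N}\sum_{t=1}^N x_i^t$$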
Again: the empirical class frequencies. The MLE is the obvious frequency estimate in both cases.
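As a small illustration, both estimates reduce to counting. The data and the helper names `bernoulli_mle` and `multinomial_mle` below are hypothetical, introduced only for this sketch.

```python
from collections import Counter

def bernoulli_mle(xs):
    """MLE of p for 0/1 outcomes: the fraction of ones."""
    return sum(xs) / len(xs)

def multinomial_mle(labels):
    """MLE of each class probability: count of that class divided by N."""
    counts = Counter(labels)
    n = len(labels)
    return {k: c / n for k, c in counts.items()}

coin = [1, 0, 1, 1, 0, 1, 1, 0]                            # made-up tosses
weather = ["sun", "rain", "sun", "cloud", "sun", "rain"]   # made-up labels

p_hat = bernoulli_mle(coin)              # 5 successes out of 8
class_probs = multinomial_mle(weather)   # sun 3/6, rain 2/6, cloud 1/6
```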
Gaussian (Normal) Distribution
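$p(x) = \mathcal{N}(\mu, \sigma^2)$. Taking partial derivatives of the log-likelihood $L$ with respect to $\mu$ and $\sigma^2$ and setting them to zero yields:
$$m = \frac{1}{N}\sum_{t=1}^N x^t, \qquad s^2 = \frac{1}{N}\sum_{t=1}^N (x^t - m)^2$$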
The MLE estimate of variance uses $\frac{1}{N}$, making it a biased estimator — it systematically underestimates the true variance. The unbiased estimator uses $\frac{1}{N-1}$ (Bessel's correction). For large $N$, the difference is negligible.
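A quick way to see the two variance estimators side by side is Python's `statistics` module, where `pvariance` divides by $N$ and `variance` divides by $N-1$. The sample values are made up for illustration.

```python
import statistics

sample = [2.1, 1.9, 2.4, 2.0, 1.6]   # made-up measurements

m = statistics.fmean(sample)                 # MLE of the mean
var_mle = statistics.pvariance(sample)       # divides by N (the MLE)
var_unbiased = statistics.variance(sample)   # divides by N - 1 (Bessel)

n = len(sample)
ratio = var_mle / var_unbiased               # equals (N - 1) / N
```

With $N = 5$ the MLE is 20% smaller than the unbiased estimate; the ratio $(N-1)/N$ approaches 1 as $N$ grows.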
Bias and Variance of Estimators
An estimator $d(\mathcal{X})$ is itself a random variable — it varies across different training samples. We measure its quality through two components:
Bias
$b(d) = \mathbb{E}[d] - \theta$
How far is the expected estimate from the true value? A biased estimator is systematically wrong in one direction.
Variance
$\mathbb{E}[(d - \mathbb{E}[d])^2]$
How spread out are the estimates across different samples? High variance = sensitive to the specific training set used.
Mean Squared Error Decomposition
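The mean squared error of an estimator decomposes cleanly into variance and squared bias:
$$\mathbb{E}[(d - \theta)^2] = \mathbb{E}[(d - \mathbb{E}[d])^2] + (\mathbb{E}[d] - \theta)^2 = \text{Variance} + \text{Bias}^2$$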
Simpler estimators tend to have high bias but low variance. Complex estimators have low bias but high variance. The ideal estimator minimizes total MSE — which requires balancing both terms.
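The decomposition can be checked by simulation. The sketch below repeatedly draws small samples from a hypothetical Gaussian with variance 4, applies the $\frac{1}{N}$ variance estimator, and measures bias, variance, and MSE across trials. The sample size, trial count, and seed are arbitrary choices.

```python
import random

rng = random.Random(0)
true_var = 4.0       # variance of the hypothetical data-generating Gaussian
n, trials = 5, 20000

estimates = []
for _ in range(trials):
    xs = [rng.gauss(0.0, true_var ** 0.5) for _ in range(n)]
    m = sum(xs) / n
    estimates.append(sum((x - m) ** 2 for x in xs) / n)   # 1/N estimator

mean_est = sum(estimates) / trials
bias = mean_est - true_var                                 # negative: underestimates
variance = sum((d - mean_est) ** 2 for d in estimates) / trials
mse = sum((d - true_var) ** 2 for d in estimates) / trials
```

The empirical `mse` matches `bias ** 2 + variance` up to floating-point error, and the bias should come out close to the theoretical value $-\sigma^2 / N$.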
ERM = Maximum Likelihood Estimation
One result unifies the statistical and optimization views of machine learning:
Training a model by minimizing a loss function (Empirical Risk Minimization) is mathematically equivalent to estimating parameters via Maximum Likelihood — when the loss corresponds to a negative log-likelihood.
Squared Loss ↔ Gaussian Noise
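Assume output noise is Gaussian: $y = f_\theta(x) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Then the negative log-likelihood of a sample $\{(x^t, y^t)\}_{t=1}^N$ is:
$$-\sum_{t=1}^N \log p(y^t \mid x^t, \theta) = \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{t=1}^N \left(y^t - f_\theta(x^t)\right)^2$$
Only the second term depends on $\theta$.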
Maximizing the Gaussian likelihood = minimizing squared error. Linear regression with MSE loss implicitly assumes Gaussian noise.
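A numerical check of this equivalence, under the assumption of a known noise level `sigma`: for any two candidate models, the gap in Gaussian negative log-likelihood equals the gap in squared error divided by $2\sigma^2$, so both criteria rank models identically. Data and candidate models are made up.

```python
import math

def gaussian_nll(ys, preds, sigma):
    """Negative log-likelihood of ys under N(pred, sigma^2), summed over points."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + (y - p) ** 2 / (2 * sigma ** 2)
        for y, p in zip(ys, preds)
    )

def sse(ys, preds):
    """Sum of squared errors."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]          # made-up targets
sigma = 0.5                        # assumed known noise level

preds_a = [1.0 * x for x in xs]    # candidate model f(x) = x
preds_b = [0.5 * x for x in xs]    # candidate model f(x) = 0.5 x

nll_gap = gaussian_nll(ys, preds_b, sigma) - gaussian_nll(ys, preds_a, sigma)
sse_gap = sse(ys, preds_b) - sse(ys, preds_a)
```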
Cross-Entropy Loss ↔ Bernoulli / Categorical Likelihood
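For binary classification, assume $y \sim \text{Bernoulli}(f_\theta(x))$. Cross-entropy loss is the negative log-likelihood of this model:
$$E(\theta) = -\sum_{t=1}^N \left[ y^t \log f_\theta(x^t) + (1 - y^t) \log\left(1 - f_\theta(x^t)\right) \right]$$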
A classifier that assigns high probability to the correct class achieves both high likelihood and low cross-entropy. Cross-entropy penalizes confident wrong predictions very heavily (due to the log).
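The heavy penalty on confident mistakes is easy to see numerically. This sketch averages the Bernoulli negative log-likelihood over three made-up labels; the two prediction vectors differ only in the last entry.

```python
import math

def binary_cross_entropy(ys, probs):
    """Average negative Bernoulli log-likelihood of labels ys given P(y=1)=probs."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(ys, probs)
    ) / len(ys)

labels = [1, 0, 1]
good = [0.9, 0.1, 0.8]     # confident and correct
bad = [0.9, 0.1, 0.01]     # confidently wrong on the last point

loss_good = binary_cross_entropy(labels, good)
loss_bad = binary_cross_entropy(labels, bad)
```

A single prediction of 0.01 for a positive example contributes $-\log 0.01 \approx 4.6$ to the sum, which dominates the loss.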
ML parameters are statistical estimators of the underlying data-generating mechanism. Every time you train a model with a specific loss, you're implicitly making an assumption about the noise model for your data.
Bayes' Estimator
MLE treats $\theta$ as a fixed unknown. Bayesian estimation treats $\theta$ as a random variable with its own probability distribution — encoding prior knowledge about plausible values.
Prior and Posterior
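We specify a prior distribution $p(\theta)$ representing our beliefs about $\theta$ before seeing data. After observing $\mathcal{X}$, Bayes' rule gives us the posterior:
$$p(\theta \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \theta)\, p(\theta)}{p(\mathcal{X})} = \frac{p(\mathcal{X} \mid \theta)\, p(\theta)}{\int p(\mathcal{X} \mid \theta')\, p(\theta')\, d\theta'}$$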
Prediction for a New Point
Instead of committing to a single $\theta$, we integrate over all possible values, weighted by their posterior probability:
$$p(x \mid \mathcal{X}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{X})\, d\theta$$
For regression: $y = g(x \mid \mathcal{X}) = \int g(x \mid \theta)\, p(\theta \mid \mathcal{X})\, d\theta$ — the prediction is an average over all models, weighted by how well each fits the data.
Computing this integral is often intractable for complex posteriors. In practice, we use conjugate priors (which yield closed-form posteriors), variational approximations, or MCMC sampling.
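As one concrete case of a conjugate prior, a Beta prior on a Bernoulli parameter gives a Beta posterior whose parameters are updated by counting. The Beta-Bernoulli pairing, the prior values, and the helper name are assumptions chosen for illustration; the text above does not name a specific conjugate family.

```python
def beta_bernoulli_posterior(xs, a=1.0, b=1.0):
    """Posterior Beta(a', b') for a Bernoulli parameter under a Beta(a, b) prior."""
    successes = sum(xs)
    failures = len(xs) - successes
    return a + successes, b + failures

tosses = [1, 1, 0, 1]                        # made-up data
a_post, b_post = beta_bernoulli_posterior(tosses, a=2.0, b=2.0)

# The posterior predictive P(x_new = 1 | data) integrates p against the
# Beta(a', b') posterior, which equals the posterior mean a' / (a' + b').
p_next = a_post / (a_post + b_post)
```

Here the predictive integral has a closed form, so no sampling or approximation is needed.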
MAP Estimation
When the full posterior integral is too costly, we can collapse the posterior to a single point. Two natural choices:
MAP — Maximum A Posteriori
The mode (peak) of the posterior: $\theta_{\text{MAP}} = \arg\max_\theta\; p(\theta \mid \mathcal{X})$
Fast to compute. Good when the posterior has a sharp, well-defined peak.
Bayes' Estimator — Posterior Mean
The expected value: $\theta_{\text{Bayes}} = \mathbb{E}[\theta \mid \mathcal{X}] = \int \theta\, p(\theta \mid \mathcal{X})\, d\theta$
Minimizes expected squared error. Better when the posterior is asymmetric.
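Because $\log p(\theta \mid \mathcal{X}) = \log p(\mathcal{X} \mid \theta) + \log p(\theta) - \log p(\mathcal{X})$ and the last term does not depend on $\theta$, MAP maximizes the log-likelihood plus the log-prior, so the log-prior acts as a regularizer. A zero-mean Gaussian prior on $\theta$ leads to L2 (ridge) regularization. A zero-mean Laplace prior leads to L1 (lasso) regularization. Regularization is Bayesian reasoning in disguise.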
When the prior is uniform (no preference over $\theta$), MAP = MLE. The prior adds no information, so data alone determines the estimate.
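A minimal sketch of MAP for the mean of a Gaussian with known noise variance, assuming a Gaussian prior $\mathcal{N}(\mu_0, \tau^2)$ on the mean. In that setting the MAP estimate is a precision-weighted average of the sample mean and the prior mean. The symbols `mu0`, `tau2`, and `sigma2` and the data are illustrative assumptions.

```python
def map_gaussian_mean(xs, sigma2, mu0, tau2):
    """MAP estimate of the mean with known noise variance sigma2
    and prior N(mu0, tau2) on the mean."""
    n = len(xs)
    m = sum(xs) / n                               # MLE (sample mean)
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)    # weight on the data
    return w * m + (1 - w) * mu0

xs = [2.2, 1.8, 2.5, 2.3]                 # made-up sample with mean 2.2
mle = sum(xs) / len(xs)

tight = map_gaussian_mean(xs, sigma2=1.0, mu0=0.0, tau2=0.1)    # strong prior
flat = map_gaussian_mean(xs, sigma2=1.0, mu0=0.0, tau2=1e12)    # nearly uniform prior
```

The strong prior pulls the estimate toward `mu0 = 0`, which is the shrinkage behaviour of L2 regularization, while the nearly flat prior recovers the MLE.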
Comparing the Three Estimators
| Estimator | Formula | When to use |
|---|---|---|
| ML | $\arg\max_\theta\; p(\mathcal{X} \mid \theta)$ | No prior knowledge. Large samples. Computationally simple. |
| MAP | $\arg\max_\theta\; p(\theta \mid \mathcal{X})$ | Have prior knowledge. Want regularization. Need a point estimate. |
| Bayes' (mean) | $\mathbb{E}[\theta \mid \mathcal{X}]$ | Need a point estimate that minimizes expected squared error. Posterior is asymmetric or broad. |
| Full Bayesian | $p(\theta \mid \mathcal{X})$ (entire distribution) | Maximum uncertainty quantification. Computationally expensive. |
When the posterior is Gaussian (symmetric, unimodal), the mode and mean coincide: $\theta_{\text{MAP}} = \theta_{\text{Bayes}}$. This happens, for example, when a Gaussian prior is placed on the mean of a Gaussian likelihood with known variance, which is a key reason Gaussian models are analytically tractable.
Parametric Classification
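To classify using Bayes' rule, we need $P(C_i)$ and $p(\mathbf{x} \mid C_i)$. We model the class-conditional densities as Gaussians and estimate parameters from the training data. For a scalar input $x$ with class means $\mu_i$ and standard deviations $\sigma_i$, the log-discriminant is, up to a constant shared by all classes, $g_i(x) = -\log \sigma_i - \frac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i)$.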
Equal Variances → Linear Boundary
If $\sigma_1 = \sigma_2$ (same variance per class), the quadratic terms cancel and we get a single linear boundary at the midpoint between means (shifted by the prior ratio).
If all priors $P(C_i)$ are equal AND all variances are equal, the discriminant reduces to: assign $\mathbf{x}$ to the class with the nearest mean. This is the simplest possible parametric classifier.
Different Variances → Quadratic Boundary
If $\sigma_1 \neq \sigma_2$, the quadratic terms do not cancel. Setting the two class discriminants equal gives a quadratic equation in $x$, so there can be up to two decision thresholds.
Equal variances
One linear boundary. Halfway between means (adjusted by prior). Simpler, less flexible.
Different variances
Two boundaries (quadratic). The narrower Gaussian can dominate in its peak region. More flexible.
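The two cases can be computed explicitly for one-dimensional Gaussians. The sketch below equates the log-discriminants $-\log \sigma_i - \frac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i)$ of two classes and solves the resulting equation $ax^2 + bx + c = 0$. The function name and the class parameters are hypothetical, and distinct class means are assumed.

```python
import math

def boundaries(m1, s1, p1, m2, s2, p2):
    """Points where the two 1-D Gaussian log-discriminants are equal.
    Assumes the two class means differ."""
    a = 1 / (2 * s2 ** 2) - 1 / (2 * s1 ** 2)
    b = m1 / s1 ** 2 - m2 / s2 ** 2
    c = (m2 ** 2 / (2 * s2 ** 2) - m1 ** 2 / (2 * s1 ** 2)
         + math.log(s2 / s1) + math.log(p1 / p2))
    if abs(a) < 1e-12:                 # equal variances: linear equation
        return [-c / b]
    disc = b ** 2 - 4 * a * c
    if disc < 0:
        return []                      # one class wins everywhere
    r = math.sqrt(disc)
    return sorted([(-b - r) / (2 * a), (-b + r) / (2 * a)])

equal_var = boundaries(0.0, 1.0, 0.5, 4.0, 1.0, 0.5)   # single threshold
diff_var = boundaries(0.0, 1.0, 0.5, 4.0, 2.0, 0.5)    # two thresholds
```

With equal variances and equal priors the single threshold sits at the midpoint of the means. With unequal variances, the narrower class is chosen only between the two thresholds, and the wider class wins on both tails.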
Linear Regression via MLE
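We model the relationship between input $x$ and continuous output $r$ as:
$$r = \mathbf{w}^T \mathbf{x} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$
where $\mathbf{x}$ is the input vector with a constant 1 included for the intercept.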
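Assuming Gaussian noise, the MLE for the weights $\mathbf{w}$ is obtained by maximizing the log-likelihood, which reduces to minimizing the sum of squared residuals:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{t=1}^N \left(r^t - \mathbf{w}^T \mathbf{x}^t\right)^2$$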
Closed-Form Solution
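Taking partial derivatives and setting them to zero, the optimal weights satisfy the normal equations:
$$\mathbf{A}^T \mathbf{A}\, \mathbf{w} = \mathbf{A}^T \mathbf{y} \quad\Longrightarrow\quad \mathbf{w}^* = (\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T \mathbf{y}$$
where $\mathbf{A}$ is the design matrix (each row is an input vector), $\mathbf{y}$ is the vector of target values $r^t$, and $\mathbf{A}^T \mathbf{A}$ is assumed to be invertible.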
The derivation shows that minimizing MSE follows directly from the assumption that residuals are normally distributed. If the noise is not Gaussian, MSE may no longer be the right loss function.
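For a single input feature plus an intercept, the normal equations have a familiar closed form. The following sketch uses made-up data and a hypothetical helper `fit_line`; it is an illustration of the 1-D case, not a general solver.

```python
def fit_line(xs, rs):
    """Least-squares fit of r = w1 * x + w0 (the normal equations in 1-D)."""
    n = len(xs)
    x_bar = sum(xs) / n
    r_bar = sum(rs) / n
    sxy = sum((x - x_bar) * (r - r_bar) for x, r in zip(xs, rs))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = r_bar - w1 * x_bar
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
rs = [1.1, 2.9, 5.2, 6.8, 9.0]     # made-up targets, roughly r = 2x + 1

w0, w1 = fit_line(xs, rs)
residuals = [r - (w1 * x + w0) for x, r in zip(xs, rs)]
```

At the solution the residuals sum to zero and are uncorrelated with the inputs, which is exactly what the normal equations state.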
Polynomial Regression
We extend linear regression by adding polynomial features. Though the function is nonlinear in $x$, it is linear in the parameters, so the same least-squares machinery applies:
$$g(x \mid \mathbf{w}) = \sum_{k=0}^K w_k x^k = \mathbf{w}^T \boldsymbol{\phi}(x)$$
where $\boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^K]^T$ is the feature map. The design matrix $\mathbf{A}$ has rows $\boldsymbol{\phi}(x^t)^T$, and the solution is again $\mathbf{w}^* = (\mathbf{A}^T \mathbf{A})^{-1}\mathbf{A}^T \mathbf{y}$.
As polynomial degree increases, the MLE coefficients tend to grow in magnitude — the polynomial oscillates wildly to pass through every training point. This is a key symptom of overfitting. Regularization controls this.
Regression Error Metrics
Different error functions encode different assumptions about what kinds of mistakes matter most:
- Squared Error $\sum_t (r^t - g(x^t))^2$ — penalizes large errors quadratically. Follows from Gaussian noise assumption. Used to compute $R^2$.
- Absolute Error $\sum_t |r^t - g(x^t)|$ — robust to outliers. Follows from Laplace noise assumption. Penalizes all errors linearly.
- Relative Squared Error $\frac{\sum_t (r^t - g(x^t))^2}{\sum_t (r^t - \bar{r})^2}$: the squared error normalized by the total variation of $r$ around its mean $\bar{r}$. Close to 0: model explains the output. Close to 1: model adds no value over just predicting the mean.
- $\varepsilon$-sensitive Error — ignore errors smaller than $\varepsilon$; only penalize errors that exceed the threshold. Used in Support Vector Regression.
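The coefficient of determination is $R^2 = 1 - E_{\text{RSE}}$, where $E_{\text{RSE}}$ is the relative squared error. $R^2 \approx 1$: excellent fit, the model explains almost all variance. $R^2 \approx 0$: the model is no better than predicting the mean. $R^2 < 0$: the model is worse than predicting the mean (possible with test data).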
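The four error measures and $R^2$ can be computed side by side. In this sketch the $\varepsilon$-sensitive error is taken to be linear outside the tube, $\max(0, |r - g| - \varepsilon)$; the function name, the default `eps`, and the data are illustrative assumptions.

```python
def regression_errors(rs, preds, eps=0.5):
    """Return the error measures for targets rs and predictions preds."""
    r_bar = sum(rs) / len(rs)
    diffs = [r - g for r, g in zip(rs, preds)]
    squared = sum(d ** 2 for d in diffs)
    absolute = sum(abs(d) for d in diffs)
    relative = squared / sum((r - r_bar) ** 2 for r in rs)
    # epsilon-sensitive: errors inside the eps tube cost nothing
    eps_sensitive = sum(max(0.0, abs(d) - eps) for d in diffs)
    return {"squared": squared, "absolute": absolute,
            "relative": relative, "r2": 1 - relative,
            "eps_sensitive": eps_sensitive}

rs = [1.0, 2.0, 3.0, 4.0]
preds = [1.0, 2.0, 3.0, 6.0]       # three exact predictions, one large miss
errs = regression_errors(rs, preds)
```

A single miss of size 2 costs 4 under squared error but only 2 under absolute error, which is why absolute error is the more outlier-robust of the two.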
Bias–Variance in Regression
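The expected squared error of a regression estimator $g$ at a point $x$, averaged over the noise in $r$ and over training samples $\mathcal{X}$, decomposes into three components, of which only the noise term is irreducible:
$$\mathbb{E}\left[(r - g(x))^2 \mid x\right] = \underbrace{\left(\mathbb{E}[r \mid x] - \mathbb{E}_{\mathcal{X}}[g(x)]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_{\mathcal{X}}\left[\left(g(x) - \mathbb{E}_{\mathcal{X}}[g(x)]\right)^2\right]}_{\text{Variance}} + \underbrace{\mathbb{E}\left[(r - \mathbb{E}[r \mid x])^2 \mid x\right]}_{\text{Noise}}$$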
Bias²
Systematic error from wrong assumptions. A constant predictor $g(x) = 2$ ignores the data, so it has zero variance but typically large bias.
Variance
Sensitivity to the specific training sample. A high-degree polynomial interpolates training data but varies wildly with different samples.
Irreducible noise
The inherent randomness in $r$ that no model can explain. $\text{Var}(r)$ does not depend on $g$ or $\mathcal{X}$.
As model complexity increases: bias decreases (better fit to true function) but variance increases (more sensitive to training data). The optimal model minimizes total error = Bias² + Variance + Noise. This is the fundamental tension in statistical learning.
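The trade-off shows up even with two very simple estimators. This simulation assumes a hypothetical true function $f(x) = 2\sin(1.5x)$ with unit-variance Gaussian noise and compares a constant predictor that always outputs 2 against one that predicts the average of the observed outputs. The function, grid, sample count, and seed are all illustrative choices.

```python
import math
import random

def f(x):
    """Hypothetical true function used only for this simulation."""
    return 2 * math.sin(1.5 * x)

rng = random.Random(1)
xs = [5 * t / 19 for t in range(20)]         # fixed inputs in [0, 5]
n_samples = 500

# Estimator A: ignore the data and always predict 2.
# Estimator B: predict the average of the observed outputs.
fits_a, fits_b = [], []
for _ in range(n_samples):
    rs = [f(x) + rng.gauss(0.0, 1.0) for x in xs]
    fits_a.append(2.0)
    fits_b.append(sum(rs) / len(rs))

def bias2_and_variance(fits):
    g_bar = sum(fits) / len(fits)                          # average fit
    bias2 = sum((g_bar - f(x)) ** 2 for x in xs) / len(xs)
    variance = sum((g - g_bar) ** 2 for g in fits) / len(fits)
    return bias2, variance

bias2_a, var_a = bias2_and_variance(fits_a)
bias2_b, var_b = bias2_and_variance(fits_b)
```

Estimator A has exactly zero variance but a larger squared bias. Estimator B lowers the bias at the cost of a small variance, because its output depends on the sample.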
Estimating with Cross-Validation
We cannot compute bias directly (we don't know the true function $f$). But we can estimate total generalization error:
- Split data into training and validation sets
- Fit models of increasing complexity on training set
- Training error decreases monotonically with complexity
- Validation error first decreases, then increases — the elbow marks optimal complexity
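The steps above can be sketched with polynomial regression on synthetic data. Everything here is an assumption made for illustration: the target $\sin(3x)$, the noise level, the even/odd split, and the small Gaussian-elimination solver used in place of a linear algebra library.

```python
import math
import random

def solve(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[piv] = m[piv], m[col]
        for i in range(col + 1, n):
            factor = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= factor * m[col][j]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w

def fit_poly(xs, rs, degree):
    """Least-squares polynomial fit via the normal equations A^T A w = A^T r."""
    rows = [[x ** k for k in range(degree + 1)] for x in xs]
    ata = [[sum(row[i] * row[j] for row in rows) for j in range(degree + 1)]
           for i in range(degree + 1)]
    atr = [sum(row[i] * r for row, r in zip(rows, rs)) for i in range(degree + 1)]
    return solve(ata, atr)

def mse(w, xs, rs):
    """Mean squared error of the polynomial with coefficients w."""
    preds = [sum(wk * x ** k for k, wk in enumerate(w)) for x in xs]
    return sum((r - g) ** 2 for r, g in zip(rs, preds)) / len(xs)

rng = random.Random(0)
xs = [-1 + 2 * t / 39 for t in range(40)]
rs = [math.sin(3 * x) + rng.gauss(0.0, 0.2) for x in xs]   # synthetic data
train_x, train_r = xs[0::2], rs[0::2]      # even indices for training
val_x, val_r = xs[1::2], rs[1::2]          # odd indices for validation

degrees = range(0, 7)
train_err, val_err = [], []
for d in degrees:
    w = fit_poly(train_x, train_r, d)
    train_err.append(mse(w, train_x, train_r))
    val_err.append(mse(w, val_x, val_r))

best_degree = min(degrees, key=lambda d: val_err[d])   # lowest validation error
```

Because each polynomial family contains the previous one, `train_err` cannot increase with degree, while `val_err` is free to rise once the extra flexibility starts fitting noise.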
Model Selection Strategies
How do we choose the right model complexity? Several principled strategies exist:
Cross-Validation
Measure generalization accuracy by testing on data unused during training. Computationally expensive but model-agnostic and reliable.
Regularization
Penalize model complexity directly: $E' = \text{error on data} + \lambda \cdot \text{model complexity}$. AIC and BIC are related criteria that penalize the number of free parameters.
Minimum Description Length
Choose the model that gives the shortest total description of data + model. Grounded in Kolmogorov complexity (Occam's Razor formalized).
Structural Risk Minimisation
Minimize an upper bound on generalization error that includes both empirical error and a term based on VC dimension.
Regularization in Practice
As polynomial degree increases, coefficient magnitudes blow up. Adding a penalty on coefficient size controls this:
- L2 regularization (Ridge): penalty $\lambda \|\mathbf{w}\|_2^2$. Shrinks coefficients toward zero. Equivalent to a Gaussian prior (MAP). Has a unique solution for any $\lambda > 0$.
- L1 regularization (Lasso) — penalty $\lambda \|\mathbf{w}\|_1$. Drives many coefficients to exactly zero (sparse solutions). Equivalent to Laplace prior.
- $\lambda$ tuning — $\lambda = 0$: no regularization (pure MLE). $\lambda \to \infty$: all coefficients forced to zero. Optimal $\lambda$ is found via cross-validation.
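The effect of $\lambda$ is easiest to see in the simplest possible ridge problem: a single weight and no intercept, where minimizing $\sum_t (r^t - w x^t)^2 + \lambda w^2$ gives $w = \frac{\sum_t x^t r^t}{\sum_t (x^t)^2 + \lambda}$. The one-weight model and the data are simplifying assumptions for illustration.

```python
def ridge_slope(xs, rs, lam):
    """Minimizer of sum_t (r - w x)^2 + lam * w^2 for a single weight w."""
    return sum(x * r for x, r in zip(xs, rs)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
rs = [2.0, 4.1, 5.9]                 # made-up targets, roughly r = 2x

lams = [0.0, 1.0, 10.0, 1000.0]
slopes = [ridge_slope(xs, rs, lam) for lam in lams]   # shrinks as lam grows
```

At $\lambda = 0$ the slope is the plain least-squares estimate, and it decreases monotonically toward zero as $\lambda$ grows.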
Bayesian Model Selection
In the Bayesian framework: place a prior over models $p(\text{model})$. The posterior over models $p(\text{model} \mid \text{data})$ automatically penalizes complexity (Occam's Razor emerges from the math). Simpler models that fit the data well receive higher posterior probability.
Stability
An ML algorithm is stable if small perturbations to the training set — adding, removing, or changing one sample — cause only small changes in the learned model or its predictions.
High Stability
Simple, rigid models. Low sensitivity to data. Low variance. Higher bias. Risk: underfitting.
Low Stability
Complex, flexible models. High sensitivity to data. High variance. Lower bias. Risk: overfitting.
Stability → Generalization Bound
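If an algorithm has uniform stability $\beta_n$ (where $\beta_n \to 0$ as $n \to \infty$), then its expected generalization gap is bounded by $\beta_n$:
$$\left| \mathbb{E}_S\left[ R(A(S)) - \hat{R}_S(A(S)) \right] \right| \le \beta_n$$
where $R$ is the true risk and $\hat{R}_S$ is the empirical risk on the training set $S$ of size $n$.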
This result formally connects stability to generalization: a stable algorithm (one that doesn't change much when the training set changes slightly) will have a small gap between training and test error — meaning it generalizes well.
An algorithm $A$ is $\beta$-uniformly stable if, for any two training sets $S$ and $S'$ that differ by exactly one element, the difference in loss between the models produced by $A(S)$ and $A(S')$ is bounded by $\beta$ at every point. Regularization directly improves stability — and thus generalization.
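The link between regularization and stability can be illustrated with a one-weight ridge model (no intercept, closed-form solution). Two training sets $S$ and $S'$ differ in a single target value, and we measure how far the prediction at a query point moves. The data, the query point, and the $\lambda$ values are made up, and this is an empirical illustration rather than a computation of $\beta$.

```python
def ridge_slope(xs, rs, lam):
    """Minimizer of sum_t (r - w x)^2 + lam * w^2 for a single weight w."""
    return sum(x * r for x, r in zip(xs, rs)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
rs = [1.2, 1.9, 3.2, 3.9]            # made-up training set S
rs_changed = rs[:]
rs_changed[3] = 8.0                  # S' differs from S in one target

def prediction_shift(lam, x_query=2.5):
    """Change in the prediction at x_query when S is replaced by S'."""
    w = ridge_slope(xs, rs, lam)
    w_changed = ridge_slope(xs, rs_changed, lam)
    return abs(w - w_changed) * abs(x_query)

shift_weak = prediction_shift(lam=0.0)      # no regularization
shift_strong = prediction_shift(lam=30.0)   # heavier regularization
```

In this setup the shift is proportional to $\frac{1}{\sum_t (x^t)^2 + \lambda}$, so increasing $\lambda$ always makes the learned model less sensitive to the changed sample.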
Probabilistic Graphical Models (Ch. 14). We extend parametric estimation to structured probability distributions over multiple interdependent variables — using graphs to represent conditional independence and enable efficient inference.