Bayesian Decision Theory
How do we make optimal decisions when outcomes are uncertain? Bayesian Decision Theory combines probabilistic beliefs with a formal framework of actions, losses, and expected risk to derive principled decision rules.
Why This Chapter?
Lecture 1 showed that learning is uncertain at every level: the true concept class, the right hypothesis, the amount of data needed. Now we ask: given that uncertainty, how should we act?
We have uncertainty about which class a data point belongs to. We have uncertainty about model parameters. We have costs attached to being wrong. How do we make the best possible decision?
The answer is Bayesian Decision Theory — an extension of Bayesian statistics that adds actions and consequences to probabilistic reasoning.
Bayesian Statistics
Probability expresses a degree of belief in an event — not just a long-run frequency. Beliefs may come from prior experiments or personal knowledge, and are updated as new evidence arrives.
Bayesian Statistics
The process of updating beliefs with evidence using probability as a measure of uncertainty. Starts with a prior; revises it into a posterior after observing data.
Bayesian Decision Theory
Extends Bayesian statistics by adding actions and consequences. Tells you how to choose the best action when outcomes are uncertain, by minimizing expected loss.
The four components of any Bayesian decision problem:
- Posterior probabilities — your updated beliefs about the state of the world
- A set of possible actions — what choices are available
- A loss / utility function — what you care about: cost, risk, reward
- A decision rule — select the action with lowest expected loss (or highest expected utility)
Probability and Inference
Consider a coin toss. Even if the physical process is deterministic, we treat it as random because the causal variables (material, initial position, force, momentum) are unobservable. We work with what we can observe.
Bernoulli Model
Let $X \in \{0,1\}$ represent the coin outcome (1 for Heads, 0 for Tails). The Bernoulli distribution with parameter $p_0$ is:
$$P(X = x) = p_0^{x}\,(1 - p_0)^{1 - x}, \quad x \in \{0, 1\}$$
That is: $P(X=1) = p_0$ and $P(X=0) = 1 - p_0$.
Estimation from Data
Given a sample $\mathcal{X} = \{x^t\}_{t=1}^N$, we estimate the unknown parameter as the fraction of Heads in the sample:
$$\hat{p}_0 = \frac{\sum_{t=1}^{N} x^t}{N}$$
Given estimated $\hat{p}_0$, predict Heads if $\hat{p}_0 > \tfrac{1}{2}$, Tails otherwise — this minimizes expected misclassification.
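As a minimal sketch of this estimate-then-predict step, the snippet below computes the sample proportion of Heads and applies the $\tfrac{1}{2}$ threshold. The function names and the toss data are hypothetical, chosen only for illustration.

```python
def estimate_p0(sample):
    """Estimate P(X = 1) as the fraction of ones in the sample."""
    if not sample:
        raise ValueError("sample must contain at least one toss")
    return sum(sample) / len(sample)


def predict_next(p0_hat):
    """Predict the more probable outcome under the estimated parameter."""
    return "Heads" if p0_hat > 0.5 else "Tails"


# Hypothetical data: 1 = Heads, 0 = Tails.
tosses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
p0_hat = estimate_p0(tosses)
print(p0_hat, predict_next(p0_hat))
```

With 7 Heads in 10 tosses the estimate is 0.7, so the prediction for the next toss is Heads.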
Bayes' Rule
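Classification example: credit scoring. Input $\mathbf{x} = [x_1, x_2]^T$ (income, savings). Output: $C \in \{0, 1\}$ (low-risk, high-risk). We want $P(C \mid \mathbf{x})$, which Bayes' rule expresses in terms of quantities we can estimate from data:
$$P(C_k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_k)\, P(C_k)}{P(\mathbf{x})}$$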
Prior $P(C_k)$
What fraction of all cases belong to class $C_k$? Our belief before seeing the input $\mathbf{x}$.
Likelihood $P(\mathbf{x} \mid C_k)$
Assuming class $C_k$, how probable is seeing this observation $\mathbf{x}$? Models the feature distribution per class.
Evidence $P(\mathbf{x})$
Marginal probability of $\mathbf{x}$ across all classes. A normalizing constant — same for all classes, so often ignored in decisions.
Posterior $P(C_k \mid \mathbf{x})$
Our updated belief after seeing $\mathbf{x}$. The key quantity for making decisions. Choose the class with highest posterior.
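The four quantities fit together as posterior = likelihood × prior / evidence. The sketch below works through that arithmetic for a single applicant. All numbers are hypothetical, and `posteriors` is an illustrative helper rather than part of any library.

```python
def posteriors(priors, likelihoods):
    """Return P(C_k | x) for each class from P(C_k) and P(x | C_k)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(x) = sum_k P(x | C_k) P(C_k)
    if evidence == 0:
        raise ValueError("x has zero probability under every class")
    return [j / evidence for j in joint]


# Hypothetical values for one applicant x; index 0 = low-risk, 1 = high-risk.
priors = [0.8, 0.2]        # most customers are low-risk
likelihoods = [0.1, 0.6]   # but this x is far more typical of high-risk cases
post = posteriors(priors, likelihoods)
predicted = max(range(len(post)), key=post.__getitem__)
print(post, predicted)
```

Here the prior favors low-risk, yet the posterior is roughly 0.4 versus 0.6 in favor of high-risk: the evidence in $\mathbf{x}$ outweighs the prior.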
Binary Classification Rule
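For two classes, choose class 1 if:
$$P(C = 1 \mid \mathbf{x}) > P(C = 0 \mid \mathbf{x}), \quad \text{equivalently} \quad P(C = 1 \mid \mathbf{x}) > 0.5$$
and class 0 otherwise.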
K > 2 Classes
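For mutually exclusive and exhaustive classes (empty pairwise intersections, union covers all observations), the priors satisfy $P(C_k) \ge 0$ and $\sum_{k=1}^{K} P(C_k) = 1$, and the posterior of each class is:
$$P(C_k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_k)\, P(C_k)}{\sum_{j=1}^{K} P(\mathbf{x} \mid C_j)\, P(C_j)}$$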
Decide by choosing: $\hat{k} = \arg\max_k\; P(C_k \mid \mathbf{x})$
Losses and Risks
Not all mistakes are equally costly. For a bank, accepting a bad loan and rejecting a good customer have different financial consequences. We formalize this with a loss function.
Action $\alpha_i$ = assign input to class $C_i$. Loss $\lambda_{ik}$ = cost of taking action $\alpha_i$ when the true state is $C_k$.
Expected Risk
The expected risk of action $\alpha_i$ is the loss averaged over all possible true states, weighted by their posterior probabilities (Duda & Hart, 1973):
$$R(\alpha_i \mid \mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k \mid \mathbf{x})$$
Choose the action $\alpha_i$ that minimizes expected risk: $\hat{\alpha} = \arg\min_i\; R(\alpha_i \mid \mathbf{x})$
Losses are often asymmetric. In fraud detection, failing to catch fraud (false negative) may cost far more than incorrectly flagging a transaction (false positive). The loss matrix $\lambda_{ik}$ captures this asymmetry.
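To see how an asymmetric loss matrix changes the decision, the sketch below computes the posterior-weighted loss of each action and picks the smallest. The loss values and posterior are hypothetical, and the helper names are illustrative.

```python
def expected_risks(loss, post):
    """R(alpha_i | x) = sum_k loss[i][k] * post[k], one value per action."""
    return [sum(l_ik * p_k for l_ik, p_k in zip(row, post)) for row in loss]


def best_action(loss, post):
    """Index of the action with the lowest expected risk."""
    risks = expected_risks(loss, post)
    return min(range(len(risks)), key=risks.__getitem__)


# Hypothetical fraud example; index 0 = legitimate, 1 = fraud.
# loss[i][k] is the cost of deciding class i when the truth is class k.
loss = [
    [0.0, 10.0],  # decide "legitimate": a missed fraud costs 10
    [1.0, 0.0],   # decide "fraud": a false alarm costs 1
]
post = [0.85, 0.15]  # P(legitimate | x), P(fraud | x)
print(expected_risks(loss, post), best_action(loss, post))

# With equal (0/1) losses the same posterior leads to the opposite decision.
zero_one = [[0.0, 1.0], [1.0, 0.0]]
print(best_action(zero_one, post))
```

Although fraud has posterior probability of only 0.15, flagging the transaction has lower expected risk (0.85 versus 1.5), so the asymmetric loss overrides the most probable class.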
0/1 Loss
The simplest case: all mistakes are equally bad, all correct decisions have zero cost.
$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \neq k \end{cases}$$
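Plugging into expected risk:
$$R(\alpha_i \mid \mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k \mid \mathbf{x}) = \sum_{k \neq i} P(C_k \mid \mathbf{x}) = 1 - P(C_i \mid \mathbf{x})$$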
Minimizing risk under 0/1 loss = choosing the most probable class. This is why MAP (Maximum A Posteriori) classification is optimal under 0/1 loss.
The Reject Option
Sometimes misclassification is so costly that we prefer to abstain — output "I don't know" — rather than risk a wrong decision. This is the reject option, common in medical diagnosis and autonomous systems.
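Introduce a reject action $\alpha_{K+1}$ with fixed cost $\lambda_r$ (cost of abstaining), where $0 < \lambda_r < 1$, on top of the 0/1 loss:
$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda_r & \text{if } i = K+1 \\ 1 & \text{otherwise} \end{cases}$$
The risk of rejecting is then $R(\alpha_{K+1} \mid \mathbf{x}) = \lambda_r$, while the risk of choosing class $C_i$ is $R(\alpha_i \mid \mathbf{x}) = 1 - P(C_i \mid \mathbf{x})$.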
Classify if confident enough, reject otherwise. Formally: reject if even the lowest-risk classification has risk above $\lambda_r$, that is, if $\min_i R(\alpha_i \mid \mathbf{x}) > \lambda_r$. Under 0/1 loss this is equivalent to rejecting when $\max_k P(C_k \mid \mathbf{x}) < 1 - \lambda_r$: no single class is probable enough to justify committing to it.
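A minimal sketch of the reject rule, assuming 0/1 loss for classification and a fixed abstention cost. The function name and the use of `None` to signal rejection are illustrative choices, not a standard API.

```python
def classify_with_reject(post, reject_cost):
    """Return the index of the chosen class, or None to abstain.

    Assumes 0/1 loss for classification and a fixed cost `reject_cost`
    (lambda_r, between 0 and 1) for abstaining.
    """
    best = max(range(len(post)), key=post.__getitem__)
    # Under 0/1 loss the risk of choosing `best` is 1 - post[best];
    # abstain when that risk exceeds the cost of rejecting.
    if 1.0 - post[best] > reject_cost:
        return None
    return best


print(classify_with_reject([0.55, 0.45], reject_cost=0.2))  # too close to call
print(classify_with_reject([0.90, 0.10], reject_cost=0.2))  # confident enough
```

A smaller `reject_cost` widens the reject region, since abstaining becomes cheap relative to risking an error. As `reject_cost` approaches 1, rejection is almost never chosen.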
Equal losses + no reject
Single decision boundary at the crossing of class posteriors. Standard MAP classification.
Unequal losses
The boundary shifts toward the class whose misclassification is cheaper, enlarging the decision region of the class that is costlier to miss. The model is biased away from expensive errors.
With reject option
Two thresholds bracket a reject region around the boundary where posteriors are close.
Cost-sensitive learning
In high-stakes domains (fraud, cancer), the loss matrix often matters more than raw accuracy.
Discriminant Functions
Instead of computing posteriors explicitly, we can define a discriminant function $g_i(\mathbf{x})$ for each class. Assign $\mathbf{x}$ to $C_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$.
Common choices for $g_i(\mathbf{x})$:
- $g_i(\mathbf{x}) = P(C_i \mid \mathbf{x})$ — the posterior directly
- $g_i(\mathbf{x}) = P(\mathbf{x} \mid C_i)\, P(C_i)$ — joint, avoiding the evidence term
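- $g_i(\mathbf{x}) = \log P(\mathbf{x} \mid C_i) + \log P(C_i)$ — log of the joint, equal to the log-posterior up to an additive term shared by all classes (sums of logs avoid underflow from products of small probabilities)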
- $g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$ — negative risk (minimizing risk = maximizing this)
Binary Case (K = 2): Log-Odds
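For two classes, a single discriminant $g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$ suffices. The log-odds form is especially useful:
$$g(\mathbf{x}) = \log \frac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})} = \log \frac{P(\mathbf{x} \mid C_1)}{P(\mathbf{x} \mid C_2)} + \log \frac{P(C_1)}{P(C_2)}$$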
Choose $C_1$ if $g(\mathbf{x}) > 0$, $C_2$ otherwise. The boundary $g(\mathbf{x}) = 0$ is the decision surface.
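The sketch below evaluates a log-odds discriminant as a log-likelihood ratio plus a log-prior ratio. It assumes univariate Gaussian class-conditional densities, which is an illustrative modeling choice not made in the text. All parameter values are hypothetical.

```python
import math


def log_gaussian(x, mean, std):
    """Log density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def log_odds(x, prior1, mean1, std1, mean2, std2):
    """g(x) = log[p(x|C1) / p(x|C2)] + log[P(C1) / P(C2)] for two classes."""
    log_likelihood_ratio = log_gaussian(x, mean1, std1) - log_gaussian(x, mean2, std2)
    log_prior_ratio = math.log(prior1) - math.log(1.0 - prior1)
    return log_likelihood_ratio + log_prior_ratio


def decide(x, **params):
    """Choose C1 when the log-odds are positive, C2 otherwise."""
    return "C1" if log_odds(x, **params) > 0 else "C2"


# Hypothetical parameters: equal priors, equal spread, means at 0 and 2.
params = dict(prior1=0.5, mean1=0.0, std1=1.0, mean2=2.0, std2=1.0)
for x in (0.0, 1.0, 3.0):
    print(x, log_odds(x, **params), decide(x, **params))
```

With equal priors and equal variances the decision surface sits midway between the two means, at $x = 1$. Raising `prior1` adds a positive constant to $g(x)$ and moves the boundary toward the mean of $C_2$.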
Utility Theory
Losses can be replaced by utilities, the mirror image of loss. Instead of minimizing expected loss, we maximize expected utility:
$$EU(\alpha_i \mid \mathbf{x}) = \sum_{k} U_{ik}\, P(S_k \mid \mathbf{x})$$
where $U_{ik}$ is the utility (reward) of taking action $\alpha_i$ when the true state is $S_k$. The optimal action maximizes $EU$.
The two frameworks are equivalent: set $U_{ik} = -\lambda_{ik}$. Maximizing expected utility is the same as minimizing expected loss. The choice of framing depends on context — losses are natural for errors, utilities for rewards.
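The equivalence can be checked numerically: negate a loss matrix to obtain utilities and confirm that both rules pick the same action. The matrix and the posteriors below are hypothetical.

```python
def argmin_expected_loss(loss, post):
    """Index of the action with the lowest posterior-weighted loss."""
    risks = [sum(l * p for l, p in zip(row, post)) for row in loss]
    return min(range(len(risks)), key=risks.__getitem__)


def argmax_expected_utility(utility, post):
    """Index of the action with the highest posterior-weighted utility."""
    eus = [sum(u * p for u, p in zip(row, post)) for row in utility]
    return max(range(len(eus)), key=eus.__getitem__)


loss = [[0.0, 10.0], [1.0, 0.0]]                  # hypothetical loss matrix
utility = [[-l for l in row] for row in loss]     # U_ik = -lambda_ik

for post in ([0.95, 0.05], [0.85, 0.15], [0.5, 0.5]):
    a_loss = argmin_expected_loss(loss, post)
    a_util = argmax_expected_utility(utility, post)
    print(post, a_loss, a_util)
```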
Association Rules
Association learning discovers co-occurrence patterns in data — no labels required. The classic setting is market basket analysis: which products are bought together?
A rule $X \to Y$ states: people who buy/click/visit $X$ are also likely to buy/click/visit $Y$. The rule implies association, not causation. $X$ is the antecedent, $Y$ the consequent.
The famous example: diapers and baby food co-occur because of an unobserved latent variable — the presence of a baby. The association is real, but neither product causes the other to be purchased.
Measuring Rule Quality
Three complementary measures characterize how good a rule $X \to Y$ is:
Support
$P(X \cap Y)$ — how frequent is the pattern? Rules must be seen enough times to be statistically meaningful.
Confidence
$P(Y \mid X)$ — given $X$, how often does $Y$ follow? The strength of the rule. Should be significantly larger than $P(Y)$.
Lift
$\frac{P(Y \mid X)}{P(Y)}$ — the degree of dependence. Lift $= 1$: independent. Lift $> 1$: $X$ makes $Y$ more likely. Lift $< 1$: $X$ makes $Y$ less likely.
A good rule has high support (seen frequently enough to be statistically reliable), high confidence (the association is strong), and lift significantly greater than 1 (the items are not independent — $X$ genuinely predicts $Y$).
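The three measures reduce to counting transactions. The sketch below estimates them from a small hypothetical set of baskets. `rule_metrics` is an illustrative helper, and the probabilities are estimated as relative frequencies.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    x, y = set(antecedent), set(consequent)
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)          # baskets containing X
    n_y = sum(1 for t in transactions if y <= t)          # baskets containing Y
    n_xy = sum(1 for t in transactions if (x | y) <= t)   # baskets containing both
    if n_x == 0 or n_y == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_xy / n               # P(X, Y)
    confidence = n_xy / n_x          # P(Y | X)
    lift = confidence / (n_y / n)    # P(Y | X) / P(Y)
    return support, confidence, lift


# Hypothetical baskets.
baskets = [
    {"diapers", "baby food", "milk"},
    {"diapers", "baby food"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"baby food", "milk"},
]
print(rule_metrics(baskets, {"diapers"}, {"baby food"}))
```

For these baskets the rule diapers → baby food has support 0.4, confidence of about 0.67, and lift of about 1.11. A data set this small cannot establish that the lift differs meaningfully from 1.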
Apriori Algorithm
Finding all frequent item sets by brute force is exponential. The Apriori algorithm (Agrawal et al., 1996) uses a key insight to prune the search space efficiently.
If item set $\{X, Y\}$ is not frequent (support below threshold), then no superset $\{X, Y, Z\}$ can be frequent. This prunes entire branches of the search tree.
Algorithm Steps
- Step 1 — Find all frequent 1-item sets (single items with support ≥ threshold)
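- Step 2 — From frequent $k$-item sets, generate candidate $(k+1)$-item sets by joining them, discard any candidate that has an infrequent $k$-item subset, then count support and keep only the candidates that are frequent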
- Step 3 — Repeat until no new frequent sets are found
- Step 4 — Convert frequent sets to rules: for each frequent set, try moving items from antecedent to consequent. Keep rules with sufficient confidence
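The steps above can be sketched in a few dozen lines. This is a simplified illustration, not an optimized implementation: it rescans all transactions for every candidate, and the function names, thresholds, and baskets are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every item set with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: frequent 1-item sets.
    current = {}
    for item in {item for t in transactions for item in t}:
        single = frozenset([item])
        s = support(single)
        if s >= min_support:
            current[single] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        # Step 2: join frequent k-item sets into (k+1)-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        next_level = {}
        for c in candidates:
            # Pruning: a candidate with an infrequent k-item subset cannot be frequent.
            if any(frozenset(sub) not in current for sub in combinations(c, k)):
                continue
            s = support(c)
            if s >= min_support:
                next_level[c] = s
        current = next_level  # Step 3: repeat until no new frequent sets appear.
        k += 1
    return frequent


def rules(frequent, min_confidence):
    """Step 4: split each frequent set into antecedent -> consequent rules."""
    found = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                a = frozenset(antecedent)
                confidence = s / frequent[a]  # subsets of a frequent set are frequent
                if confidence >= min_confidence:
                    found.append((a, itemset - a, confidence))
    return found


# Hypothetical baskets and thresholds.
baskets = [
    {"diapers", "baby food", "milk"},
    {"diapers", "baby food"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"baby food", "milk"},
]
frequent_sets = apriori(baskets, min_support=0.4)
found_rules = rules(frequent_sets, min_confidence=0.6)
for a, c, conf in found_rules:
    print(sorted(a), "->", sorted(c), round(conf, 2))
```

In this example the triple {diapers, baby food, milk} is never counted: its subset {diapers, milk} is infrequent, so the candidate is pruned before any pass over the data.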
Apriori requires multiple passes over the data, one per item-set size. FP-growth instead compresses the database into an FP-Tree (Frequent Pattern Tree) using two scans and mines frequent sets from the tree without generating candidates, which is often significantly faster in practice.
Bayesian Perspective on Association
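From a probabilistic view, co-occurring events often point to a latent (hidden) variable. If the two purchases are assumed conditionally independent given that variable, and households without a baby contribute negligibly, the diaper–baby food association can be written as:
$$P(\text{diapers}, \text{baby food}) \approx P(\text{diapers} \mid \text{baby})\, P(\text{baby food} \mid \text{baby})\, P(\text{baby})$$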
This yields $\approx 0.56 \times 0.3 \approx 0.168$ — matching the observed co-occurrence (support). The association is real, but its cause is the unobserved latent variable (presence of a baby in the household).
Parametric Methods (Ch. 4). Now that we know how to make decisions using probabilities, we turn to the question of how to estimate those probabilities from training data — MLE, MAP, Bayes' estimator, and parametric classification.