Lecture 2 · Chapter 3 · 20 February

Bayesian Decision Theory

How do we make optimal decisions when outcomes are uncertain? Bayesian Decision Theory combines probabilistic beliefs with a formal framework of actions, losses, and expected risk to derive principled decision rules.

Builds on: VC Dimensions, PAC Learning, Inductive Bias
Chapter: 3 (Alpaydin)
Date: 20 February

00 · Context

Why This Chapter?

Lecture 1 showed that learning is uncertain at every level: the true concept class, the right hypothesis, the amount of data needed. Now we ask: given that uncertainty, how should we act?

Central Question

We have uncertainty about which class a data point belongs to. We have uncertainty about model parameters. We have costs attached to being wrong. How do we make the best possible decision?

The answer is Bayesian Decision Theory — an extension of Bayesian statistics that adds actions and consequences to probabilistic reasoning.

01 · Foundations

Bayesian Statistics

Bayesian Probability

Probability expresses a degree of belief in an event — not just a long-run frequency. Beliefs may come from prior experiments or personal knowledge, and are updated as new evidence arrives.

Bayesian Statistics

The process of updating beliefs with evidence using probability as a measure of uncertainty. Starts with a prior; revises it into a posterior after observing data.

Bayesian Decision Theory

Extends Bayesian statistics by adding actions and consequences. Tells you how to choose the best action when outcomes are uncertain, by minimizing expected loss.

The four components of any Bayesian decision problem:

  • Posterior probabilities — your updated beliefs about the state of the world
  • A set of possible actions — what choices are available
  • A loss / utility function — what you care about: cost, risk, reward
  • A decision rule — select the action with lowest expected loss (or highest expected utility)

02 · Inference

Probability and Inference

Consider a coin toss. Even if the physical process is deterministic, we treat it as random because the causal variables (material, initial position, force, momentum) are unobservable. We work with what we can observe.

Bernoulli Model

Let $X \in \{0,1\}$ represent the coin outcome. The Bernoulli distribution with parameter $p_0$ is:

Bernoulli Distribution $$P(X = x) = p_0^x (1 - p_0)^{1-x}$$

That is: $P(X=1) = p_0$ and $P(X=0) = 1 - p_0$.

Estimation from Data

Given a sample $\mathcal{X} = \{x^t\}_{t=1}^N$, we estimate the unknown parameter:

Frequency Estimator $$\hat{p}_0 = \frac{\#\{\text{Heads}\}}{\#\{\text{Tosses}\}} = \frac{\sum_t x^t}{N}$$

Prediction Rule

Given the estimate $\hat{p}_0$, predict Heads if $\hat{p}_0 > \tfrac{1}{2}$ and Tails otherwise. If the estimate is accurate, this choice minimizes the probability of misclassifying the next toss.
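The estimator and prediction rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function names and the ten-toss sample are made up for the example, with 1 standing for Heads and 0 for Tails.

```python
def estimate_p0(tosses):
    """Frequency estimator: fraction of tosses that came up heads (coded as 1)."""
    if not tosses:
        raise ValueError("need at least one toss")
    return sum(tosses) / len(tosses)


def predict_next(p0_hat):
    """Predict heads (1) if the estimate exceeds 1/2, tails (0) otherwise."""
    return 1 if p0_hat > 0.5 else 0


# Hypothetical sample: 1 = heads, 0 = tails (7 heads in 10 tosses)
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
p0_hat = estimate_p0(sample)
prediction = predict_next(p0_hat)
print(p0_hat, prediction)
```

With seven heads in ten tosses the estimate is 0.7, so the rule predicts Heads.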

03 · The Core Formula

Bayes' Rule

Classification example: credit scoring. Input $\mathbf{x} = [x_1, x_2]^T$ (income, savings). Output: $C \in \{0, 1\}$ (low-risk, high-risk). We want $P(C \mid \mathbf{x})$.

Bayes' Rule $$\underbrace{P(C_k \mid \mathbf{x})}_{\text{posterior}} = \frac{\overbrace{P(\mathbf{x} \mid C_k)}^{\text{likelihood}} \cdot \overbrace{P(C_k)}^{\text{prior}}}{\underbrace{P(\mathbf{x})}_{\text{evidence}}}$$

Prior $P(C_k)$

What fraction of all cases belong to class $C_k$? Our belief before seeing the input $\mathbf{x}$.

Likelihood $P(\mathbf{x} \mid C_k)$

Assuming class $C_k$, how probable is seeing this observation $\mathbf{x}$? Models the feature distribution per class.

Evidence $P(\mathbf{x})$

Marginal probability of $\mathbf{x}$ across all classes. A normalizing constant — same for all classes, so often ignored in decisions.

Posterior $P(C_k \mid \mathbf{x})$

Our updated belief after seeing $\mathbf{x}$. The key quantity for making decisions. Choose the class with highest posterior.

Binary Classification Rule

For two classes, choose class 1 if:

Binary Decision $$P(C_1 \mid \mathbf{x}) > P(C_0 \mid \mathbf{x})$$

K > 2 Classes

For mutually exclusive and exhaustive classes (empty pairwise intersections, union covers all observations):

Posterior for K Classes $$P(C_k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_k)\, P(C_k)}{\sum_{j=1}^{K} P(\mathbf{x} \mid C_j)\, P(C_j)}$$

Decide by choosing: $\hat{k} = \arg\max_k\; P(C_k \mid \mathbf{x})$
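As a rough sketch of the K-class posterior and the arg max rule, the snippet below normalizes likelihood times prior over all classes. The likelihood and prior values are hypothetical numbers chosen only for illustration, and classes are indexed from 0 in the code.

```python
def posteriors(likelihoods, priors):
    """Bayes' rule for K classes: P(C_k | x) = P(x | C_k) P(C_k) / sum_j P(x | C_j) P(C_j)."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(x), the normalizing constant
    if evidence == 0:
        raise ValueError("evidence is zero: x is impossible under every class")
    return [j / evidence for j in joint]


def classify(likelihoods, priors):
    """Return the (0-based) index of the class with the highest posterior."""
    post = posteriors(likelihoods, priors)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical values for a single observation x and K = 3 classes
likelihoods = [0.2, 0.5, 0.1]   # P(x | C_k)
priors = [0.5, 0.3, 0.2]        # P(C_k), sums to 1
post = posteriors(likelihoods, priors)
chosen = classify(likelihoods, priors)
print(post, chosen)
```

Here the class with the largest prior is not the one chosen: the second class has a smaller prior but a much larger likelihood, so its posterior is highest.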

04 · Decision Making

Losses and Risks

Not all mistakes are equally costly. For a bank, accepting a bad loan and rejecting a good customer have different financial consequences. We formalize this with a loss function.

Setup

Action $\alpha_i$ = assign input to class $C_i$. Loss $\lambda_{ik}$ = cost of taking action $\alpha_i$ when the true state is $C_k$.

Expected Risk

The expected risk of action $\alpha_i$ is the loss averaged over all possible true states, weighted by their posterior probabilities (Duda & Hart, 1973):

Expected Risk $$R(\alpha_i \mid \mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k \mid \mathbf{x})$$

Decision Rule

Choose the action $\alpha_i$ that minimizes expected risk: $\hat{\alpha} = \arg\min_i\; R(\alpha_i \mid \mathbf{x})$

Asymmetric Losses

Losses are often asymmetric. In fraud detection, failing to catch fraud (false negative) may cost far more than incorrectly flagging a transaction (false positive). The loss matrix $\lambda_{ik}$ captures this asymmetry.
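To make the effect of an asymmetric loss matrix concrete, here is a small sketch that computes the expected risk of each action and picks the minimum. The fraud-style loss values (10 for a missed fraud, 1 for a false alarm) and the posterior are assumptions made up for this example.

```python
def expected_risks(loss, posterior):
    """R(alpha_i | x) = sum_k loss[i][k] * P(C_k | x), one value per action i."""
    return [sum(l_ik * p_k for l_ik, p_k in zip(row, posterior)) for row in loss]


def best_action(loss, posterior):
    """Index of the action with the smallest expected risk."""
    risks = expected_risks(loss, posterior)
    return min(range(len(risks)), key=risks.__getitem__)


# Hypothetical setup: states are 0 = legitimate, 1 = fraud;
# actions are 0 = approve, 1 = flag.
loss = [
    [0, 10],  # approve: free if legitimate, costs 10 if it was fraud
    [1, 0],   # flag: costs 1 if legitimate, free if it was fraud
]
posterior = [0.8, 0.2]  # P(legitimate | x), P(fraud | x)

risks = expected_risks(loss, posterior)
action = best_action(loss, posterior)
print(risks, action)
```

Approving has expected risk 10 × 0.2 = 2.0 and flagging has 1 × 0.8 = 0.8, so the rule flags the transaction even though "legitimate" is the more probable state.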

05 · Loss Functions

0/1 Loss

The simplest case: all mistakes are equally bad, all correct decisions have zero cost.

0/1 Loss Matrix $$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \quad (\text{correct}) \\ 1 & \text{if } i \neq k \quad (\text{mistake}) \end{cases}$$

Plugging into expected risk:

Risk under 0/1 Loss $$R(\alpha_i \mid \mathbf{x}) = \sum_{k \neq i} P(C_k \mid \mathbf{x}) = 1 - P(C_i \mid \mathbf{x})$$

Result

Minimizing risk under 0/1 loss = choosing the most probable class. This is why MAP (Maximum A Posteriori) classification is optimal under 0/1 loss.
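The equivalence can be checked numerically. The sketch below builds a 0/1 loss matrix, computes the general expected risk, and compares the minimum-risk action with the most probable class; the posterior values are hypothetical.

```python
def zero_one_loss(num_classes):
    """Loss matrix with 0 on the diagonal (correct) and 1 elsewhere (mistake)."""
    return [[0 if i == k else 1 for k in range(num_classes)]
            for i in range(num_classes)]


def expected_risks(loss, posterior):
    """R(alpha_i | x) = sum_k loss[i][k] * P(C_k | x)."""
    return [sum(l_ik * p_k for l_ik, p_k in zip(row, posterior)) for row in loss]


posterior = [0.2, 0.5, 0.3]  # hypothetical P(C_k | x)
risks = expected_risks(zero_one_loss(len(posterior)), posterior)

# Under 0/1 loss each risk is 1 - P(C_i | x) (up to floating-point rounding),
# so the arg min of the risks coincides with the arg max of the posterior.
min_risk_class = min(range(len(risks)), key=risks.__getitem__)
map_class = max(range(len(posterior)), key=posterior.__getitem__)
print(risks, min_risk_class, map_class)
```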

06 · Loss Functions

The Reject Option

Sometimes misclassification is so costly that we prefer to abstain — output "I don't know" — rather than risk a wrong decision. This is the reject option, common in medical diagnosis and autonomous systems.

Introduce a reject action $\alpha_{K+1}$ with fixed cost $\lambda_r$ (cost of abstaining), where $0 < \lambda_r < 1$.

Decision Rule with Reject

Classify if confident enough, reject otherwise. Formally, with 0/1 loss for misclassification: reject if the risk of even the best classification exceeds $\lambda_r$, that is, if $\min_i R(\alpha_i \mid \mathbf{x}) > \lambda_r$. Equivalently, reject if $\max_k P(C_k \mid \mathbf{x}) < 1 - \lambda_r$, i.e., we are not confident that any single class is correct.
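A minimal sketch of the rule with a reject option, assuming 0/1 loss for misclassification and a reject cost strictly between 0 and 1. The `REJECT` sentinel and the example posteriors are illustrative choices, not part of the lecture.

```python
REJECT = -1  # hypothetical sentinel meaning "abstain"


def classify_with_reject(posterior, reject_cost):
    """Return the most probable class, or REJECT if its posterior is below 1 - reject_cost."""
    if not 0 < reject_cost < 1:
        raise ValueError("reject_cost must lie strictly between 0 and 1")
    best = max(range(len(posterior)), key=posterior.__getitem__)
    # Risk of the best class under 0/1 loss is 1 - posterior[best];
    # abstaining is cheaper exactly when that risk exceeds reject_cost.
    if posterior[best] < 1 - reject_cost:
        return REJECT
    return best


print(classify_with_reject([0.55, 0.45], reject_cost=0.2))  # posteriors too close
print(classify_with_reject([0.90, 0.10], reject_cost=0.2))  # confident enough
```

With a reject cost of 0.2 the confidence threshold is 0.8, so the first input is rejected and the second is assigned to class 0. Lowering the reject cost widens the reject region.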

Equal losses + no reject

Single decision boundary at the crossing of class posteriors. Standard MAP classification.

Unequal losses

The boundary shifts toward the class that is cheaper to misclassify: the region of the class that is costly to miss grows, so the rule is biased away from expensive errors.

With reject option

Two thresholds bracket a reject region around the boundary where posteriors are close.

Cost-sensitive learning

In high-stakes domains (fraud, cancer), the loss matrix often matters more than raw accuracy.

07 · Decision Boundaries

Discriminant Functions

Instead of computing posteriors explicitly, we can define a discriminant function $g_i(\mathbf{x})$ for each class. Assign $\mathbf{x}$ to $C_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$.

Discriminant Regions $$\mathbf{x} \in \mathcal{R}_i \iff g_i(\mathbf{x}) = \max_k\; g_k(\mathbf{x})$$

Common choices for $g_i(\mathbf{x})$:

  • $g_i(\mathbf{x}) = P(C_i \mid \mathbf{x})$ — the posterior directly
  • $g_i(\mathbf{x}) = P(\mathbf{x} \mid C_i)\, P(C_i)$ — joint, avoiding the evidence term
  • $g_i(\mathbf{x}) = \log P(\mathbf{x} \mid C_i) + \log P(C_i)$ — log-joint, equal to the log-posterior up to the class-independent constant $-\log P(\mathbf{x})$ (numerically stable)
  • $g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$ — negative risk (minimizing risk = maximizing this)

Binary Case (K = 2): Log-Odds

For two classes, a single discriminant $g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$ suffices. The log-odds form is especially useful:

Log-Odds (Log Posterior Ratio) $$g(\mathbf{x}) = \log \frac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})} = \log \frac{P(\mathbf{x} \mid C_1)}{P(\mathbf{x} \mid C_2)} + \log \frac{P(C_1)}{P(C_2)}$$

Choose $C_1$ if $g(\mathbf{x}) > 0$, $C_2$ otherwise. The boundary $g(\mathbf{x}) = 0$ is the decision surface.
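The log-odds discriminant is easy to evaluate once the class likelihoods and priors are known. The following sketch uses hypothetical likelihood and prior values; the function names are illustrative only.

```python
import math


def log_odds(lik1, lik2, prior1, prior2):
    """g(x) = log[P(x|C1) / P(x|C2)] + log[P(C1) / P(C2)]."""
    return math.log(lik1 / lik2) + math.log(prior1 / prior2)


def decide(g):
    """Choose C1 if the log-odds are positive, C2 otherwise."""
    return "C1" if g > 0 else "C2"


# Hypothetical values: x is three times more likely under C1,
# but C1 is four times rarer a priori.
g = log_odds(lik1=0.3, lik2=0.1, prior1=0.2, prior2=0.8)
decision = decide(g)
print(g, decision)
```

The likelihood ratio of 3 favors $C_1$, but the prior ratio of 1/4 outweighs it: the posterior odds are 0.75, the log-odds are negative, and the rule chooses $C_2$.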

08 · Generalisation

Utility Theory

Losses can be replaced by utilities — the mirror image of loss. Instead of minimizing expected loss, we maximize expected utility.

Expected Utility of Action $\alpha_i$ $$EU(\alpha_i \mid \mathbf{x}) = \sum_{k=1}^{K} U_{ik}\, P(S_k \mid \mathbf{x})$$

where $U_{ik}$ is the utility (reward) of taking action $\alpha_i$ when the true state is $S_k$. The optimal action maximizes $EU$.

Loss vs. Utility

The two frameworks are equivalent: set $U_{ik} = -\lambda_{ik}$. Maximizing expected utility is the same as minimizing expected loss. The choice of framing depends on context — losses are natural for errors, utilities for rewards.
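As a quick check of this equivalence, the sketch below negates a loss matrix to obtain utilities and confirms that maximizing expected utility selects the same action as minimizing expected loss. The loss values and posteriors are hypothetical.

```python
def expected_values(matrix, posterior):
    """Row-wise expectation: sum_k matrix[i][k] * P(S_k | x) for each action i."""
    return [sum(m_ik * p_k for m_ik, p_k in zip(row, posterior)) for row in matrix]


def best_by_loss(loss, posterior):
    """Action index that minimizes expected loss."""
    risks = expected_values(loss, posterior)
    return min(range(len(risks)), key=risks.__getitem__)


def best_by_utility(utility, posterior):
    """Action index that maximizes expected utility."""
    eu = expected_values(utility, posterior)
    return max(range(len(eu)), key=eu.__getitem__)


# Hypothetical loss matrix and the matching utilities U_ik = -lambda_ik
loss = [[0, 10], [1, 0]]
utility = [[-l for l in row] for row in loss]

for posterior in ([0.8, 0.2], [0.95, 0.05], [0.5, 0.5]):
    print(posterior, best_by_loss(loss, posterior), best_by_utility(utility, posterior))
```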

09 · Unsupervised

Association Rules

Association learning discovers co-occurrence patterns in data — no labels required. The classic setting is market basket analysis: which products are bought together?

Association Rule

A rule $X \to Y$ states: people who buy/click/visit $X$ are also likely to buy/click/visit $Y$. The rule implies association, not causation. $X$ is the antecedent, $Y$ the consequent.

Correlation ≠ Causation

The famous example: diapers and baby food co-occur because of an unobserved latent variable — the presence of a baby. The association is real, but neither product causes the other to be purchased.

10 · Measures

Measuring Rule Quality

Three complementary measures characterize how good a rule $X \to Y$ is:

Support

$P(X \cap Y)$ — how frequent is the pattern? Rules must be seen enough times to be statistically meaningful.

Confidence

$P(Y \mid X)$ — given $X$, how often does $Y$ follow? The strength of the rule. Should be significantly larger than $P(Y)$.

Lift

$\frac{P(Y \mid X)}{P(Y)}$ — the degree of dependence. Lift $= 1$: independent. Lift $> 1$: $X$ makes $Y$ more likely. Lift $< 1$: $X$ makes $Y$ less likely.

The Three Measures $$\text{Support}(X \to Y) = P(X \cap Y)$$ $$\text{Confidence}(X \to Y) = P(Y \mid X) = \frac{P(X \cap Y)}{P(X)}$$ $$\text{Lift}(X \to Y) = \frac{P(Y \mid X)}{P(Y)} = \frac{P(X \cap Y)}{P(X)\,P(Y)}$$

What to Look For

A good rule has high support (seen frequently enough to be statistically reliable), high confidence (the association is strong), and lift significantly greater than 1 (the items are not independent — $X$ genuinely predicts $Y$).
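The three measures can be estimated from a list of transactions by counting. This is a minimal sketch: the five baskets are invented for illustration, and the code assumes the antecedent and consequent each occur at least once (otherwise the divisions would fail).

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, x, y):
    """P(Y | X): support of X and Y together divided by support of X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """P(Y | X) / P(Y): values above 1 mean X makes Y more likely."""
    return confidence(transactions, x, y) / support(transactions, y)


# Hypothetical market baskets
baskets = [
    {"diapers", "baby food", "milk"},
    {"diapers", "baby food"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"baby food", "milk"},
]

x, y = {"diapers"}, {"baby food"}
s = support(baskets, x | y)   # 2 of 5 baskets contain both items
c = confidence(baskets, x, y)  # 2 of the 3 diaper baskets also contain baby food
l = lift(baskets, x, y)
print(s, c, l)
```

In this toy data the rule diapers → baby food has support 0.4, confidence about 0.67, and lift about 1.11, so the association is only slightly stronger than independence would predict.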

11 · Algorithm

Apriori Algorithm

Finding all frequent item sets by brute force is exponential. The Apriori algorithm (Agrawal et al., 1996) uses a key insight to prune the search space efficiently.

The Apriori Property (Anti-Monotonicity)

If item set $\{X, Y\}$ is not frequent (support below threshold), then no superset $\{X, Y, Z\}$ can be frequent. This prunes entire branches of the search tree.

Algorithm Steps

  • Step 1 — Find all frequent 1-item sets (single items with support ≥ threshold)
  • Step 2 — From frequent $k$-item sets, generate all $(k+1)$-item supersets. Keep only those that are frequent
  • Step 3 — Repeat until no new frequent sets are found
  • Step 4 — Convert frequent sets to rules: for each frequent set, try moving items from antecedent to consequent. Keep rules with sufficient confidence
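The steps above can be sketched as follows. This is a simplified, unoptimized illustration rather than the published algorithm: candidates are formed by taking unions of frequent $k$-item sets, pruned using the Apriori property, and then counted by rescanning the transactions. The baskets, thresholds, and function names are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent item set (frozenset) to its support."""
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= basket for basket in transactions) / n

    # Step 1: frequent 1-item sets
    items = {item for basket in transactions for item in basket}
    current = {frozenset([i]) for i in items if supp(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:  # Step 3: repeat until no new frequent sets appear
        for itemset in current:
            frequent[itemset] = supp(itemset)
        # Step 2: build (k+1)-item candidates from pairs of frequent k-item sets
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Apriori property: drop candidates that have an infrequent k-item subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k))}
        current = {c for c in candidates if supp(c) >= min_support}
        k += 1
    return frequent


def derive_rules(frequent, min_confidence):
    """Step 4: split each frequent set into antecedent -> consequent rules."""
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                x = frozenset(antecedent)
                conf = s / frequent[x]  # every subset of a frequent set is frequent
                if conf >= min_confidence:
                    rules.append((x, itemset - x, conf))
    return rules


# Hypothetical market baskets
baskets = [
    {"diapers", "baby food", "milk"},
    {"diapers", "baby food"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"baby food", "milk"},
]
frequent_sets = apriori(baskets, min_support=0.4)
found_rules = derive_rules(frequent_sets, min_confidence=0.6)
for x, y, conf in found_rules:
    print(sorted(x), "->", sorted(y), round(conf, 2))
```

On this toy data the three-item candidate is never counted: it contains the pair diapers and milk, which is already known to be infrequent, so the pruning step discards it.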

Better Alternatives

Apriori requires multiple passes over the data. The FP-growth approach instead compresses the database into an FP-tree (Frequent Pattern Tree) and mines frequent sets from that tree, which avoids repeated scans and is typically much faster in practice.

Bayesian Perspective on Association

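From a probabilistic view, co-occurring events often point to a latent (hidden) variable. For the diaper and baby food association:

Latent Variable Decomposition $$P(\text{diaper}, \text{baby food}) = P(\text{diaper}, \text{baby food} \mid \text{baby}) \cdot P(\text{baby}) + \varepsilon$$

Here $\varepsilon = P(\text{diaper}, \text{baby food} \mid \text{no baby}) \cdot P(\text{no baby})$ is the contribution of households without a baby, assumed to be small. Plugging in $0.56$ for the conditional term and $0.3$ for $P(\text{baby})$ (the factors in the order written), the first term is $0.56 \times 0.3 = 0.168$, which matches the observed co-occurrence (support). The association is real, but its cause is the unobserved latent variable (presence of a baby in the household).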


Next Lecture — 27 Feb

Parametric Methods (Ch. 4). Now that we know how to make decisions using probabilities, we turn to the question of how to estimate those probabilities from training data — MLE, MAP, Bayes' estimator, and parametric classification.