Probabilistic Graphical Models
Graphical models give us a language for representing complex probability distributions over many interdependent variables — encoding conditional independence structure visually, and enabling efficient inference and learning at scale.
What Are Graphical Models?
The graphical models introduced first are Bayesian networks: directed acyclic graphs (DAGs) whose nodes are random variables and whose directed edges represent direct probabilistic dependencies. Conditional probabilities annotate the edges.
Nodes
Each node is a random variable (a hypothesis). The label on a node $X$ describes the probability $P(X)$ — our degree of belief in the truth of $X$.
Edges
A directed edge $X \to Y$ represents a direct influence from $X$ on $Y$. The edge label is $P(Y \mid X)$ — the conditional probability of $Y$ given $X$.
Structure
The graph must be a DAG — directed and acyclic. This prevents circular reasoning and guarantees a valid factorization of the joint distribution.
Parameters
The conditional probability tables (CPTs): one per node, giving $P(X \mid \text{Parents}(X))$. Learning a graphical model means learning both structure (the DAG) and parameters (the CPTs).
Why Graphical Models?
- Compact representation: encode the joint distribution of $n$ binary variables without specifying all $2^n$ probabilities. Local structure breaks the problem into small conditional tables.
- Flexible inference — any variable can be treated as input (evidence) or output (query). No fixed input/output designation needed.
- Hidden variables — nodes whose values are never observed in data can be included. They model latent causes (e.g., "baby at home" explaining co-occurrence of diapers and baby food).
- Causal reasoning — directed edges suggest causal mechanisms, enabling both causal (forward) and diagnostic (backward) inference.
An edge $X \to Y$ does not always imply causality. Graphical models represent statistical dependencies. Causality requires additional assumptions beyond correlation structure.
Conditional Independence
The key concept behind graphical models is conditional independence — knowing one variable can make two others independent, or dependent, depending on graph structure.
Independence
$X$ and $Y$ are independent if knowing $X$ tells you nothing new about $Y$:
$P(X, Y) = P(X)\,P(Y)$
Conditional Independence
$X$ and $Y$ are conditionally independent given $Z$ if knowing $Z$ makes knowing $X$ irrelevant for predicting $Y$:
$P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$
equivalently: $P(X \mid Y, Z) = P(X \mid Z)$
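A small numerical check can make the two definitions concrete. The sketch below builds a joint over three binary variables from a common-cause factorization $P(Z)\,P(X \mid Z)\,P(Y \mid Z)$ and tests both conditions by enumeration. All probability values are made up for illustration.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
p_z = {0: 0.7, 1: 0.3}            # P(Z = z)
p_x_given_z = {0: 0.1, 1: 0.8}    # P(X = 1 | Z = z)
p_y_given_z = {0: 0.2, 1: 0.9}    # P(Y = 1 | Z = z)

def bern(p_one, value):
    """Probability that a binary variable with P(1) = p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one

# Joint P(X, Y, Z) = P(Z) P(X | Z) P(Y | Z)
joint = {
    (x, y, z): p_z[z] * bern(p_x_given_z[z], x) * bern(p_y_given_z[z], y)
    for x, y, z in product((0, 1), repeat=3)
}

def prob(**fixed):
    """Sum the joint over all entries that match the fixed values."""
    return sum(
        p for (x, y, z), p in joint.items()
        if all({"x": x, "y": y, "z": z}[k] == v for k, v in fixed.items())
    )

# Conditional independence: P(X, Y | Z) == P(X | Z) P(Y | Z) for every value.
cond_indep = all(
    abs(prob(x=x, y=y, z=z) / prob(z=z)
        - (prob(x=x, z=z) / prob(z=z)) * (prob(y=y, z=z) / prob(z=z))) < 1e-12
    for x, y, z in product((0, 1), repeat=3)
)

# Marginal independence would need P(X, Y) == P(X) P(Y); the gap is nonzero here.
marg_gap = prob(x=1, y=1) - prob(x=1) * prob(y=1)
print(cond_indep, round(marg_gap, 4))
```

With these numbers the conditional test passes while the marginal gap is clearly nonzero, so $X$ and $Y$ are dependent until $Z$ is known.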
There are three canonical patterns in a DAG that determine when conditional independence holds. Each arises from the direction of edges meeting at a node.
The Three Canonical Cases
Every path in a Bayesian network passes through one of three local configurations. Understanding how information flows (or is blocked) through each is the foundation of all inference in graphical models.
Case 1, head-to-tail ($X \to Y \to Z$): $X$ influences $Y$, which influences $Z$. This is a chain: information flows along the path.
When $Y$ is unobserved: $X$ and $Z$ are dependent — information flows freely through $Y$.
When $Y$ is observed: $X \perp Z \mid Y$. Knowing $Y$ completely accounts for $X$'s influence on $Z$. $Y$ blocks the path.
Case 2, tail-to-tail ($Y \leftarrow X \to Z$): $X$ is a common cause of $Y$ and $Z$. Even without a direct link, $Y$ and $Z$ are correlated through their shared parent $X$.
When $X$ is unobserved: $Y$ and $Z$ are dependent — observing $Y$ updates our belief about $X$, which then affects $Z$.
When $X$ is observed: $Y \perp Z \mid X$. Knowing $X$ screens off the influence. $X$ blocks the path.
Case 3, head-to-head ($X \to Z \leftarrow Y$): $X$ and $Y$ are independent causes of $Z$. This is a collider, and it behaves in the opposite way to Cases 1 and 2.
When $Z$ is unobserved: $X \perp Y$. The only path between them runs through $Z$, and an unobserved collider blocks it.
When $Z$ is observed (or any descendant of $Z$): $X$ and $Y$ become dependent — observing the common effect opens the path. This gives rise to explaining away.
Cases 1 and 2 block information when the middle node is observed. Case 3 blocks information when the middle node is not observed. This reversal is the most counter-intuitive property of graphical models — and the source of explaining away.
Causal vs. Diagnostic Inference
The sprinkler network is the canonical example for graphical model inference. Variables: Cloudy $C$, Sprinkler $S$, Rain $R$, and Wet grass $W$. Edges: $C \to S$, $C \to R$, $S \to W$, and $R \to W$.
Causal Inference (Top-Down)
Reasoning from causes to effects. Sum over all configurations of intermediate variables.
"If the sprinkler is on, what is $P(W)$?"
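$P(W \mid S) = P(W \mid R,S)\,P(R \mid S) + P(W \mid \lnot R,S)\,P(\lnot R \mid S)$
$= P(W \mid R,S)\,P(R) + P(W \mid \lnot R,S)\,P(\lnot R)$ (treating $R$ and $S$ as independent, i.e. leaving $C$ out of the model)
$= 0.95 \times 0.4 + 0.9 \times 0.6 = 0.92$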
Diagnostic Inference (Bottom-Up)
Reasoning from observations back to their causes. Apply Bayes' rule to invert the arc.
"If the grass is wet, what is $P(S)$?"
$P(S \mid W) = 0.35 > 0.20 = P(S)$
Observing the wet grass increases the probability that the sprinkler was on.
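The causal and diagnostic numbers above can be reproduced by direct enumeration. In this sketch, $P(R) = 0.4$, $P(S) = 0.2$, $P(W \mid R,S) = 0.95$ and $P(W \mid \lnot R,S) = 0.9$ come from the text. The two remaining CPT entries, $P(W \mid R,\lnot S) = 0.9$ and $P(W \mid \lnot R,\lnot S) = 0.1$, are assumptions chosen to be consistent with the quoted $P(S \mid W) = 0.35$. $R$ and $S$ are treated as independent (no $C$ node).

```python
from itertools import product

p_r, p_s = 0.4, 0.2              # P(R), P(S) as used in the text
p_w = {                          # P(W | R, S), keyed by (r, s)
    (True, True): 0.95,          # from the text
    (False, True): 0.90,         # from the text
    (True, False): 0.90,         # assumed for illustration
    (False, False): 0.10,        # assumed for illustration
}

def prior(p, value):
    """P(V = value) for a binary variable with P(V = True) = p."""
    return p if value else 1.0 - p

# Causal direction: P(W | S), summing over the intermediate variable R.
p_wet_given_s = sum(p_w[(r, True)] * prior(p_r, r) for r in (True, False))

# P(W), summing over both R and S.
p_wet = sum(p_w[(r, s)] * prior(p_r, r) * prior(p_s, s)
            for r, s in product((True, False), repeat=2))

# Diagnostic direction: Bayes' rule inverts the arc S -> W.
p_s_given_wet = p_wet_given_s * p_s / p_wet

print(round(p_wet_given_s, 2), round(p_wet, 2), round(p_s_given_wet, 2))
# 0.92 0.52 0.35
```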
Multi-Hop Inference
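Graphical models allow inference to propagate across the entire network. For example, computing $P(W \mid C)$ requires summing over all combinations of $R$ and $S$:
$P(W \mid C) = \sum_{R}\sum_{S} P(W \mid R, S)\,P(R, S \mid C) = \sum_{R}\sum_{S} P(W \mid R, S)\,P(R \mid C)\,P(S \mid C)$
The first step uses $W \perp C \mid R, S$; the last equality uses the fact that $R \perp S \mid C$ (tail-to-tail structure with $C$ observed, Case 2).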
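The double sum over $R$ and $S$ can be evaluated directly. The text does not give CPTs for the children of $C$, so $P(R \mid C) = 0.8$ and $P(S \mid C) = 0.1$ below are assumed values, as are the two $\lnot S$ rows of the wet-grass table. Treat this as a sketch of the computation, not as the network's actual parameters.

```python
from itertools import product

p_r_given_c = 0.8    # P(R | C), assumed
p_s_given_c = 0.1    # P(S | C), assumed
p_w = {              # P(W | R, S), keyed by (r, s)
    (True, True): 0.95, (False, True): 0.90,    # from the text
    (True, False): 0.90, (False, False): 0.10,  # assumed
}

def bern(p, value):
    """P(V = value) for a binary variable with P(V = True) = p."""
    return p if value else 1.0 - p

# P(W | C) = sum_R sum_S P(W | R, S) P(R | C) P(S | C)
p_w_given_c = sum(
    p_w[(r, s)] * bern(p_r_given_c, r) * bern(p_s_given_c, s)
    for r, s in product((True, False), repeat=2)
)
print(round(p_w_given_c, 2))  # 0.76
```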
The same model supports any query direction. Evidence can be clamped at any node; any other node becomes a query. This is what makes graphical models so powerful as generative models.
Explaining Away
Explaining away (also called Berkson's paradox) is the phenomenon where observing a common effect makes its independent causes negatively correlated — one explanation "explains away" the other.
With $C$ left out of the model, Rain $R$ and Sprinkler $S$ are marginally independent: $P(R,S) = P(R)\,P(S)$. But once we observe that the grass is wet ($W = 1$, the common effect), they become dependent. Knowing it rained reduces the probability that the sprinkler caused the wet grass:
$P(S \mid R, W) < P(S \mid W)$
Rain explains away the sprinkler as the cause of the wet grass. Two previously independent hypotheses compete to explain the observed evidence. This is a ubiquitous pattern in diagnosis, fault detection, and scientific reasoning.
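Explaining away can be checked by conditioning an explicit joint table. As in the earlier sprinkler numbers, $P(R) = 0.4$, $P(S) = 0.2$ and the two $S$-on rows of the wet-grass table are from the text; the two $S$-off rows are assumed values, so the exact posterior after observing rain is illustrative only.

```python
from itertools import product

p_r, p_s = 0.4, 0.2
p_w = {                          # P(W | R, S), keyed by (r, s)
    (True, True): 0.95, (False, True): 0.90,    # from the text
    (True, False): 0.90, (False, False): 0.10,  # assumed
}

def bern(p, value):
    return p if value else 1.0 - p

# Joint P(R, S, W) with R and S marginally independent.
joint = {
    (r, s, w): bern(p_r, r) * bern(p_s, s) * bern(p_w[(r, s)], w)
    for r, s, w in product((True, False), repeat=3)
}

def p_sprinkler(evidence):
    """P(S = True | evidence); evidence maps 'r' and/or 'w' to booleans."""
    def matches(r, w):
        return all({"r": r, "w": w}[k] == v for k, v in evidence.items())
    num = sum(p for (r, s, w), p in joint.items() if s and matches(r, w))
    den = sum(p for (r, s, w), p in joint.items() if matches(r, w))
    return num / den

print(round(p_sprinkler({}), 2))                      # 0.2
print(round(p_sprinkler({"w": True}), 2))             # 0.35
print(round(p_sprinkler({"w": True, "r": True}), 2))  # 0.21
```

Under these assumed values, wet grass raises the belief in the sprinkler and additionally learning that it rained pushes it back down to roughly the prior.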
Medical diagnosis: observing a symptom (wet grass) and learning one disease (rain) is present reduces the probability of another disease (sprinkler) being responsible. Fault detection: detecting a bug in one module reduces the probability of a bug elsewhere when a system failure is observed.
d-Separation
d-Separation generalizes the three canonical cases to arbitrary graph structures. It tells us, for any two nodes $A$ and $B$ and any conditioning set $C$, whether $A \perp B \mid C$.
A path from $A$ to $B$ is blocked by $C$ if it passes through a node where either (a) the path meets head-to-tail or tail-to-tail and that node is in $C$, or (b) the path meets head-to-head and neither that node nor any of its descendants is in $C$. If all paths are blocked, $A$ and $B$ are d-separated given $C$ — they are conditionally independent.
- Case 1 / 2 node in $C$ — observed node blocks the flow. Conditioning on a mediator or common cause separates the endpoints.
- Case 3 node not in $C$, no descendant in $C$ — unobserved collider blocks the flow. The two causes remain independent.
- Case 3 node in $C$ (or descendant in $C$) — observed collider opens the path. Explaining away activates: the two causes become dependent.
d-Separation is sound and complete for the independencies implied by the graph structure: every conditional independence that holds for all distributions factorizing over the DAG can be read off using d-separation, without computing any probabilities. It also guides causal inference: to identify a causal effect, we must block all non-causal paths using the right conditioning set.
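One way to test d-separation in code avoids enumerating paths. The sketch below uses an equivalent formulation (not the path-blocking rule as stated above): restrict the DAG to the query nodes, the conditioning set, and their ancestors, moralize it, delete the conditioning nodes, and check reachability. The function name `d_separated` and the `parents` dictionary layout are illustrative choices.

```python
def d_separated(parents, a, b, given):
    """True if a and b are d-separated by `given` in the DAG.

    parents: dict mapping each node to a list of its parents.
    """
    given = set(given)

    # 1. Keep only a, b, the conditioning set, and all of their ancestors.
    keep, stack = set(), [a, b, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, []))

    # 2. Moralize: connect each node to its parents and marry co-parents.
    nbrs = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, []))
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)

    # 3. Drop the conditioning nodes and test whether b is reachable from a.
    seen, stack = set(), [a]
    while stack:
        node = stack.pop()
        if node == b:
            return False
        if node in seen or node in given:
            continue
        seen.add(node)
        stack.extend(nbrs[node] - given)
    return True


sprinkler = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}
print(d_separated(sprinkler, "S", "R", {"C"}))        # True  (common cause observed)
print(d_separated(sprinkler, "S", "R", set()))        # False (common cause unobserved)
print(d_separated(sprinkler, "S", "R", {"C", "W"}))   # False (collider observed)
```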
Probabilistic Inference
Given a trained graphical model, the core task is answering probabilistic queries: what is the distribution over some set of variables, given observed evidence?
Query Variable
The variable(s) we want to compute a distribution over. E.g., $P(G \mid \cdots)$.
Evidence Variables
Variables whose values are observed (clamped). E.g., $A = a_1, D = d_1$.
Hidden Variables
All other variables — unobserved, must be marginalized (summed) out.
Naïve summation over all hidden variable configurations is exponential. Two efficient algorithms avoid this:
Variable Elimination
Eliminate hidden variables one at a time by summing them out, exploiting the factored structure of the joint. Efficient when the graph has small treewidth.
Belief Propagation
Pass local "messages" between neighboring nodes. Exact on trees; approximate on graphs with cycles (Loopy BP).
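Variable elimination is easiest to see on a chain. The sketch below uses a hypothetical four-node chain $A \to B \to C \to D$ with made-up CPTs, sums out one variable at a time, and compares the result with brute-force summation over the full joint.

```python
from itertools import product

# Hypothetical chain A -> B -> C -> D of binary variables.
p_a = [0.6, 0.4]                     # P(A)
p_b_a = [[0.7, 0.3], [0.2, 0.8]]     # p_b_a[a][b] = P(B = b | A = a)
p_c_b = [[0.9, 0.1], [0.4, 0.6]]     # P(C = c | B = b)
p_d_c = [[0.5, 0.5], [0.1, 0.9]]     # P(D = d | C = c)

def eliminate(message, cpt):
    """Sum out the parent: new[y] = sum_x message[x] * cpt[x][y]."""
    return [sum(message[x] * cpt[x][y] for x in range(2)) for y in range(2)]

# Variable elimination: sum out A, then B, then C (three small sums).
p_b = eliminate(p_a, p_b_a)
p_c = eliminate(p_b, p_c_b)
p_d = eliminate(p_c, p_d_c)

# Brute force over all 2^4 joint configurations, for comparison.
brute = [0.0, 0.0]
for a, b, c, d in product(range(2), repeat=4):
    brute[d] += p_a[a] * p_b_a[a][b] * p_c_b[b][c] * p_d_c[c][d]

print([round(v, 2) for v in p_d])  # [0.36, 0.64]
```

Each elimination step touches a table of size 4, so the cost grows linearly with chain length, while the brute-force loop grows as $2^n$.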
Belief Propagation
Belief Propagation (Pearl, 1988) is a message-passing algorithm that computes the marginal distribution of each node by having nodes exchange local messages with their neighbors.
Each node aggregates messages from all its neighbours and passes a summary of its beliefs onwards. On tree-structured graphs, once messages converge, every node's belief equals its exact marginal probability given all evidence.
How It Works
- 01. Initialization: fix evidence nodes to their observed values. All other nodes start with neutral (uniform) messages.
- 02. Message passing: each node $X$ computes a message to neighbour $Y$ by combining all incoming messages from its other neighbours, multiplying by its local conditional probability table (CPT), and marginalizing over $X$'s own state.
- 03. Iteration: repeat until messages converge (stop changing between rounds).
- 04. Belief calculation: each node's final belief is the product of all incoming messages, normalized. This is the marginal posterior $P(X \mid \text{evidence})$.
In a Chain
For a chain $E^+ - \cdots - X - \cdots - E^-$ where $E^+$ and $E^-$ are evidence at either end, information propagates in both directions and $E^+$ and $E^-$ are conditionally independent given $X$:
$P(X \mid E^+, E^-) \propto P(E^- \mid X)\,P(X \mid E^+)$
BP is guaranteed to give exact marginals on trees and polytrees in a single forward-backward pass. On graphs with cycles (loopy networks), it is approximate — but often remarkably accurate in practice.
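The chain case can be sketched as one forward and one backward sweep. The example below assumes a hypothetical five-node binary chain $X_0 \to \cdots \to X_4$ with a single shared transition table, evidence at both ends, and a query on the middle node. The names `pi` and `lam` for the two messages follow Pearl's usual notation and are not taken from the text above.

```python
from itertools import product

# Hypothetical chain X0 -> X1 -> X2 -> X3 -> X4, all binary, one shared CPT.
prior = [0.5, 0.5]                # P(X0)
T = [[0.9, 0.1], [0.3, 0.7]]      # T[i][j] = P(X_{k+1} = j | X_k = i)
e_plus, e_minus = 0, 1            # evidence: X0 = 0 (upstream), X4 = 1 (downstream)

# Forward (causal) message: pi[x] = P(X2 = x | X0 = e_plus)
pi = [1.0 if x == e_plus else 0.0 for x in range(2)]
for _ in range(2):                # propagate X0 -> X1 -> X2
    pi = [sum(pi[i] * T[i][j] for i in range(2)) for j in range(2)]

# Backward (diagnostic) message: lam[x] = P(X4 = e_minus | X2 = x)
lam = [1.0 if x == e_minus else 0.0 for x in range(2)]
for _ in range(2):                # propagate X4 -> X3 -> X2
    lam = [sum(T[i][j] * lam[j] for j in range(2)) for i in range(2)]

# Belief at X2: normalized product of the two incoming messages.
unnorm = [pi[x] * lam[x] for x in range(2)]
belief = [u / sum(unnorm) for u in unnorm]

# Brute-force posterior over X2 for comparison (sum out X1 and X3).
brute = [0.0, 0.0]
for x1, x2, x3 in product(range(2), repeat=3):
    brute[x2] += (prior[e_plus] * T[e_plus][x1] * T[x1][x2]
                  * T[x2][x3] * T[x3][e_minus])
brute = [b / sum(brute) for b in brute]

print([round(v, 4) for v in belief])  # [0.6176, 0.3824]
```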
Trees, Polytrees & Junction Trees
Trees
Each node has exactly one parent (except the root). BP is exact and runs in $O(n)$ with a single forward-backward pass. Every node receives evidence from both its children and its parent.
Polytrees
Nodes may have multiple parents, but there is only a single path between any two nodes (singly-connected). BP is still exact. Messages to a node are averaged over all parent configurations (marginalization).
General DAGs
Multiple paths between nodes (multiply-connected). BP is approximate. Convert to a junction tree for exact inference.
Junction Trees — Exact Inference on Any DAG
For multiply-connected graphs (those whose underlying undirected graph contains loops), we convert the original DAG into a junction tree (also called a clique tree) and then apply the polytree algorithm on this transformed structure.
- Moralization — marry all parents of common children (add edges between co-parents) and drop edge directions. This captures head-to-head dependencies.
- Triangulation — add fill-in edges to make the graph chordal (every cycle of length ≥ 4 has a chord).
- Clique tree construction — form a tree of maximal cliques with the running intersection property.
- BP on the clique tree — run standard BP; now exact because the clique tree has no cycles.
Junction tree inference is exponential in the treewidth of the graph: the size of the largest clique in the best possible triangulation, minus one. For graphs with low treewidth this is tractable; for dense graphs it becomes prohibitive, requiring approximate methods.
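Triangulation and clique size can be illustrated with node elimination on an undirected (already moralized) graph. The sketch below assumes the elimination order is given; choosing the order that minimizes the largest clique is NP-hard in general, so practical systems use heuristics. The function name `triangulate` is illustrative.

```python
from itertools import combinations

def triangulate(neighbors, order):
    """Eliminate nodes in `order`; return (fill_in_edges, largest_clique_size)."""
    nbrs = {n: set(v) for n, v in neighbors.items()}
    fill, largest = set(), 0
    for node in order:
        ns = nbrs[node]
        # The node and its remaining neighbours form a clique after fill-in.
        largest = max(largest, len(ns) + 1)
        for p, q in combinations(sorted(ns), 2):
            if q not in nbrs[p]:
                fill.add((p, q))          # new chord between two neighbours
                nbrs[p].add(q)
                nbrs[q].add(p)
        for p in ns:
            nbrs[p].discard(node)         # remove the eliminated node
        del nbrs[node]
    return fill, largest

# A 4-cycle A - B - C - D - A has no chord, so it is not chordal.
cycle = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
fill_edges, largest_clique = triangulate(cycle, ["A", "B", "C", "D"])
print(fill_edges, largest_clique)  # {('B', 'D')} 3
```

Here one fill-in edge is enough, and the largest clique has three nodes, so this ordering gives a width of two.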
Applications
Windows Troubleshooter
Microsoft used a BN to diagnose printer problems (Heckerman & Breese). Evidence is user symptoms; query variables are root causes. Diagnostic inference identifies the most likely faults.
Medical Diagnosis
BNs model symptoms, diseases, and test results. The Liver Disorders Network (Onisko et al.) infers disease probabilities from lab test evidence — a classic example of diagnostic inference.
Latent Variable Discovery
Market basket: instead of a direct link between diapers and baby food, a hidden node "baby at home" cleanly explains the association as a tail-to-tail (common cause) structure.
Naïve Bayes Classifier
Assume all features $x_j$ are conditionally independent given class $C$. Then $p(\mathbf{x} \mid C) = \prod_j p(x_j \mid C)$ — a BN with one parent ($C$) and $d$ child nodes (the features). Despite the "naïve" assumption, it often performs remarkably well.
Naïve Bayes in Detail
Each feature $x_j$ has its own conditional probability table given the class. For discrete features this is a lookup table; for continuous features we typically fit a Gaussian $p(x_j \mid C) = \mathcal{N}(\mu_{jC}, \sigma_{jC}^2)$.
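A Gaussian Naïve Bayes classifier fits in a few lines. The sketch below is a minimal illustration on a made-up two-feature dataset; the function names and the small variance floor (`1e-9`, added to avoid division by zero) are choices made here, not part of the model definition.

```python
import math
from collections import defaultdict

def fit(X, y):
    """Estimate class priors and per-feature Gaussian means and variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_gauss(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def predict(model, x):
    """Pick the class maximizing log P(C) + sum_j log p(x_j | C)."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gauss(v, m, s) for v, m, s in zip(x, means, variances))
    return max(model, key=score)

# Hypothetical training data: two well-separated classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [3.1, 0.5]))  # a b
```

Working in log space turns the product over features into a sum, which avoids numerical underflow when there are many features.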
Linear regression also fits naturally into the graphical model framework: weights $\mathbf{w}$ and noise $\varepsilon$ are nodes with their own prior densities; inputs $x$ are drawn from $P(x)$; output $r$ is the child. The posterior over $\mathbf{w}$ given data is obtained by diagnostic inference (Bayes' rule), yielding exactly the MAP/Bayesian regression solutions from Lecture 3.
Markov Random Fields
Not all dependencies are naturally directional. Image pixels, social networks, and physical systems have symmetric dependencies. Markov Random Fields (MRFs) model these with undirected graphs.
Directed BN
$X \to Y$: asymmetric. $P(Y \mid X)$ is specified. Arrows encode causal or generative direction. Edges carry conditional probability tables.
Undirected MRF
$X - Y$: symmetric. Joint defined via potential functions on cliques. No sense of parent/child. Edges encode compatibility.
$P(\mathbf{x}) = \frac{1}{Z}\prod_C \psi_C(\mathbf{x}_C)$
where $\psi_C(\mathbf{x}_C) \geq 0$ is the potential function over clique $C$ (capturing how "compatible" the configuration $\mathbf{x}_C$ is), and $Z = \sum_\mathbf{x} \prod_C \psi_C(\mathbf{x}_C)$ is the partition function (normalizer).
Independence in MRFs
In an undirected graph, $A$ and $B$ are conditionally independent given $C$ if removing all nodes in $C$ leaves $A$ and $B$ in disconnected components. There is no head-to-head asymmetry — blocking is purely topological.
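Because blocking in an MRF is purely topological, the independence test reduces to a graph search. The sketch below checks whether any path from one node to another avoids the conditioning set; the function name `separated` and the adjacency-list layout are illustrative.

```python
def separated(neighbors, a, b, given):
    """True if every path from a to b passes through a node in `given`."""
    blocked, seen, stack = set(given), set(), [a]
    while stack:
        node = stack.pop()
        if node == b:
            return False           # found a path that avoids `given`
        if node in seen or node in blocked:
            continue
        seen.add(node)
        stack.extend(neighbors[node])
    return True

# Hypothetical undirected chain A - B - C - D.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(separated(chain, "A", "D", {"C"}))   # True
print(separated(chain, "A", "D", set()))   # False
```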
Moralization: BN → MRF
Any Bayesian network can be converted to an MRF by moralization: connect all parents of each common child (making them "married"), then drop edge directions. The head-to-head explaining-away structure is captured by including both parents in the same clique potential.
Ising Model
The simplest MRF: binary nodes $\{-1, +1\}$ on a grid, each connected to its neighbors. The potential favors neighboring nodes agreeing. Used to model ferromagnetism, image segmentation, and spin systems.
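For a very small grid the Ising distribution can be computed exactly by enumeration. The sketch below assumes pairwise potentials of the common form $\psi(x_i, x_j) = \exp(J\,x_i x_j)$ with coupling $J = 1$ and no external field; both the functional form and the value of $J$ are assumptions made for illustration.

```python
import math
from itertools import product

J = 1.0   # coupling strength (assumed); J > 0 favors agreeing neighbors

# 2x2 grid with nodes 0..3 and its four nearest-neighbor edges.
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]

def unnormalized(x):
    """Product of pairwise potentials psi(x_i, x_j) = exp(J * x_i * x_j)."""
    return math.exp(J * sum(x[i] * x[j] for i, j in edges))

configs = list(product((-1, +1), repeat=4))
Z = sum(unnormalized(x) for x in configs)          # partition function
prob = {x: unnormalized(x) / Z for x in configs}

aligned = prob[(1, 1, 1, 1)]
checkerboard = prob[(1, -1, -1, 1)]
print(aligned > checkerboard)  # True
```

Enumeration needs $2^n$ terms, which is why the partition function becomes the bottleneck for realistic grid sizes.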
Factor Graphs
Factor graphs provide a unified representation that generalizes both directed and undirected graphical models. They make the factorization of the joint distribution explicit.
A bipartite graph with two types of nodes: variable nodes (circles) and factor nodes (squares). An edge connects a factor node to each variable it depends on. The joint is the product of all factor functions.
$P(\mathbf{x}) = \prod_s f_s(\mathbf{x}_s)$
where $f_s$ is a factor (potential function) operating on the subset $\mathbf{x}_s$ of variables connected to factor node $s$.
- BN as factor graph — each conditional probability table $P(X_i \mid \text{Parents}(X_i))$ becomes a factor node connected to $X_i$ and all its parents.
- MRF as factor graph — each clique potential $\psi_C$ becomes a factor node connected to all variables in the clique.
- BP on factor graphs — the sum-product algorithm runs naturally on factor graphs, unifying BP for BNs and MRFs under a single message-passing framework.
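A factor graph can be represented as a list of (scope, function) pairs. The sketch below encodes a hypothetical two-node network $A \to B$ as two factors with made-up CPT values and evaluates the joint as a product of factors; marginals are computed here by brute force rather than by sum-product message passing.

```python
from itertools import product

# Each factor is (tuple of variable names, function of those variables).
# The CPT values are hypothetical.
factors = [
    (("A",), lambda a: 0.3 if a else 0.7),                     # P(A)
    (("A", "B"), lambda a, b: (0.9 if b else 0.1) if a         # P(B | A)
                              else (0.2 if b else 0.8)),
]
variables = ["A", "B"]

def joint(assignment):
    """Product of all factors evaluated on a full assignment."""
    result = 1.0
    for scope, f in factors:
        result *= f(*(assignment[v] for v in scope))
    return result

# Because the factors are CPTs, the product is already normalized.
total = sum(joint(dict(zip(variables, vals)))
            for vals in product((False, True), repeat=len(variables)))

# Marginal P(B = True), summing out A.
p_b = sum(joint({"A": a, "B": True}) for a in (False, True))
print(round(total, 2), round(p_b, 2))  # 1.0 0.41
```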
Learning a Graphical Model
Learning a graphical model from data involves two separate problems that can be addressed jointly or sequentially:
Parameter Learning
Given a fixed graph structure, estimate the conditional probability tables (CPTs). For discrete variables with few parents: simple frequency counting (MLE). For continuous variables or parametric families: gradient-based optimization.
Structure Learning
Search over possible DAG structures to find the one that best fits the data, penalizing complexity. A state-space search over a score function combining goodness-of-fit and model complexity.
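For discrete variables, parameter learning by frequency counting is short enough to show in full. The sketch below estimates a CPT $P(W \mid R)$ for a single edge $R \to W$ from a small made-up dataset.

```python
from collections import Counter

# Hypothetical observations of (rain, wet) pairs.
data = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0), (1, 1)]

pair_counts = Counter(data)                      # count(r, w)
parent_counts = Counter(r for r, _ in data)      # count(r)

# MLE: P(W = w | R = r) = count(r, w) / count(r)
cpt = {(r, w): pair_counts[(r, w)] / parent_counts[r]
       for r in (0, 1) for w in (0, 1)}

print(cpt[(1, 1)], cpt[(0, 1)])  # 0.75 0.25
```

With sparse data some parent configurations may never occur, in which case the counts need smoothing (for example, adding a pseudo-count to every cell) before dividing.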
Structure Learning via BIC
The Bayesian Information Criterion (BIC) scores candidate graph structures, rewarding good fit and penalizing unnecessary complexity:
$\text{BIC} = \ln L - \frac{k}{2}\ln n$
where $L$ is the likelihood of the observed data configurations under the model, $k$ is the number of free parameters (which grows with the number of edges), and $n$ is the number of observations. Higher BIC is better.
Greedy Search Algorithm
- 01. Start with an empty graph (no edges).
- 02. Propose adding one edge that keeps the graph acyclic. Estimate the resulting CPTs from data using MLE.
- 03. Compute the likelihood: $\ln L = \sum_t \ln P(\text{datapoint}_t)$. If the predicted configuration probabilities are close to observed frequencies, $L$ is high.
- 04. Compute BIC. If BIC improved, accept the edge. Repeat until no single edge addition improves BIC.
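The greedy steps above can be sketched for binary variables as follows. This is a minimal illustration, assuming the score $\ln L - \frac{k}{2}\ln n$ with $k$ counted as one free parameter per parent configuration, a first-improvement acceptance rule, and a synthetic dataset; the function names are illustrative.

```python
import math
from collections import Counter
from itertools import permutations

def log_likelihood(data, parents):
    """Sum over nodes of the log-likelihood under MLE-estimated CPTs."""
    total = 0.0
    for node, ps in parents.items():
        joint = Counter((tuple(row[p] for p in ps), row[node]) for row in data)
        marg = Counter(tuple(row[p] for p in ps) for row in data)
        total += sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
    return total

def bic(data, parents):
    """ln L - (k / 2) ln n for binary variables; higher is better."""
    k = sum(2 ** len(ps) for ps in parents.values())
    return log_likelihood(data, parents) - 0.5 * k * math.log(len(data))

def creates_cycle(parents, src, dst):
    """Adding src -> dst closes a cycle if dst is already an ancestor of src."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_search(data, nodes):
    parents = {n: [] for n in nodes}          # start with an empty graph
    best = bic(data, parents)
    improved = True
    while improved:
        improved = False
        for src, dst in permutations(nodes, 2):
            if src in parents[dst] or creates_cycle(parents, src, dst):
                continue
            parents[dst].append(src)          # propose the edge src -> dst
            score = bic(data, parents)
            if score > best:
                best, improved = score, True  # accept
            else:
                parents[dst].pop()            # reject
    return parents, best

# Synthetic data: B mostly copies A, C is unrelated to both.
nodes = ["A", "B", "C"]
data = []
for i in range(40):
    a = i % 2
    b = a if i % 10 else 1 - a
    c = (i // 2) % 2
    data.append({"A": a, "B": b, "C": c})

structure, score = greedy_search(data, nodes)
print(structure)
```

On this dataset the search should keep a single edge between `A` and `B` and leave `C` isolated. The direction of that edge depends on the order in which candidates are proposed, since both orientations fit the data equally well.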
For Markov Random Fields, the partition function $Z$ appears in the likelihood, making exact gradient computation intractable for large models. Learning MRF parameters typically requires approximate methods: contrastive divergence, pseudolikelihood, or MCMC-based approaches.
Hidden Markov Models (Ch. 15). We apply graphical models to sequential data — sequences of observations where an underlying hidden state evolves over time. HMMs are a chain-structured graphical model with efficient exact inference via the forward-backward algorithm.