Lecture 4 · Chapter 14 · 06 March

Probabilistic Graphical Models

Graphical models give us a language for representing complex probability distributions over many interdependent variables — encoding conditional independence structure visually, and enabling efficient inference and learning at scale.

Also known as
Bayesian Networks · Belief Networks
Key references
Pearl (1988, 2000) · Jensen (1996)
00 · Introduction

What Are Graphical Models?

Definition

A directed graphical model (Bayesian network) is a directed acyclic graph (DAG) whose nodes are random variables and whose directed edges represent direct probabilistic dependencies. Conditional probabilities attached to the graph quantify those dependencies.

🔵

Nodes

Each node is a random variable (a hypothesis). The label on a node $X$ describes the probability $P(X)$ — our degree of belief in the truth of $X$.

➡️

Edges

A directed edge $X \to Y$ represents a direct influence from $X$ on $Y$. The edge label is $P(Y \mid X)$ — the conditional probability of $Y$ given $X$.

🌐

Structure

The graph must be a DAG — directed and acyclic. This prevents circular reasoning and guarantees a valid factorization of the joint distribution.

📐

Parameters

The conditional probability tables (CPTs): one per node, giving $P(\text{node} \mid \text{parents})$ for every configuration of its parents. Learning a graphical model means learning both structure (the DAG) and parameters (the CPTs).

Why Graphical Models?

  • Compact representation — encode the joint distribution of $n$ variables without specifying all $2^n$ probabilities. Local structure breaks the problem into small conditional tables.
  • Flexible inference — any variable can be treated as input (evidence) or output (query). No fixed input/output designation needed.
  • Hidden variables — nodes whose values are never observed in data can be included. They model latent causes (e.g., "baby at home" explaining co-occurrence of diapers and baby food).
  • Causal reasoning — directed edges suggest causal mechanisms, enabling both causal (forward) and diagnostic (backward) inference.
Important caveat

An edge $X \to Y$ does not always imply causality. Graphical models represent statistical dependencies. Causality requires additional assumptions beyond correlation structure.
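To make the compact-representation point concrete, here is a minimal sketch that counts free parameters for binary variables. It assumes a four-node network with edges $C \to S$, $C \to R$, $S \to W$, $R \to W$ (the sprinkler network used later in these notes); the function names are illustrative only.

```python
def full_joint_params(n):
    """Free parameters of an unrestricted joint over n binary variables."""
    return 2 ** n - 1

def bn_params(parents):
    """Free parameters of a binary Bayesian network.

    `parents` maps each node to the list of its parents. A node with k
    binary parents needs one probability per parent configuration: 2**k.
    """
    return sum(2 ** len(ps) for ps in parents.values())

# Assumed example structure: C -> S, C -> R, S -> W, R -> W
sprinkler = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}

print(full_joint_params(len(sprinkler)))  # 15
print(bn_params(sprinkler))               # 9
```

The gap widens quickly: the full joint grows as $2^n$, while the factored form grows with the number of nodes as long as each node has few parents.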

01 · Core Concept

Conditional Independence

The key concept behind graphical models is conditional independence — knowing one variable can make two others independent, or dependent, depending on graph structure.

Independence

$X$ and $Y$ are independent if knowing $X$ tells you nothing new about $Y$:

$P(X, Y) = P(X)\,P(Y)$

Conditional Independence

$X$ and $Y$ are conditionally independent given $Z$ if knowing $Z$ makes knowing $X$ irrelevant for predicting $Y$:

$P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$

equivalently: $P(X \mid Y, Z) = P(X \mid Z)$
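
Assuming $P(Y \mid Z) > 0$, the second form follows from the first by the definition of conditional probability:

$$P(X \mid Y, Z) = \frac{P(X, Y \mid Z)}{P(Y \mid Z)} = \frac{P(X \mid Z)\,P(Y \mid Z)}{P(Y \mid Z)} = P(X \mid Z)$$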

There are three canonical patterns in a DAG that determine when conditional independence holds. Each arises from the direction of edges meeting at a node.

02 · Canonical Structures

The Three Canonical Cases

Every path in a Bayesian network passes through one of three local configurations. Understanding how information flows (or is blocked) through each is the foundation of all inference in graphical models.

Case 1 Head-to-Tail  —  Chain
Diagram: $X \to Y \to Z$
$P(X,Y,Z) = P(X)\,P(Y \mid X)\,P(Z \mid Y)$

$X$ influences $Y$, which influences $Z$. This is a chain: information flows along the path.

When $Y$ is unobserved: $X$ and $Z$ are dependent — information flows freely through $Y$.

When $Y$ is observed: $X \perp Z \mid Y$. Knowing $Y$ completely accounts for $X$'s influence on $Z$. $Y$ blocks the path.

Case 2 Tail-to-Tail  —  Common Cause (Fork)
Diagram: $Y \leftarrow X \to Z$
$P(X,Y,Z) = P(X)\,P(Y \mid X)\,P(Z \mid X)$

$X$ is a common cause of $Y$ and $Z$. Even without a direct link, $Y$ and $Z$ are correlated through their shared parent $X$.

When $X$ is unobserved: $Y$ and $Z$ are dependent — observing $Y$ updates our belief about $X$, which then affects $Z$.

When $X$ is observed: $Y \perp Z \mid X$. Knowing $X$ screens off the influence. $X$ blocks the path.

Case 3 Head-to-Head  —  Common Effect (Collider / V-structure)
Diagram: $X \to Z \leftarrow Y$
$P(X,Y,Z) = P(X)\,P(Y)\,P(Z \mid X,Y)$

$X$ and $Y$ are independent causes of $Z$. This is a collider — the opposite behaviour of Cases 1 and 2.

When $Z$ is unobserved: $X \perp Y$. The only path between them runs through the collider $Z$, and an unobserved collider blocks that path.

When $Z$ is observed (or any descendant of $Z$): $X$ and $Y$ become dependent — observing the common effect opens the path. This gives rise to explaining away.

The Critical Asymmetry

Cases 1 and 2 block information when the middle node is observed. Case 3 blocks information when the middle node is not observed. This reversal is the most counter-intuitive property of graphical models — and the source of explaining away.
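
The three cases can be checked numerically. The sketch below builds a small joint table for each factorization and tests (conditional) independence by brute force. All probability values are made up for illustration, and the helper names are hypothetical.

```python
from itertools import product

def bern(p, v):
    """P(V = v) for a binary variable with P(V = 1) = p."""
    return p if v == 1 else 1.0 - p

def joint_table(factor):
    """Enumerate a joint over three binary variables from a factor function."""
    return {xyz: factor(*xyz) for xyz in product([0, 1], repeat=3)}

def marginal(joint, keep):
    """Sum out every position not listed in `keep` (a tuple of indices)."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def independent(joint, a, b, given=()):
    """Check P(a, b | given) == P(a | given) P(b | given) for all values."""
    p_abg = marginal(joint, (a, b) + given)
    p_ag = marginal(joint, (a,) + given)
    p_bg = marginal(joint, (b,) + given)
    p_g = marginal(joint, given)
    for key, p in p_abg.items():
        va, vb, vg = key[0], key[1], key[2:]
        # Cross-multiplied form avoids dividing by P(given).
        if abs(p * p_g[vg] - p_ag[(va,) + vg] * p_bg[(vb,) + vg]) > 1e-12:
            return False
    return True

# Variable order in every table is (X, Y, Z), i.e. indices (0, 1, 2).
chain = joint_table(lambda x, y, z:      # X -> Y -> Z
    bern(0.3, x) * bern(0.9 if x else 0.2, y) * bern(0.8 if y else 0.1, z))
fork = joint_table(lambda x, y, z:       # Y <- X -> Z
    bern(0.3, x) * bern(0.9 if x else 0.2, y) * bern(0.8 if x else 0.1, z))
p_z = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.95}
collider = joint_table(lambda x, y, z:   # X -> Z <- Y
    bern(0.3, x) * bern(0.6, y) * bern(p_z[(x, y)], z))

print(independent(chain, 0, 2), independent(chain, 0, 2, (1,)))        # False True
print(independent(fork, 1, 2), independent(fork, 1, 2, (0,)))          # False True
print(independent(collider, 0, 1), independent(collider, 0, 1, (2,)))  # True False
```

The last line shows the reversal: the collider is the only case where conditioning on the middle node creates a dependence rather than removing one.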

03 · Running Example

Causal vs. Diagnostic Inference

The sprinkler network is the canonical example for graphical model inference. Variables: Cloudy $C$, Sprinkler $S$, Rain $R$, and Wet grass $W$. Edges: $C \to S$, $C \to R$, $S \to W$, and $R \to W$.

⬇️

Causal Inference (Top-Down)

Reasoning from causes to effects. Sum over all configurations of intermediate variables.

"If the sprinkler is on, what is $P(W)$?"

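$P(W \mid S) = P(W \mid R,S)\,P(R) + P(W \mid \lnot R,S)\,P(\lnot R)$

$= 0.95 \times 0.4 + 0.9 \times 0.6 = 0.92$

This uses $P(R \mid S) = P(R) = 0.4$, which holds when $R$ and $S$ are marginally independent, as in the version of the network without the Cloudy node.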

⬆️

Diagnostic Inference (Bottom-Up)

Reasoning from observations back to their causes. Apply Bayes' rule to invert the arc.

"If the grass is wet, what is $P(S)$?"

$P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W)} = 0.35 > 0.20 = P(S)$

Observing the wet grass increases the probability that the sprinkler was on.
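
The two calculations above can be reproduced in a few lines. This is a sketch under stated assumptions: $P(R) = 0.4$, $P(S) = 0.2$ and the two $S$-on entries of the wet-grass table come from the worked example, while the two $S$-off entries ($0.90$ and $0.10$) are assumed values, chosen because they reproduce the quoted $P(S \mid W) = 0.35$.

```python
# Priors from the worked example; R and S are treated as independent causes.
P_R, P_S = 0.4, 0.2

# P(W=1 | R, S), keyed by (r, s). The S=True entries appear in the text;
# the S=False entries are ASSUMED values, not given in these notes.
P_W = {(True, True): 0.95, (False, True): 0.90,
       (True, False): 0.90, (False, False): 0.10}

def prior(value, p):
    """P(V = value) for a binary variable with P(V = True) = p."""
    return p if value else 1.0 - p

def p_w_given_s(s):
    """Causal direction: sum out R to get P(W=1 | S=s)."""
    return sum(P_W[(r, s)] * prior(r, P_R) for r in (True, False))

# Marginal P(W=1), needed as the denominator of Bayes' rule.
p_w = sum(p_w_given_s(s) * prior(s, P_S) for s in (True, False))

# Diagnostic direction: invert the arc with Bayes' rule.
p_s_given_w = p_w_given_s(True) * P_S / p_w

print(round(p_w_given_s(True), 2))  # 0.92
print(round(p_s_given_w, 2))        # 0.35
```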

Multi-Hop Inference

Graphical models allow inference to propagate across the entire network. For example, computing $P(W \mid C)$ requires summing over all combinations of $R$ and $S$:

Causal Inference Across Multiple Hops

$$P(W \mid C) = \sum_{R,S} P(W \mid R,S)\,P(R,S \mid C) = \sum_{R,S} P(W \mid R,S)\,P(R \mid C)\,P(S \mid C)$$

The last equality uses the fact that $R \perp S \mid C$ (tail-to-tail structure with $C$ observed — Case 2).
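
As a sketch of the multi-hop sum, the snippet below evaluates $P(W \mid C)$ by enumerating $R$ and $S$. The values for $P(R \mid C)$ and $P(S \mid C)$ are not given in these notes and are assumed purely for illustration, as are the two $S$-off entries of the wet-grass table.

```python
from itertools import product

# ASSUMED illustrative values (not stated in these notes).
P_R_GIVEN_C = 0.8   # P(R=1 | C=1)
P_S_GIVEN_C = 0.1   # P(S=1 | C=1)

# P(W=1 | R, S), keyed by (r, s). The S=True entries match the worked
# example; the S=False entries are assumed.
P_W = {(True, True): 0.95, (False, True): 0.90,
       (True, False): 0.90, (False, False): 0.10}

def bern(p, value):
    """P(V = value) for a binary variable with P(V = True) = p."""
    return p if value else 1.0 - p

# Sum over all four (R, S) configurations, using R independent of S given C.
p_w_given_c = sum(
    P_W[(r, s)] * bern(P_R_GIVEN_C, r) * bern(P_S_GIVEN_C, s)
    for r, s in product((True, False), repeat=2)
)

print(round(p_w_given_c, 2))  # 0.76
```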

No Fixed Input/Output

The same model supports any query direction. Evidence can be clamped at any node; any other node becomes a query. This is what makes graphical models so powerful as generative models.

04 · Collider Inference

Explaining Away

Explaining away (closely related to Berkson's paradox) is the phenomenon where observing a common effect makes its otherwise independent causes dependent, typically negatively correlated: one explanation "explains away" the other.

The Sprinkler Example

Taken on their own (ignoring the common parent Cloudy), Rain $R$ and Sprinkler $S$ are marginally independent: $P(R,S) = P(R)\,P(S)$. But once we observe that the grass is wet ($W = 1$, the common effect), they become dependent. Knowing it rained reduces the probability that the sprinkler caused the wet grass:

Explaining Away: Conditional Dependence After Observing Effect

$$P(S \mid W) = 0.35 \quad \xrightarrow{\text{also observe } R} \quad P(S \mid W, R) < P(S \mid W)$$

Rain explains away the sprinkler as the cause of the wet grass. Two previously independent hypotheses compete to explain the observed evidence. This is a ubiquitous pattern in diagnosis, fault detection, and scientific reasoning.
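
A brute-force enumeration makes the effect visible. This sketch treats $R$ and $S$ as independent causes with $P(R) = 0.4$ and $P(S) = 0.2$; the two $S$-off entries of the wet-grass table are assumed values (they reproduce $P(S \mid W) = 0.35$), and `posterior_s` is a hypothetical helper name.

```python
from itertools import product

P_R, P_S = 0.4, 0.2
# P(W=1 | R, S), keyed by (r, s); the S=False entries are ASSUMED.
P_W = {(True, True): 0.95, (False, True): 0.90,
       (True, False): 0.90, (False, False): 0.10}

def joint(r, s, w):
    """P(R=r, S=s, W=w) for the collider R -> W <- S."""
    pw = P_W[(r, s)]
    return ((P_R if r else 1 - P_R)
            * (P_S if s else 1 - P_S)
            * (pw if w else 1 - pw))

def posterior_s(**evidence):
    """P(S=1 | evidence) by enumeration; evidence keys are 'r' and 'w'."""
    num = den = 0.0
    for r, s, w in product((True, False), repeat=3):
        values = {"r": r, "s": s, "w": w}
        if any(values[k] != v for k, v in evidence.items()):
            continue  # configuration contradicts the evidence
        p = joint(r, s, w)
        den += p
        if s:
            num += p
    return num / den

print(round(posterior_s(), 2))                 # 0.2  (prior)
print(round(posterior_s(r=True), 2))           # 0.2  (rain alone says nothing)
print(round(posterior_s(w=True), 2))           # 0.35
print(round(posterior_s(w=True, r=True), 2))   # 0.21 (rain explains away)
```

Rain carries no information about the sprinkler until the wet grass is observed; after that, learning that it rained lowers the sprinkler posterior.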

Real-World Instances

Medical diagnosis: observing a symptom (wet grass) and learning one disease (rain) is present reduces the probability of another disease (sprinkler) being responsible. Fault detection: detecting a bug in one module reduces the probability of a bug elsewhere when a system failure is observed.

05 · General Rule

d-Separation

d-Separation generalizes the three canonical cases to arbitrary graph structures. It tells us, for any two nodes $A$ and $B$ and any conditioning set $C$, whether $A \perp B \mid C$.

Definition

A path from $A$ to $B$ is blocked by $C$ if it passes through a node where either (a) the path meets head-to-tail or tail-to-tail and that node is in $C$, or (b) the path meets head-to-head and neither that node nor any of its descendants is in $C$. If all paths are blocked, $A$ and $B$ are d-separated given $C$ — they are conditionally independent.

  • Case 1 / 2 node in $C$ — observed node blocks the flow. Conditioning on a mediator or common cause separates the endpoints.
  • Case 3 node not in $C$, no descendant in $C$ — unobserved collider blocks the flow. The two causes remain independent.
  • Case 3 node in $C$ (or descendant in $C$) — observed collider opens the path. Explaining away activates: the two causes become dependent.
Why This Matters

d-Separation is the complete characterization of conditional independence in a Bayesian network. Every independence statement that holds in the graph can be read off using d-separation — without computing any probabilities. It also guides causal inference: to identify a causal effect, we must block all non-causal paths using the right conditioning set.
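
One way to test d-separation in code is sketched below. Rather than enumerating paths, it uses an equivalent criterion: restrict the DAG to the ancestors of the nodes involved, moralize that subgraph, delete the conditioning nodes, and check whether the two nodes are still connected. The function names and the dictionary encoding of the graph are assumptions made for this example.

```python
def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def d_separated(parents, a, b, given=()):
    """True if single nodes a and b are d-separated by the set `given`."""
    given = set(given)
    keep = ancestors(parents, {a, b} | given)

    # Moralize the ancestral subgraph: connect each node to its parents
    # and connect co-parents of the same child to each other.
    nbrs = {n: set() for n in keep}
    for child in keep:
        ps = list(parents[child])
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)

    # Search from a to b without passing through conditioning nodes.
    seen, stack = set(), [a]
    while stack:
        n = stack.pop()
        if n == b:
            return False
        if n in seen or n in given:
            continue
        seen.add(n)
        stack.extend(nbrs[n])
    return True

# Sprinkler network: C -> S, C -> R, S -> W, R -> W
sprinkler = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}

print(d_separated(sprinkler, "S", "R"))              # False: fork via C is open
print(d_separated(sprinkler, "S", "R", {"C"}))       # True: C blocks the fork
print(d_separated(sprinkler, "S", "R", {"C", "W"}))  # False: collider W opened
```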

06 · Queries

Probabilistic Inference

Given a trained graphical model, the core task is answering probabilistic queries: what is the distribution over some set of variables, given observed evidence?

Query Variable

The variable(s) we want to compute a distribution over. E.g., $P(G \mid \cdots)$.

👁️

Evidence Variables

Variables whose values are observed (clamped). E.g., $A = a_1, D = d_1$.

👻

Hidden Variables

All other variables — unobserved, must be marginalized (summed) out.

General Query

$$P(G \mid A = a_1,\, D = d_1) = \frac{\sum_{B,C,E,F} P(a_1,B,C,d_1,E,F,G)}{\sum_{B,C,E,F,G} P(a_1,B,C,d_1,E,F,G)}$$

Naïve summation over all hidden variable configurations is exponential in the number of hidden variables. Two algorithms exploit the graph structure to avoid this where possible:

Variable Elimination

Eliminate hidden variables one at a time by summing them out, exploiting the factored structure of the joint. Efficient when the graph has small treewidth.

Belief Propagation

Pass local "messages" between neighboring nodes. Exact on trees; approximate on graphs with cycles (Loopy BP).
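
As a minimal sketch of variable elimination, consider a binary chain $A \to B \to C \to D$ (a hypothetical example with made-up CPT values). Summing out one variable at a time needs only small local sums, and the result matches naive enumeration of the full joint.

```python
from itertools import product

# Made-up CPTs for a binary chain A -> B -> C -> D.
p_a = [0.6, 0.4]                    # P(A)
p_b_a = [[0.7, 0.3], [0.2, 0.8]]    # P(B | A): row = A, column = B
p_c_b = [[0.9, 0.1], [0.4, 0.6]]    # P(C | B)
p_d_c = [[0.5, 0.5], [0.1, 0.9]]    # P(D | C)

def eliminate(message, cpt):
    """Sum out the parent: new[c] = sum_p message[p] * cpt[p][c]."""
    return [sum(message[p] * cpt[p][c] for p in range(2)) for c in range(2)]

def marginal_d_ve():
    """P(D) by eliminating A, then B, then C."""
    m = eliminate(p_a, p_b_a)    # P(B)
    m = eliminate(m, p_c_b)      # P(C)
    return eliminate(m, p_d_c)   # P(D)

def marginal_d_naive():
    """P(D) by summing the full joint over all 2**4 configurations."""
    out = [0.0, 0.0]
    for a, b, c, d in product(range(2), repeat=4):
        out[d] += p_a[a] * p_b_a[a][b] * p_c_b[b][c] * p_d_c[c][d]
    return out

print(marginal_d_ve())
print(marginal_d_naive())
```

On a chain of length $n$ the elimination cost grows linearly in $n$, whereas the naive sum grows as $2^n$.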

07 · Algorithm

Belief Propagation

Belief Propagation (Pearl, 1988) is a message-passing algorithm that computes the marginal distribution of each node by having nodes exchange local messages with their neighbors.

Core Idea

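Each node aggregates messages from all its neighbours and passes a summary of its beliefs onwards. On tree-structured graphs, once messages have converged every node's belief equals its exact marginal probability given all evidence; on graphs with loops the converged beliefs are only approximations.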

How It Works

  1. Initialization: fix evidence nodes to their observed values. All other nodes start with neutral (uniform) messages.
  2. Message passing: each node $X$ computes a message to neighbour $Y$ by combining all incoming messages from its other neighbours, multiplying by its local conditional probability table (CPT), and marginalizing over $X$'s own state.
  3. Iteration: repeat until messages converge (stop changing between rounds).
  4. Belief calculation: each node's final belief is the product of all incoming messages, normalized. This is the marginal posterior $P(X \mid \text{evidence})$.

In a Chain

For a chain $E^+ - \cdots - X - \cdots - E^-$ where $E^+$ and $E^-$ are evidence at either end, information propagates in both directions and $E^+$ and $E^-$ are conditionally independent given $X$:

Belief at an Internal Node (Chain)

$$P(X \mid E^+, E^-) \propto P(E^+ \mid X)\,P(E^- \mid X)\,P(X)$$
Exact on Trees

BP is guaranteed to give exact marginals on trees and polytrees in a single forward-backward pass. On graphs with cycles (loopy networks), it is approximate — but often remarkably accurate in practice.
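
The chain case can be sketched directly: one forward message from the evidence at the front, one backward message from the evidence at the end, and their normalized product is the belief. The example assumes a five-node binary chain with a shared transition table `T`; all numbers are made up, and the brute-force function is there only as a cross-check.

```python
from itertools import product

prior = [0.5, 0.5]              # P(X_1), made up
T = [[0.8, 0.2], [0.3, 0.7]]    # T[i][j] = P(X_{k+1}=j | X_k=i), made up
N = 5                           # chain length

def chain_belief(k, first, last):
    """P(X_k | X_1=first, X_N=last) from two messages (k is 1-indexed)."""
    # Forward message: P(X_k | X_1 = first). X_1 is observed, so the
    # prior on X_1 is replaced by an indicator on its observed value.
    fwd = [1.0 if v == first else 0.0 for v in range(2)]
    for _ in range(k - 1):
        fwd = [sum(fwd[i] * T[i][j] for i in range(2)) for j in range(2)]
    # Backward message: P(X_N = last | X_k).
    bwd = [1.0 if v == last else 0.0 for v in range(2)]
    for _ in range(N - k):
        bwd = [sum(T[i][j] * bwd[j] for j in range(2)) for i in range(2)]
    belief = [fwd[v] * bwd[v] for v in range(2)]
    z = sum(belief)
    return [b / z for b in belief]

def chain_belief_brute(k, first, last):
    """Same posterior by enumerating every configuration of the chain."""
    out = [0.0, 0.0]
    for xs in product(range(2), repeat=N):
        if xs[0] != first or xs[-1] != last:
            continue
        p = prior[xs[0]]
        for i in range(N - 1):
            p *= T[xs[i]][xs[i + 1]]
        out[xs[k - 1]] += p
    z = sum(out)
    return [o / z for o in out]

print(chain_belief(3, first=1, last=0))
print(chain_belief_brute(3, first=1, last=0))
```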

08 · Network Topologies

Trees, Polytrees & Junction Trees

🌲

Trees

Each node has exactly one parent (except the root). BP is exact and runs in $O(n)$ with a single forward-backward pass. Every node receives evidence from both its children and its parent.

🌳

Polytrees

Nodes may have multiple parents, but there is only a single path between any two nodes (singly-connected). BP is still exact. The message a node receives from its parents is summed over all joint parent configurations (marginalization).

🕸️

General DAGs

Multiple paths between nodes (multiply-connected). BP is approximate. Convert to a junction tree for exact inference.

Junction Trees — Exact Inference on Any DAG

For multiply-connected DAGs (those whose underlying undirected graph contains loops), we convert the original DAG into a junction tree (also called a clique tree) and then run message passing on this transformed structure.

  • Moralization — marry all parents of common children (add edges between co-parents) and drop edge directions. This captures head-to-head dependencies.
  • Triangulation — add fill-in edges to make the graph chordal (every cycle of length ≥ 4 has a chord).
  • Clique tree construction — form a tree of maximal cliques with the running intersection property.
  • BP on the clique tree — run standard BP; now exact because the clique tree has no cycles.
Complexity

Junction tree inference is exponential in the size of the largest clique of the triangulated graph. With the best possible triangulation, that size minus one is the treewidth of the graph. For graphs with low treewidth this is tractable; for dense graphs it becomes prohibitive, requiring approximate methods.
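
The first step of the pipeline, moralization, is short enough to sketch. The example below applies it to the sprinkler network; the `moralize` helper and the dictionary encoding are assumptions made for this illustration, and the later steps (triangulation, clique tree construction) are not shown.

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edge set of the moral graph of a DAG.

    `parents` maps each node to the list of its parents.
    """
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))   # drop the edge direction
        for p, q in combinations(ps, 2):
            edges.add(frozenset((p, q)))       # marry co-parents
    return edges

# Sprinkler network: C -> S, C -> R, S -> W, R -> W
sprinkler = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}
moral = moralize(sprinkler)

for edge in sorted(tuple(sorted(e)) for e in moral):
    print(edge)
```

Here the only added edge is $S - R$, marrying the two parents of $W$. The resulting graph is already chordal, with maximal cliques $\{C, S, R\}$ and $\{S, R, W\}$.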

09 · Use Cases

Applications

🖥️

Windows Troubleshooter

Microsoft used a BN to diagnose printer problems (Heckerman & Breese). Evidence is user symptoms; query variables are root causes. Diagnostic inference identifies most likely faults.

🏥

Medical Diagnosis

BNs model symptoms, diseases, and test results. The Liver Disorders Network (Onisko et al.) infers disease probabilities from lab test evidence — a classic example of diagnostic inference.

🛒

Latent Variable Discovery

Market basket: instead of a direct link between diapers and baby food, a hidden node "baby at home" cleanly explains the association as a tail-to-tail (common cause) structure.

🧠

Naïve Bayes Classifier

Assume all features $x_j$ are conditionally independent given class $C$. Then $p(\mathbf{x} \mid C) = \prod_j p(x_j \mid C)$ — a BN with one parent ($C$) and $d$ child nodes (the features). Despite the "naïve" assumption, it often performs remarkably well.

Naïve Bayes in Detail

Naïve Bayes: Conditional Independence of Features

$$p(\mathbf{x} \mid C) = \prod_{j=1}^{d} p(x_j \mid C)$$

$$P(C \mid \mathbf{x}) \propto P(C) \prod_{j=1}^{d} p(x_j \mid C)$$

Each feature $x_j$ has its own conditional distribution given the class. For discrete features this is a lookup table (a CPT); for continuous features we typically fit a Gaussian $p(x_j \mid C) = \mathcal{N}(\mu_{jC}, \sigma_{jC}^2)$.
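
For discrete features, the whole classifier reduces to counting. The sketch below is one possible implementation, not a reference one: it adds Laplace smoothing (`alpha`), which the notes do not mention, to avoid zero probabilities, and the toy weather data is invented.

```python
from collections import Counter, defaultdict
import math

def fit_naive_bayes(X, y, alpha=1.0):
    """Count-based estimates of P(C) and P(x_j | C).

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    class_counts = Counter(y)
    feat_counts = defaultdict(Counter)   # (j, c) -> counts of feature values
    values = defaultdict(set)            # j -> set of values seen for feature j
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            feat_counts[(j, c)][v] += 1
            values[j].add(v)
    return class_counts, feat_counts, values, alpha

def predict_proba(model, xs):
    """Posterior P(C | x), proportional to P(C) * prod_j P(x_j | C)."""
    class_counts, feat_counts, values, alpha = model
    n = sum(class_counts.values())
    log_scores = {}
    for c, nc in class_counts.items():
        s = math.log(nc / n)
        for j, v in enumerate(xs):
            num = feat_counts[(j, c)][v] + alpha
            den = nc + alpha * len(values[j])
            s += math.log(num / den)
        log_scores[c] = s
    # Normalize in a numerically stable way.
    m = max(log_scores.values())
    exp = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}

# Invented toy data: (outlook, temperature) -> play
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
y = ["no", "no", "yes", "yes"]

model = fit_naive_bayes(X, y)
probs = predict_proba(model, ("rainy", "cool"))
print(probs)
```

Working in log space keeps the product over features from underflowing when $d$ is large.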

Linear Regression as a Graphical Model

Linear regression also fits naturally into the graphical model framework: weights $\mathbf{w}$ and noise $\varepsilon$ are nodes with their own prior densities; inputs $x$ are drawn from $P(x)$; output $r$ is the child. The posterior over $\mathbf{w}$ given data is obtained by diagnostic inference (Bayes' rule), yielding exactly the MAP/Bayesian regression solutions from Lecture 3.

10 · Undirected Models

Markov Random Fields

Not all dependencies are naturally directional. Image pixels, social networks, and physical systems have symmetric dependencies. Markov Random Fields (MRFs) model these with undirected graphs.

Directed BN

$X \to Y$: asymmetric. $P(Y \mid X)$ is specified. Arrows encode causal or generative direction. Edges carry conditional probability tables.

Undirected MRF

$X - Y$: symmetric. Joint defined via potential functions on cliques. No sense of parent/child. Edges encode compatibility.

MRF Joint Distribution

$$P(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C)$$

where $\psi_C(\mathbf{x}_C) \geq 0$ is the potential function over clique $C$ (capturing how "compatible" the configuration $\mathbf{x}_C$ is), and $Z = \sum_\mathbf{x} \prod_C \psi_C(\mathbf{x}_C)$ is the partition function (normalizer).

Independence in MRFs

In an undirected graph, $A$ and $B$ are conditionally independent given $C$ if removing all nodes in $C$ leaves $A$ and $B$ in disconnected components. There is no head-to-head asymmetry — blocking is purely topological.

Moralization: BN → MRF

Any Bayesian network can be converted to an MRF by moralization: connect all parents of each common child (making them "married"), then drop edge directions. The head-to-head explaining-away structure is captured by including both parents in the same clique potential.

Ising Model

The simplest MRF: binary nodes $\{-1, +1\}$ on a grid, each connected to its neighbors. The potential favors neighboring nodes agreeing. Used to model ferromagnetism, image segmentation, and spin systems.
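
A tiny Ising model is small enough to normalize by brute force, which makes the role of the partition function concrete. The sketch assumes a $2 \times 2$ grid, pairwise potentials of the form $\psi(x_i, x_j) = \exp(J\,x_i x_j)$, and a coupling $J = 1$; these are illustrative choices, not values from the lecture.

```python
from itertools import product
import math

J = 1.0  # coupling strength (assumed); J > 0 favours agreeing neighbours

# 2x2 grid, nodes numbered row by row:  0 1
#                                       2 3
EDGES = [(0, 1), (2, 3), (0, 2), (1, 3)]

def unnormalized(x):
    """Product of pairwise potentials exp(J * x_i * x_j) over all edges."""
    return math.exp(sum(J * x[i] * x[j] for i, j in EDGES))

states = list(product((-1, 1), repeat=4))
Z = sum(unnormalized(x) for x in states)   # partition function

def prob(x):
    """Normalized probability of a full configuration."""
    return unnormalized(x) / Z

print(prob((1, 1, 1, 1)))     # all spins aligned: most probable
print(prob((1, -1, -1, 1)))   # checkerboard: least probable
```

The sum defining `Z` has $2^n$ terms, which is why exact normalization is only feasible for very small grids.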

11 · Unified Representation

Factor Graphs

Factor graphs provide a unified representation that generalizes both directed and undirected graphical models. They make the factorization of the joint distribution explicit.

Factor Graph

A bipartite graph with two types of nodes: variable nodes (circles) and factor nodes (squares). An edge connects a factor node to each variable it depends on. The joint is the product of all factor functions.

Factor Graph Factorization

$$P(\mathbf{x}) \propto \prod_s f_s(\mathbf{x}_s)$$

where $f_s$ is a factor (potential function) operating on the subset $\mathbf{x}_s$ of variables connected to factor node $s$.

  • BN as factor graph — each conditional probability table $P(X_i \mid \text{Parents}(X_i))$ becomes a factor node connected to $X_i$ and all its parents.
  • MRF as factor graph — each clique potential $\psi_C$ becomes a factor node connected to all variables in the clique.
  • BP on factor graphs — the sum-product algorithm runs naturally on factor graphs, unifying BP for BNs and MRFs under a single message-passing framework.
12 · Learning

Learning a Graphical Model

Learning a graphical model from data involves two separate problems that can be addressed jointly or sequentially:

📊

Parameter Learning

Given a fixed graph structure, estimate the conditional probability tables (CPTs). For discrete variables with few parents: simple frequency counting (MLE). For continuous variables or parametric families: gradient-based optimization.

🔍

Structure Learning

Search over possible DAG structures to find the one that best fits the data, penalizing complexity. This is a state-space search guided by a score function that combines goodness-of-fit and model complexity.

Structure Learning via BIC

The Bayesian Information Criterion (BIC) scores candidate graph structures, penalizing both poor fit and unnecessary complexity:

BIC Score

$$\text{BIC} = \log L - \frac{k}{2}\log n$$

where $L$ is the likelihood of the observed data configurations under the model, $k$ is the number of free parameters (it grows with the number of edges, since every extra parent multiplies the size of a node's CPT), and $n$ is the number of observations. Higher BIC is better.

Greedy Search Algorithm

  1. Start with an empty graph (no edges).
  2. Propose adding one edge that keeps the graph acyclic. Estimate the resulting CPTs from data using MLE.
  3. Compute the log-likelihood: $\log L = \sum_t \log P(\text{datapoint}_t)$. If the predicted configuration probabilities are close to observed frequencies, $L$ is high.
  4. Compute BIC. If BIC improved, accept the edge. Repeat until no single edge addition improves BIC.
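
A single step of this search can be sketched for two binary variables: score the empty graph against the graph with one edge $X \to Y$ and keep whichever has the higher BIC. The data sets are invented, and the function names are hypothetical.

```python
import math
from collections import Counter

def log_lik_empty(data):
    """Empty graph: P(X) P(Y), parameters fitted by counting (MLE)."""
    n = len(data)
    cx = Counter(x for x, _ in data)
    cy = Counter(y for _, y in data)
    return sum(math.log(cx[x] / n) + math.log(cy[y] / n) for x, y in data)

def log_lik_edge(data):
    """Graph X -> Y: P(X) P(Y | X), parameters fitted by counting (MLE)."""
    n = len(data)
    cx = Counter(x for x, _ in data)
    cxy = Counter(data)
    return sum(math.log(cx[x] / n) + math.log(cxy[(x, y)] / cx[x])
               for x, y in data)

def bic(log_lik, k, n):
    """BIC = log L - (k / 2) log n; higher is better."""
    return log_lik - 0.5 * k * math.log(n)

def edge_improves_bic(data):
    n = len(data)
    empty = bic(log_lik_empty(data), k=2, n=n)  # P(X=1), P(Y=1)
    edge = bic(log_lik_edge(data), k=3, n=n)    # P(X=1), P(Y=1|X=0), P(Y=1|X=1)
    return edge > empty

# Invented data: strongly dependent pairs vs. perfectly balanced pairs.
dependent = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(1, 1)] * 40
balanced = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

print(edge_improves_bic(dependent))  # True: better fit outweighs the penalty
print(edge_improves_bic(balanced))   # False: same fit, extra parameter penalized
```

A full greedy search would repeat this comparison over every candidate edge that keeps the graph acyclic and stop when no addition raises the score.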
Learning MRFs Is Harder

For Markov Random Fields, the partition function $Z$ appears in the likelihood, making exact gradient computation intractable for large models. Learning MRF parameters typically requires approximate methods: contrastive divergence, pseudolikelihood, or MCMC-based approaches.


Next Lecture — 13 Mar

Hidden Markov Models (Ch. 15). We apply graphical models to sequential data — sequences of observations where an underlying hidden state evolves over time. HMMs are a chain-structured graphical model with efficient exact inference via the forward-backward algorithm.