Lecture 5 · Chapter 15 · 13 March

Hidden Markov Models

Until now we assumed data points were i.i.d. (independent and identically distributed). HMMs handle sequential data in which each observation depends on an underlying hidden state that evolves over time according to a Markov process.

Key reference: Rabiner (1989)
Builds on: Graphical Models (Lecture 4)
Date: 13 March

00 · Motivation

Why Sequences?

All previous models assumed data points were drawn independently and identically distributed (i.i.d.). Real-world data is often sequential — the present depends on the past.

🎙️

Temporal sequences

Speech: phonemes depend on the surrounding phonemes in a word (dictionary), and words depend on the surrounding words (syntax, semantics). Handwriting: pen movements follow smooth trajectories.

🧬

Spatial sequences

DNA: base pairs are not random — adjacent pairs are statistically correlated. Protein structures: amino acid sequences fold according to local dependencies.

The Key Shift

We move from modelling individual data points to modelling sequences of observations, where the probability of each observation depends on the hidden state at that time, and that state in turn depends on the state before it.

01 · Building Block

Discrete Markov Chains

A Markov chain is the simplest sequential model. States are directly observable and transitions depend only on the current state — not on history.

First-Order Markov Property

The future is conditionally independent of the past given the present: $P(q_{t+1} = S_j \mid q_t, q_{t-1}, \ldots) = P(q_{t+1} = S_j \mid q_t)$

Symbol	Meaning	Constraint
$N$	Number of states $S_1, \ldots, S_N$	—
$a_{ij}$	Transition probability $P(q_{t+1}=S_j \mid q_t=S_i)$	$a_{ij} \geq 0$, $\sum_j a_{ij} = 1$
$\pi_i$	Initial state probability $P(q_1=S_i)$	$\sum_i \pi_i = 1$
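To make the roles of $\pi_i$ and $a_{ij}$ concrete: under the first-order Markov property, the probability of a fully observed state sequence is the initial probability of its first state times the product of the transition probabilities along the path. The sketch below assumes a hypothetical two-state chain with made-up numbers, and states are encoded as zero-based indices.

```python
def sequence_probability(states, pi, A):
    """P(q_1, ..., q_T) = pi[q_1] * prod_t A[q_t][q_{t+1}] for a first-order chain."""
    prob = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= A[prev][nxt]
    return prob

# Hypothetical two-state chain (values chosen only for illustration).
pi = [0.8, 0.2]
A = [[0.6, 0.4],
     [0.5, 0.5]]

# Path S_1 -> S_1 -> S_2, i.e. indices 0, 0, 1: 0.8 * 0.6 * 0.4
p = sequence_probability([0, 0, 1], pi, A)
```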

Learning a Markov Chain

Given $K$ example sequences of length $T$, estimate the parameters by counting:

MLE for Transition Probabilities $$\hat{a}_{ij} = \frac{\text{number of transitions from } S_i \text{ to } S_j}{\text{total transitions from } S_i}$$
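The counting estimate can be sketched in a few lines. This is an illustration under assumptions rather than a reference implementation: states are zero-based integers, the toy sequences are invented, the initial probabilities are estimated as the fraction of sequences starting in each state, and a state that is never left gets a uniform row (one arbitrary way to avoid dividing by zero).

```python
from collections import Counter

def estimate_markov_chain(sequences, n_states):
    """Estimate (pi, A) from fully observed state sequences by relative frequencies."""
    start_counts = Counter(seq[0] for seq in sequences)
    trans_counts = Counter()
    for seq in sequences:
        trans_counts.update(zip(seq, seq[1:]))  # consecutive (from, to) pairs

    pi = [start_counts[i] / len(sequences) for i in range(n_states)]
    A = []
    for i in range(n_states):
        row_total = sum(trans_counts[(i, j)] for j in range(n_states))
        if row_total == 0:
            # Assumption: no outgoing transitions observed, fall back to uniform.
            A.append([1.0 / n_states] * n_states)
        else:
            A.append([trans_counts[(i, j)] / row_total for j in range(n_states)])
    return pi, A

# Toy data (made up): K = 3 sequences of length T = 4 over states 0 and 1.
sequences = [[0, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
pi_hat, A_hat = estimate_markov_chain(sequences, n_states=2)
```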

02 · Extension

Hidden Markov Models

In an HMM, the underlying states are not directly observable. Instead, each state produces an observable output ("emission") according to its own probability distribution. We only see the observations — the state sequence is hidden.

👻

Hidden layer

States $q_t \in \{S_1, \ldots, S_N\}$ follow a Markov chain — they transition according to $A$ but are never directly observed.

👁️

Visible layer

Observations $O_t \in \{v_1, \ldots, v_M\}$ are emitted by the hidden state at each time step, according to emission probabilities $B$.

Emission Probabilities $$b_j(m) \equiv P(O_t = v_m \mid q_t = S_j)$$

For the same observation sequence of length $T$, there are $N^T$ possible state sequences. This exponential growth is the fundamental challenge of HMMs.
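For reference, the probability of one particular state sequence $Q = q_1, \ldots, q_T$ together with the observations follows from the two layers above: a Markov chain over states, and one emission per state. Here $\lambda$ stands for the model parameters, and $\pi_{q_1}$, $a_{q_{t-1} q_t}$ and $b_{q_t}(O_t)$ are shorthand for the entries belonging to the states actually visited.

Joint Probability of a Path $$P(O, Q \mid \lambda) = \pi_{q_1}\, b_{q_1}(O_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(O_t)$$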

Circles = hidden states (transitions via A). Squares = visible observations (emitted via B, dashed lines).

03 · Model Definition

HMM Parameters $\lambda = (A, B, \Pi)$

Symbol	Name	Size	Meaning
$N$	State count	scalar	Number of hidden states
$M$	Observation count	scalar	Number of distinct observation symbols
$A$	Transition matrix	$N \times N$	$a_{ij} = P(q_{t+1}=S_j \mid q_t=S_i)$. Rows sum to 1.
$B$	Emission matrix	$N \times M$	$b_j(m) = P(O_t=v_m \mid q_t=S_j)$. Rows sum to 1.
$\Pi$	Initial vector	$N \times 1$	$\pi_i = P(q_1=S_i)$. Sums to 1.

Weather / Ice-Cream Example

Hidden states: weather $\{H\text{ot}, C\text{old}\}$. Observations: ice creams eaten $\{1, 2, 3\}$.

Example Parameterization $$\Pi = \begin{bmatrix}0.8\\0.2\end{bmatrix},\quad A = \begin{bmatrix}0.6 & 0.4\\0.5 & 0.5\end{bmatrix},\quad B = \begin{bmatrix}0.2 & 0.4 & 0.4\\0.5 & 0.4 & 0.1\end{bmatrix}$$

On hot days, eating 2 or 3 ice creams is equally likely. On cold days, eating just 1 is most likely. We never observe the weather directly — only the ice cream count.
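One way to get a feel for these parameters is to sample from the model. The sketch below assumes that rows and columns are ordered Hot, Cold and that the columns of $B$ correspond to 1, 2 and 3 ice creams; the function name `sample_hmm` and the seed are illustrative choices.

```python
import random

STATES = ["Hot", "Cold"]   # assumed row/column order of PI, A and B
SYMBOLS = [1, 2, 3]        # ice creams eaten, assumed column order of B
PI = [0.8, 0.2]
A = [[0.6, 0.4],
     [0.5, 0.5]]
B = [[0.2, 0.4, 0.4],
     [0.5, 0.4, 0.1]]

def sample_hmm(T, rng):
    """Draw a hidden state path of length T and the observations it emits."""
    hidden, observed = [], []
    state = rng.choices(range(len(STATES)), weights=PI)[0]
    for _ in range(T):
        hidden.append(STATES[state])
        observed.append(rng.choices(SYMBOLS, weights=B[state])[0])
        # Move to the next hidden state according to row `state` of A.
        state = rng.choices(range(len(STATES)), weights=A[state])[0]
    return hidden, observed

hidden, observed = sample_hmm(5, random.Random(0))
```

In a real application only `observed` would be available; `hidden` is what the decoding problem tries to recover.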

04 · Core Problems

The Three Basic Problems of HMMs

Every application of an HMM reduces to one or more of three canonical problems (Rabiner, 1989):

Problem 1

Evaluation

Given model $\lambda$ and an observation sequence $O$, compute $P(O \mid \lambda)$. How likely is this sequence under the model?

Forward Algorithm

Problem 2

Decoding

Given model $\lambda$ and $O$, find the most probable state sequence $Q^*$: $\arg\max_Q P(Q \mid O, \lambda)$.

Viterbi Algorithm

Problem 3

Learning

Given training sequences $\mathcal{X} = \{O_k\}_k$, find $\lambda^*$ that maximizes $P(\mathcal{X} \mid \lambda)$.

Baum-Welch (EM)

05 · Problem 1

Evaluation — The Forward Algorithm

Computing $P(O \mid \lambda)$ by summing over all $N^T$ possible state sequences is exponentially expensive. The forward algorithm solves this efficiently using dynamic programming.

Forward Variable

$\alpha_t(i)$ = probability of having observed $O_1, O_2, \ldots, O_t$ and being in state $S_i$ at time $t$.

Forward Algorithm $$\text{Initialization: } \alpha_1(i) = \pi_i\, b_i(O_1)$$ $$\text{Recursion: } \alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(O_{t+1})$$ $$\text{Termination: } P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$

The recursion says: the probability of being in state $j$ at time $t+1$ having seen $O_1 \ldots O_{t+1}$ is the sum over all possible previous states $i$ (weighted by $\alpha_t(i) \cdot a_{ij}$) multiplied by the emission probability $b_j(O_{t+1})$.

Complexity

Brute force: $O(N^T \cdot T)$. Forward algorithm: $O(N^2 T)$. A cost that is exponential in $T$ becomes linear in $T$ by reusing intermediate computations.
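The recursion can be checked against brute-force enumeration on a tiny example. The sketch below uses the weather / ice-cream parameters, assumes observations are given as zero-based symbol indices, and omits the numerical scaling that long sequences would need.

```python
from itertools import product

PI = [0.8, 0.2]
A = [[0.6, 0.4],
     [0.5, 0.5]]
B = [[0.2, 0.4, 0.4],
     [0.5, 0.4, 0.1]]

def forward(obs, pi, A, B):
    """Return the table alpha[t][i] and P(O | lambda)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]            # initialization
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])                          # recursion
    return alpha, sum(alpha[-1])                                   # termination

def brute_force(obs, pi, A, B):
    """Sum the joint probability over all N^T state paths (tiny T only)."""
    total = 0.0
    for path in product(range(len(pi)), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

obs = [2, 0, 2]  # ice-cream counts 3, 1, 3 as zero-based symbol indices
alpha, likelihood = forward(obs, PI, A, B)
reference = brute_force(obs, PI, A, B)
```

Both functions return the same value; the forward table does so with $N^2$ multiplications per time step instead of visiting every path.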

06 · Complement

The Backward Variable

Symmetric to the forward variable, the backward variable captures what happens after time $t$:

Backward Variable

$\beta_t(i)$ = probability of observing $O_{t+1}, O_{t+2}, \ldots, O_T$ given that we are in state $S_i$ at time $t$.

Backward Algorithm $$\text{Initialization: } \beta_T(i) = 1$$ $$\text{Recursion: } \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$$

Why Both?

Combining $\alpha$ and $\beta$ gives us the smoothed posterior, the probability of being in state $i$ at time $t$ given the entire observation sequence: $\gamma_t(i) = P(q_t = S_i \mid O, \lambda) = \alpha_t(i)\,\beta_t(i) / P(O \mid \lambda)$, where $P(O \mid \lambda) = \sum_j \alpha_t(j)\,\beta_t(j)$ for any $t$. This is essential for the Baum-Welch learning algorithm.
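A minimal sketch of the backward pass and the smoothed posterior, again on the weather / ice-cream parameters. It assumes zero-based symbol indices and no scaling, and it normalizes $\alpha_t(i)\,\beta_t(i)$ at each step by its sum over states.

```python
PI = [0.8, 0.2]
A = [[0.6, 0.4],
     [0.5, 0.5]]
B = [[0.2, 0.4, 0.4],
     [0.5, 0.4, 0.1]]

def forward(obs, pi, A, B):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    n = len(A)
    beta = [[1.0] * n]                       # beta_T(i) = 1
    for o in reversed(obs[1:]):              # o runs over O_T, ..., O_2
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def smoothed_posteriors(obs, pi, A, B):
    """gamma[t][i] = P(q_t = S_i | O, lambda), normalized per time step."""
    gamma = []
    for a_t, b_t in zip(forward(obs, pi, A, B), backward(obs, A, B)):
        unnorm = [a * b for a, b in zip(a_t, b_t)]
        z = sum(unnorm)                      # equals P(O | lambda) at every t
        gamma.append([u / z for u in unnorm])
    return gamma

obs = [2, 0, 2]
alpha, beta = forward(obs, PI, A, B), backward(obs, A, B)
gamma = smoothed_posteriors(obs, PI, A, B)
```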

07 · Problem 2

Decoding — Viterbi's Algorithm

We want the single most probable state sequence for a given observation sequence — not just the most probable state at each individual time step.

Why Not Greedy Per-Step?

Choosing the individually most likely state $q_t^* = \arg\max_i \gamma_t(i)$ at each step ignores the transition structure: the resulting sequence may have zero probability because it contains an impossible transition ($a_{ij} = 0$).
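A small constructed case shows the failure. The three-state model below is hypothetical and has a single observation symbol, so emissions play no role and the posteriors can be computed by enumerating all paths of length 2.

```python
from itertools import product

# Hypothetical model: state 0 can never follow itself (A[0][0] = 0).
PI = [0.4, 0.35, 0.25]
A = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
T = 2

def path_prob(path):
    p = PI[path[0]]
    for i, j in zip(path, path[1:]):
        p *= A[i][j]
    return p

paths = list(product(range(3), repeat=T))
total = sum(path_prob(p) for p in paths)

# gamma[t][i]: posterior probability of being in state i at step t.
gamma = [[sum(path_prob(p) for p in paths if p[t] == i) / total
          for i in range(3)] for t in range(T)]

greedy = tuple(max(range(3), key=lambda i: g[i]) for g in gamma)
best = max(paths, key=path_prob)
```

Here state 0 is the most likely state at both steps, so the per-step choice is the path (0, 0), which the model forbids. The most probable complete path is (1, 0).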

Viterbi Variable

$\delta_t(i)$ = maximum probability over all state paths $q_1, \ldots, q_{t-1}$ of ending in state $S_i$ at time $t$ having observed $O_1, \ldots, O_t$. Like $\alpha$, but taking $\max$ instead of $\sum$.

Viterbi Algorithm $$\delta_1(i) = \pi_i\, b_i(O_1), \quad \psi_1(i) = 0$$ $$\delta_t(j) = \max_i\; \delta_{t-1}(i)\, a_{ij}\, b_j(O_t), \quad \psi_t(j) = \arg\max_i\; \delta_{t-1}(i)\, a_{ij}$$ $$q_T^* = \arg\max_i\; \delta_T(i)$$ $$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1$$

$\delta_t(j)$ — best probability of reaching state $j$ at step $t$ after seeing $O_1\ldots O_t$
$\psi_t(j)$ — the backpointer: which state at $t-1$ led to the best path ending in $j$ at $t$
Termination finds the best final state, then backtracking reconstructs the full path
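The three steps above (recursion, termination, backtracking) can be sketched as follows. The sketch uses the weather / ice-cream parameters with states indexed 0 = Hot, 1 = Cold, assumes zero-based symbol indices, and works with raw probabilities; long sequences would usually call for log-probabilities.

```python
PI = [0.8, 0.2]
A = [[0.6, 0.4],
     [0.5, 0.5]]
B = [[0.2, 0.4, 0.4],
     [0.5, 0.4, 0.1]]

def viterbi(obs, pi, A, B):
    """Return the most probable state path and its joint probability with obs."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []                        # psi_t for t = 2, ..., T
    for o in obs[1:]:
        psi, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            psi.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        backpointers.append(psi)
        delta = new_delta

    last = max(range(n), key=lambda i: delta[i])   # termination
    path = [last]
    for psi in reversed(backpointers):             # backtracking
        path.append(psi[path[-1]])
    path.reverse()
    return path, delta[last]

obs = [2, 0, 2]  # ice-cream counts 3, 1, 3
path, prob = viterbi(obs, PI, A, B)
```

For this input the decoded path is Hot, Cold, Hot.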

POS Tagging

A classic application: Part-Of-Speech tagging. Hidden states are POS tags (noun, verb, adjective…); observations are words. Viterbi finds the most probable tag sequence for a sentence under the model. HMM taggers were a standard component of NLP pipelines before neural networks took over.

08 · Problem 3

Learning — Baum-Welch (EM)

Given a set of observation sequences, we want to find the parameters $\lambda^* = (A^*, B^*, \Pi^*)$ that maximize $P(\mathcal{X} \mid \lambda)$. There is no closed-form solution — we use Expectation Maximization.

Joint State-Transition Posterior

$\xi_t(i,j)$ = probability of being in state $S_i$ at time $t$ and transitioning to $S_j$ at time $t+1$, given the full observation sequence:

ξ (Xi) — Joint Posterior $$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$$

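Baum-Welch Re-Estimation Equations $$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad \hat{b}_j(m) = \frac{\sum_{t: O_t = v_m} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \qquad \hat{\pi}_i = \gamma_1(i)$$

These equations are written for a single observation sequence $O$. With several training sequences, the sums in each numerator and denominator also run over the sequences, and $\hat{\pi}_i$ is the average of $\gamma_1(i)$ across sequences.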

EM Intuition

E-step: run the forward and backward algorithms to compute $\gamma_t(i)$ and $\xi_t(i,j)$, the "soft" counts of how often each state and transition was used
M-step: update $A$, $B$, $\Pi$ using the re-estimation equations above, i.e. normalize the soft counts to get probabilities
Convergence: each iteration is guaranteed not to decrease $P(\mathcal{X} \mid \lambda)$. The procedure converges to a local optimum; no efficient method is known that guarantees the global optimum.
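One E-step and M-step can be sketched for a single short sequence. The code assumes the weather / ice-cream parameters as a starting point, zero-based symbol indices, an invented observation sequence, no scaling, and no handling of states whose soft counts are zero.

```python
def forward(obs, pi, A, B):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(obs, pi, A, B):
    """One EM iteration; also returns P(O | lambda) under the input parameters."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # E-step: soft state counts (gamma) and soft transition counts (xi).
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # M-step: normalize the soft counts.
    new_pi = list(gamma[0])
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_pi, new_A, new_B, likelihood

PI = [0.8, 0.2]
A = [[0.6, 0.4], [0.5, 0.5]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
obs = [2, 0, 2, 1, 0, 0, 2]   # made-up observation sequence

pi1, A1, B1, like0 = baum_welch_step(obs, PI, A, B)
_, _, _, like1 = baum_welch_step(obs, pi1, A1, B1)   # likelihood after one update
```

Comparing `like0` and `like1` is a simple sanity check: the second value should not be smaller than the first.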

Local Optima

Baum-Welch is sensitive to initialization and may converge to different local optima on different runs. In practice: run multiple random initializations and keep the best result.

09 · Connection

HMM as a Graphical Model

The HMM is a special case of the chain-structured graphical models from Lecture 4. Its DAG structure reveals exactly why the efficient algorithms work.

🔗

Head-to-tail chain (hidden states)

$q_1 \to q_2 \to q_3 \to \cdots$ forms a chain. Given $q_t$, all $q_{t'}$ for $t' < t$ are d-separated from all $q_{t''}$ for $t'' > t$. This is the Markov property in graphical model terms.

📡

Tail-to-tail (observations)

Each $O_t$ has $q_t$ as its only parent. Observations are conditionally independent of everything else given their generating state — this justifies the emission probability factorization.

Forward-Backward = Belief Propagation

The forward algorithm passes messages left-to-right ($\alpha$ variables). The backward algorithm passes messages right-to-left ($\beta$ variables). Together they are exactly belief propagation on the chain-structured graphical model of the HMM. Baum-Welch (EM) uses these messages in its E-step to compute expected counts, re-estimates the CPTs in its M-step, and repeats until convergence.

Next Lecture — 20 Mar

Rule-Based Learners & Decision Trees (Ch. 9). We move away from probabilistic models to symbolic, interpretable learners — decision trees, pruning strategies, and rule induction algorithms.