Hidden Markov Models
Until now, data points were assumed to be i.i.d. (independent and identically distributed). HMMs handle sequential data in which each observation depends on an underlying hidden state that evolves over time according to a Markov process.
Why Sequences?
All previous models assumed data points were drawn independently and identically distributed (i.i.d.). Real-world data is often sequential — the present depends on the past.
Temporal sequences
Speech: phonemes depend on the surrounding phonemes in a word (dictionary), words depend on surrounding words (syntax, semantics). Handwriting: pen movements follow smooth trajectories.
Spatial sequences
DNA: base pairs are not random — adjacent pairs are statistically correlated. Protein structures: amino acid sequences fold according to local dependencies.
We move from modelling individual data points to modelling sequences of observations, where the probability of each observation depends on the underlying hidden state at that time, and that state in turn depends on what came before it in the sequence.
Discrete Markov Chains
A Markov chain is the simplest sequential model. States are directly observable and transitions depend only on the current state — not on history.
The future is conditionally independent of the past given the present: $P(q_{t+1} = S_j \mid q_t, q_{t-1}, \ldots) = P(q_{t+1} = S_j \mid q_t)$
| Symbol | Meaning | Constraint |
|---|---|---|
| $N$ | Number of states $S_1, \ldots, S_N$ | — |
| $a_{ij}$ | Transition probability $P(q_{t+1}=S_j \mid q_t=S_i)$ | $a_{ij} \geq 0$, $\sum_j a_{ij} = 1$ |
| $\pi_i$ | Initial state probability $P(q_1=S_i)$ | $\sum_i \pi_i = 1$ |
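As a small illustration of how these parameters are used, the sketch below computes the probability of one fully observed state sequence as $\pi_{q_1} \prod_t a_{q_t q_{t+1}}$. The two-state chain and its numbers are made up for the example, and `chain_probability` is a hypothetical helper name.

```python
def chain_probability(states, pi, A):
    """P(q_1, ..., q_T) = pi[q_1] * product of A[q_t][q_{t+1}] for a first-order chain."""
    prob = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= A[prev][nxt]
    return prob


# Hypothetical two-state chain; the values are illustrative assumptions.
pi = [0.8, 0.2]
A = [[0.6, 0.4],   # transitions out of state 0
     [0.5, 0.5]]   # transitions out of state 1

p = chain_probability([0, 0, 1], pi, A)   # 0.8 * 0.6 * 0.4
```

Each row of `A` sums to 1, matching the constraint column in the table.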
Learning a Markov Chain
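Given $K$ example state sequences of length $T$, estimate the parameters by counting relative frequencies: $\hat{\pi}_i = \frac{\#\{\text{sequences starting in } S_i\}}{K}$ and $\hat{a}_{ij} = \frac{\#\{\text{transitions from } S_i \text{ to } S_j\}}{\#\{\text{transitions out of } S_i\}}$.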
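The counting estimate can be sketched in a few lines. This is a minimal illustration, assuming states are encoded as integers `0..N-1`; `fit_markov_chain` and the toy sequences are hypothetical, and the uniform fallback for states with no outgoing transitions is an arbitrary choice, not something prescribed by the text.

```python
def fit_markov_chain(sequences, n_states):
    """Relative-frequency estimates of (pi, A) from fully observed state sequences."""
    start_counts = [0] * n_states
    trans_counts = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        start_counts[seq[0]] += 1
        for prev, nxt in zip(seq, seq[1:]):
            trans_counts[prev][nxt] += 1

    pi = [count / len(sequences) for count in start_counts]
    A = []
    for row in trans_counts:
        total = sum(row)
        # Assumption: a state never seen with an outgoing transition gets a uniform row.
        A.append([c / total if total else 1.0 / n_states for c in row])
    return pi, A


# Toy data: K = 3 sequences of length T = 4 over two states.
sequences = [[0, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
pi_hat, A_hat = fit_markov_chain(sequences, n_states=2)
```

In the toy data, two of the three sequences start in state 0, and state 0 is left four times (twice to itself, twice to state 1), so `pi_hat[0]` is 2/3 and the first row of `A_hat` is `[0.5, 0.5]`.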
Hidden Markov Models
In an HMM, the underlying states are not directly observable. Instead, each state produces an observable output ("emission") according to its own probability distribution. We only see the observations — the state sequence is hidden.
Hidden layer
States $q_t \in \{S_1, \ldots, S_N\}$ follow a Markov chain — they transition according to $A$ but are never directly observed.
Visible layer
Observations $O_t \in \{v_1, \ldots, v_M\}$ are emitted by the hidden state at each time step, according to emission probabilities $B$.
For the same observation sequence, there are exponentially many possible state sequences — the fundamental challenge of HMMs.
Circles = hidden states (transitions via A). Squares = visible observations (emitted via B, dashed lines).
HMM Parameters $\lambda = (A, B, \Pi)$
| Symbol | Name | Size | Meaning |
|---|---|---|---|
| $N$ | State count | scalar | Number of hidden states |
| $M$ | Observation count | scalar | Number of distinct observation symbols |
| $A$ | Transition matrix | $N \times N$ | $a_{ij} = P(q_{t+1}=S_j \mid q_t=S_i)$. Rows sum to 1. |
| $B$ | Emission matrix | $N \times M$ | $b_j(m) = P(O_t=v_m \mid q_t=S_j)$. Rows sum to 1. |
| $\Pi$ | Initial vector | $N \times 1$ | $\pi_i = P(q_1=S_i)$. Sums to 1. |
Weather / Ice-Cream Example
Hidden states: weather $\{H\text{ot}, C\text{old}\}$. Observations: ice creams eaten $\{1, 2, 3\}$.
On hot days, eating 2 or 3 ice creams is equally likely. On cold days, eating just 1 is most likely. We never observe the weather directly — only the ice cream count.
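To make the generative story concrete, here is a sketch that writes down one possible $\lambda = (A, B, \Pi)$ for this example and samples from it. The numeric values are illustrative assumptions chosen to match the description (2 and 3 equally likely when hot, 1 most likely when cold); they are not taken from the text. `sample_hmm` is a hypothetical helper.

```python
import random

STATES = ["Hot", "Cold"]
SYMBOLS = [1, 2, 3]            # ice creams eaten

# Illustrative parameter values (assumed, not from the source).
pi = [0.8, 0.2]
A = [[0.6, 0.4],               # Hot  -> Hot, Cold
     [0.5, 0.5]]               # Cold -> Hot, Cold
B = [[0.2, 0.4, 0.4],          # Hot:  2 and 3 equally likely
     [0.5, 0.4, 0.1]]          # Cold: 1 most likely


def sample_hmm(pi, A, B, length, rng):
    """Draw (hidden state indices, observed symbols) from the generative process."""
    states, observations = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        symbol_index = rng.choices(range(len(B[state])), weights=B[state])[0]
        observations.append(SYMBOLS[symbol_index])
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, observations


hidden, seen = sample_hmm(pi, A, B, length=5, rng=random.Random(0))
```

Only `seen` would be available to an observer; `hidden` is what the algorithms in the following sections try to reason about.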
The Three Basic Problems of HMMs
Every application of an HMM reduces to one or more of three canonical problems (Rabiner, 1989):
Evaluation
Given model $\lambda$ and an observation sequence $O$, compute $P(O \mid \lambda)$. How likely is this sequence under the model?
Solved by the Forward Algorithm.
Decoding
Given model $\lambda$ and $O$, find the most probable state sequence $Q^* = \arg\max_Q P(Q \mid O, \lambda)$.
Solved by the Viterbi Algorithm.
Learning
Given training sequences $\mathcal{X} = \{O_k\}_k$, find $\lambda^*$ that maximizes $P(\mathcal{X} \mid \lambda)$.
Solved by Baum-Welch (EM).
Evaluation — The Forward Algorithm
Computing $P(O \mid \lambda)$ by summing over all $N^T$ possible state sequences is exponentially expensive. The forward algorithm solves this efficiently using dynamic programming.
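For reference, the exponential baseline can be written directly as an enumeration over all $N^T$ paths. This is a sketch for tiny inputs only; the ice-cream parameter values are illustrative assumptions (not from the text), observations are encoded as indices (0 for one ice cream, 1 for two, 2 for three), and `brute_force_likelihood` is a hypothetical name.

```python
from itertools import product


def brute_force_likelihood(obs, pi, A, B):
    """P(O | lambda) as the sum of P(O, Q | lambda) over every state sequence Q."""
    n_states = len(pi)
    total = 0.0
    for path in product(range(n_states), repeat=len(obs)):
        prob = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            prob *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += prob
    return total


# Illustrative parameters (assumed): state 0 = Hot, state 1 = Cold.
pi = [0.8, 0.2]
A = [[0.6, 0.4], [0.5, 0.5]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]

likelihood = brute_force_likelihood([2, 0, 2], pi, A, B)   # observations 3, 1, 3
```

With two states and three observations this visits only 8 paths, but the count grows as $N^T$, which is why the enumeration is impractical for real sequences.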
$\alpha_t(i)$ = probability of having observed $O_1, O_2, \ldots, O_t$ and being in state $S_i$ at time $t$.
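- Initialization: $\alpha_1(i) = \pi_i\, b_i(O_1)$
- Recursion: $\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big]\, b_j(O_{t+1})$
- Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
The recursion says: the probability of being in state $j$ at time $t+1$ having seen $O_1 \ldots O_{t+1}$ is the sum over all possible previous states $i$ (weighted by $\alpha_t(i) \cdot a_{ij}$) multiplied by the emission probability $b_j(O_{t+1})$.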
Brute force: $O(N^T \cdot T)$. Forward algorithm: $O(N^2 T)$ — exponential becomes polynomial by reusing intermediate computations.
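The forward recursion described above can be sketched as follows. The parameter values are illustrative assumptions for the ice-cream example (not from the text), observations are symbol indices, and `forward` is a hypothetical function name.

```python
def forward(obs, pi, A, B):
    """Return P(O | lambda) using the forward variables alpha_t(i)."""
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Recursion: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(O_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)


# Illustrative parameters (assumed): state 0 = Hot, state 1 = Cold.
pi = [0.8, 0.2]
A = [[0.6, 0.4], [0.5, 0.5]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]

likelihood = forward([2, 0, 2], pi, A, B)   # observations 3, 1, 3
```

Each time step costs $N^2$ multiplications, giving the $O(N^2 T)$ total. For long sequences the raw products underflow floating-point range, so practical implementations usually rescale $\alpha$ at each step or work in log space; this sketch does neither.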
The Backward Variable
Symmetric to the forward variable, the backward variable captures what happens after time $t$:
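$\beta_t(i)$ = probability of observing $O_{t+1}, O_{t+2}, \ldots, O_T$ given that we are in state $S_i$ at time $t$. It is computed right-to-left, starting from $\beta_T(i) = 1$: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$.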
Combining $\alpha$ and $\beta$ gives us the smoothed posterior — the probability of being in state $i$ at time $t$ given the entire observation sequence: $\gamma_t(i) = P(q_t = S_i \mid O, \lambda) \propto \alpha_t(i)\,\beta_t(i)$. This is essential for the Baum-Welch learning algorithm.
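A minimal sketch of the backward pass and the smoothed posterior $\gamma_t(i)$, using the standard backward recursion $\beta_T(i) = 1$, $\beta_t(i) = \sum_j a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$. The parameter values are illustrative assumptions (not from the text) and all function names are hypothetical.

```python
def forward_table(obs, pi, A, B):
    """alpha[t][i] for every time step (0-based t)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward_table(obs, A, B):
    """beta[t][i] for every time step, filled from the last step backwards."""
    n = len(A)
    beta = [[1.0] * n]                                  # beta_T(i) = 1
    for o_next in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o_next] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def state_posteriors(obs, pi, A, B):
    """gamma[t][i] = alpha_t(i) * beta_t(i) / P(O | lambda)."""
    alpha = forward_table(obs, pi, A, B)
    beta = backward_table(obs, A, B)
    likelihood = sum(alpha[-1])
    return [[a * b / likelihood for a, b in zip(alpha_t, beta_t)]
            for alpha_t, beta_t in zip(alpha, beta)]


# Illustrative parameters (assumed): state 0 = Hot, state 1 = Cold.
pi = [0.8, 0.2]
A = [[0.6, 0.4], [0.5, 0.5]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]

obs = [2, 0, 2]                                         # observations 3, 1, 3
beta = backward_table(obs, A, B)
gamma = state_posteriors(obs, pi, A, B)
```

Each row of `gamma` sums to 1, because $\sum_i \alpha_t(i)\,\beta_t(i) = P(O \mid \lambda)$ at every $t$; that identity is also a convenient sanity check on an implementation.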
Decoding — Viterbi's Algorithm
We want the single most probable state sequence for a given observation sequence — not just the most probable state at each individual time step.
Choosing the individually most likely state $q_t^* = \arg\max_i \gamma_t(i)$ at each step ignores the transition structure: the resulting sequence may have zero probability because it can contain impossible transitions (pairs of consecutive states with $a_{ij} = 0$).
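$\delta_t(i)$ = maximum probability over all state paths $q_1, \ldots, q_{t-1}$ of ending in state $S_i$ at time $t$ having observed $O_1, \ldots, O_t$. Like $\alpha$, but taking $\max$ instead of $\sum$: $\delta_1(i) = \pi_i\, b_i(O_1)$ and $\delta_t(j) = \max_i \big[\delta_{t-1}(i)\, a_{ij}\big]\, b_j(O_t)$, with $\psi_t(j) = \arg\max_i \delta_{t-1}(i)\, a_{ij}$.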
- $\delta_t(j)$ — best probability of reaching state $j$ at step $t$ after seeing $O_1\ldots O_t$
- $\psi_t(j)$ — the backpointer: which state at $t-1$ led to the best path ending in $j$ at $t$
- Termination finds the best final state, then backtracking reconstructs the full path
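The three ingredients listed above ($\delta$, the backpointers $\psi$, and backtracking) can be sketched as follows. The parameter values are illustrative assumptions for the ice-cream example (not from the text), observations are symbol indices, and `viterbi` is a hypothetical function name.

```python
def viterbi(obs, pi, A, B):
    """Return (most probable state path, its joint probability with obs)."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []                       # one psi list per step t = 2..T
    for o in obs[1:]:
        psi, new_delta = [], []
        for j in range(n):
            # Best predecessor of state j at this step.
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            psi.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        backpointers.append(psi)
        delta = new_delta

    # Termination: best final state, then follow the backpointers in reverse.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for psi in reversed(backpointers):
        path.append(psi[path[-1]])
    path.reverse()
    return path, delta[last]


# Illustrative parameters (assumed): state 0 = Hot, state 1 = Cold.
pi = [0.8, 0.2]
A = [[0.6, 0.4], [0.5, 0.5]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]

path, prob = viterbi([2, 0, 2], pi, A, B)   # observations 3, 1, 3
```

With these assumed values the best path for the observations 3, 1, 3 is Hot, Cold, Hot (`[0, 1, 0]`) with joint probability 0.0128. As with the forward algorithm, real implementations typically use log probabilities to avoid underflow.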
A classic application is part-of-speech (POS) tagging. Hidden states are POS tags (noun, verb, adjective…); observations are words. Viterbi finds the most probable tag sequence for a sentence. HMM taggers of this kind were a standard component of NLP pipelines before neural networks took over.
Learning — Baum-Welch (EM)
Given a set of observation sequences, we want to find the parameters $\lambda^* = (A^*, B^*, \Pi^*)$ that maximize $P(\mathcal{X} \mid \lambda)$. There is no closed-form solution — we use Expectation Maximization.
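$\xi_t(i,j)$ = probability of being in state $S_i$ at time $t$ and transitioning to $S_j$ at time $t+1$, given the full observation sequence: $\xi_t(i,j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$. Summing over $j$ recovers $\gamma_t(i)$.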
EM Intuition
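- E-step: run the forward and backward algorithms to compute $\gamma_t(i)$ and $\xi_t(i,j)$, the "soft" counts of how often each state and transition was used
- M-step: update $A$, $B$, $\Pi$ by normalizing the soft counts into probabilities: $\hat{\pi}_i = \gamma_1(i)$, $\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$, $\hat{b}_j(m) = \frac{\sum_{t:\, O_t = v_m} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$ (with several training sequences, accumulate the counts over all of them)
- Convergence: each iteration is guaranteed not to decrease $P(\mathcal{X} \mid \lambda)$, so the procedure converges, but only to a local optimum. No efficient method is known that guarantees finding the global optimum.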
Baum-Welch is sensitive to initialization and may converge to different local optima on different runs. In practice: run multiple random initializations and keep the best result.
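One EM iteration can be sketched as below for a single observation sequence. This is an unscaled illustration under stated assumptions: the starting parameters and the observation sequence are made up, the standard re-estimation formulas (normalized $\gamma$ and $\xi$ counts) are used, and `forward_backward` and `baum_welch_step` are hypothetical names. It is suitable only for short sequences because nothing guards against underflow.

```python
def forward_backward(obs, pi, A, B):
    """Return (alpha, beta, P(O | lambda)) for one observation sequence."""
    n, T = len(pi), len(obs)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return alpha, beta, sum(alpha[-1])


def baum_welch_step(obs, pi, A, B):
    """One EM iteration; returns the re-estimated (pi, A, B)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta, likelihood = forward_backward(obs, pi, A, B)

    # E-step: soft state counts gamma_t(i) and soft transition counts xi_t(i, j).
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]

    # M-step: normalize the soft counts into probabilities.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
              / sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_pi, new_A, new_B


# Assumed starting point and toy data (symbol indices 0, 1, 2).
pi = [0.8, 0.2]
A = [[0.6, 0.4], [0.5, 0.5]]
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]
obs = [2, 0, 2, 1, 0, 0, 2, 2]

likelihoods = []
for _ in range(5):
    likelihoods.append(forward_backward(obs, pi, A, B)[2])
    pi, A, B = baum_welch_step(obs, pi, A, B)
```

The recorded `likelihoods` should be non-decreasing from one iteration to the next, which gives a simple check on an implementation. To apply the multiple-restart advice, one would repeat the loop from several random starting points and keep the parameters with the highest final likelihood.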
HMM as a Graphical Model
The HMM is a special case of the chain-structured graphical models from Lecture 4. Its DAG structure reveals exactly why the efficient algorithms work.
Head-to-tail chain (hidden states)
$q_1 \to q_2 \to q_3 \to \cdots$ forms a chain. Given $q_t$, all $q_{t'}$ for $t' < t$ are d-separated from all $q_{t''}$ for $t'' > t$. This is the Markov property in graphical model terms.
Tail-to-tail (observations)
Each $O_t$ has $q_t$ as its only parent. Observations are conditionally independent of everything else given their generating state — this justifies the emission probability factorization.
The forward algorithm passes messages left-to-right ($\alpha$ variables) and the backward algorithm passes messages right-to-left ($\beta$ variables). Together they are exactly belief propagation (BP) on the chain-structured graphical model of the HMM. Baum-Welch (EM) uses this BP pass as its E-step and alternates it with the M-step re-estimation to learn the CPTs.
Next: Rule-Based Learners & Decision Trees (Ch. 9). We move away from probabilistic models to symbolic, interpretable learners: decision trees, pruning strategies, and rule induction algorithms.