# Graphical Models

## Graphical Models

### Summary

• Graphical Models
• Markov Processes
• Hidden Markov Models
• Baum-Welch and Viterbi algorithms

# Graphical Models

## Graphical Models

### Probabilistic Graphical Models

• Convenient representation of probabilistic relations
• Bayesian networks: directed acyclic graphs representing conditional probability dependencies
• Markov Random Fields: undirected graphs representing dependencies
• Dynamic Bayesian networks: Bayesian networks representing time series

## Graphical Models

### Bayesian networks

• Variables, $a, b, c$, factorize joint probability
• $$P(a,b,c)= P(c|a,b)P(b|a)P(a)$$

## Graphical Models

### Bayesian networks

• Different possible factorizations:
• $$P(a,b,c)= P(a|b,c)P(b|c)P(c) \qquad P(a,b,c)= P(c|a,b)P(b|a)P(a)$$

## Graphical Models

### Bayesian networks

• We can use it to represent causal knowledge:

• The graph:

## Bayesian networks

• Node, one variable:

## Bayesian networks

• Arc representing conditional dependency:

## Bayesian networks

• Parents and offspring:

## Bayesian networks

• We can indicate an observed variable by shading the corresponding node

## Bayesian networks

• Or parameters as special nodes

## Bayesian networks

$$P(\mathbf{x}) = \prod \limits_{n=1}^{N} P\left(x_n|\text{parents}(x_n)\right)$$
• Product of the probabilities of all variables, each conditioned on its parents (e.g. Bishop, p. 362)

$$p(x_1,...,x_7)=p(x_1)p(x_2)p(x_3)p(x_4|x_1,x_2,x_3)p(x_5|x_1,x_3)p(x_6|x_4)p(x_7|x_4,x_5)$$
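The factorization can be checked numerically. A minimal sketch for the three-variable case, with made-up CPT values (all the numbers below are assumptions for illustration):

```python
# Toy CPTs (made-up numbers) for P(a), P(b|a), P(c|a,b).
P_a = {0: 0.6, 1: 0.4}
P_b_a = {(0, 0): 0.7, (1, 0): 0.3,            # key: (b, a)
         (0, 1): 0.2, (1, 1): 0.8}
P_c_ab = {(0, 0, 0): 0.9, (1, 0, 0): 0.1,     # key: (c, a, b)
          (0, 0, 1): 0.5, (1, 0, 1): 0.5,
          (0, 1, 0): 0.3, (1, 1, 0): 0.7,
          (0, 1, 1): 0.4, (1, 1, 1): 0.6}

def joint(a, b, c):
    """P(a,b,c) = P(c|a,b) P(b|a) P(a), as in the factorization above."""
    return P_c_ab[(c, a, b)] * P_b_a[(b, a)] * P_a[a]

# A valid factorization must sum to 1 over all assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```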

## Bayesian networks

### Conditional independence in BN

• Two variables are conditionally independent given a third iff:
• $$P(a,b|c)= P(a|c)P(b|c)$$
• This tells us which variables are relevant for a variable or a set of variables given some information
• Also written as: $$a \perp\!\!\!\perp b \ | \ c$$

## Bayesian networks

### Conditional independence in BN

• "Tail to tail" on $c$; $a,b$ not independent

• \begin{aligned} P(a,b,c)&=P(a|c)P(b|c)P(c)\\ \\ P(a,b)&=\sum\limits_{c}P(a|c)P(b|c)P(c)\\ &\neq P(a)P(b) \end{aligned}
• $a$ and $b$ are not independent because both depend on $c$

## Bayesian networks

### Conditional independence in BN

• $a,b$ conditionally independent given $c$

• \begin{aligned} P(a,b|c)&=\displaystyle\frac{P(a,b,c)}{P(c)}\\ &=\displaystyle\frac{P(a|c)P(b|c)P(c)}{P(c)}\\ &=P(a|c)P(b|c) \end{aligned}
• Knowing $c$, $a$ and $b$ do not provide information about each other

## Bayesian networks

### Conditional independence in BN

• Naïve Bayes classifier $$P(c,x_1,...,x_n) = P(c)P(x_1|c)...P(x_n|c)$$
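This factorization turns directly into a classifier: the posterior over the class is proportional to $P(c)\prod_i P(x_i|c)$. A sketch with two hypothetical classes and made-up binary-feature likelihoods:

```python
# Hypothetical class priors and per-feature likelihoods (made-up numbers).
P_c = {"spam": 0.3, "ham": 0.7}
P_x1 = {"spam": 0.8, "ham": 0.1}   # P(x1 = 1 | c)
P_x2 = {"spam": 0.6, "ham": 0.4}   # P(x2 = 1 | c)

def posterior(x1, x2):
    """Normalized P(c | x1, x2) under the naive Bayes factorization."""
    scores = {}
    for c in P_c:
        p1 = P_x1[c] if x1 else 1 - P_x1[c]
        p2 = P_x2[c] if x2 else 1 - P_x2[c]
        scores[c] = P_c[c] * p1 * p2
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

post = posterior(1, 1)   # both features present
```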

## Bayesian networks

### Conditional independence in BN

• "Head to tail" on $c$; $a,b$ not independent ($c$ conveys information)
• $a,b$ conditionally independent given $c$
• $$P(a,b|c)= \frac{P(a)P(c|a)P(b|c)}{P(c)} =\displaystyle\frac{P(c|a)P(a)}{P(c)}P(b|c) = P(a|c)P(b|c)$$
• A known $c$ "blocks" information flow from $a$ to $b$

## Bayesian networks

### Conditional independence in BN

• "Head to head" on $c$; $a,b$ independent

• \begin{aligned} P(a,b,c)&=P(a)P(b)P(c|a,b) \\ \\ P(a,b)&=P(a)P(b)\sum\limits_{c}P(c|a,b) \\ &= P(a)P(b) \end{aligned}
• While $c$ is unobserved, $a$ and $b$ are (marginally) independent

## Bayesian networks

### Conditional independence in BN

• $a,b$ conditionally dependent given $c$
\begin{aligned} \\ P(a,b|c)&=\displaystyle\frac{P(a,b,c)}{P(c)}\\ &=\displaystyle\frac{P(a)P(b)P(c|a,b)}{P(c)}\\ &\neq P(a|c)P(b|c) \end{aligned}
• Fixing $c$ can affect the probability distributions of $a$ and $b$
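This "explaining away" effect can be verified by enumeration. A sketch with an assumed toy model: two independent fair coins $a, b$ and a made-up $P(c|a,b)$:

```python
# Assumed toy model: a, b independent coins; c depends on both.
P_a = {0: 0.5, 1: 0.5}
P_b = {0: 0.5, 1: 0.5}
P_c1_ab = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}  # P(c=1|a,b)

def joint(a, b, c):
    pc1 = P_c1_ab[(a, b)]
    return P_a[a] * P_b[b] * (pc1 if c == 1 else 1 - pc1)

# Marginally: P(a=1, b=1) equals P(a=1) P(b=1) = 0.25
p_ab = sum(joint(1, 1, c) for c in (0, 1))

# Conditioning on c = 1 couples a and b:
p_c1 = sum(joint(a, b, 1) for a in (0, 1) for b in (0, 1))
p_ab_c1 = joint(1, 1, 1) / p_c1
p_a_c1 = sum(joint(1, b, 1) for b in (0, 1)) / p_c1
p_b_c1 = sum(joint(a, 1, 1) for a in (0, 1)) / p_c1
```

Marginally the product rule holds exactly, but given $c=1$ the joint no longer factorizes.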

## Bayesian networks

### Conditional independence in BN

• $a,b$ independent
• Cable or router problems are independent

## Bayesian networks

### Conditional independence in BN

• $a,b$ conditionally dependent given $c$

• If we observe that there is no access ($c$), then the state of one (cable) changes the probabilities of the other (router)

## Bayesian networks

### Conditional independence in BN

• Dependence-separation (d-separation) rules:
• Sets $A$ and $B$ are d-separated by $C$ if every path between $A$ and $B$ is blocked, i.e. the path contains:
• A node $d \in C$ where arrows meet "head-to-tail" or "tail-to-tail": knowing $d$ blocks the path
• Or a "head-to-head" node $d$ such that neither $d$ nor any of its descendants is in $C$
• (Knowing such a $d$, or one of its descendants, would make the adjacent nodes dependent and unblock the path)

## Bayesian networks

### Conditional independence in BN

• Dependence-separation (d-separation) allows assessing which evidence is relevant to which factor by graph analysis.
• Example: $$blood\ sugar \rightarrow stomach \ acidity \rightarrow \ hunger$$
• If this model is correct, then the correlation between blood sugar levels and hunger should disappear if we account for stomach acidity levels.

## Bayesian networks

### Conditional independence in BN

• Dependence-separation (d-separation) allows assessing which evidence is relevant to which factor by graph analysis.
• Example: $$hunger \leftarrow \ blood\ sugar \rightarrow stomach \ acidity$$
• If this model is correct, then the correlation between blood sugar levels and hunger should not disappear if we account for stomach acidity levels.

## Markov Random Fields

• An undirected graph can represent some conditional independence relations more conveniently

### Markov Random Fields

• Undirected graph $G=(\mathbf{N},\mathbf{A})$
• Random variables $\mathbf{X}=\{x_n\} \ , n\in \mathbf{N}$
• Non-adjacent variables are independent given all the rest $$x_u \perp\!\!\!\perp x_v \ | \ X_{\mathbf{N}\backslash\{u,v\}} \ , \ (u,v) \not\in \mathbf{A}$$
• Each variable is independent of the others given its neighbours $ne(v)$, with $cl(v)=\{v\}\cup ne(v)$ $$x_v \perp\!\!\!\perp X_{\mathbf{N}\backslash cl(v)} \ | \ X_{ne(v)}$$
• $\mathbf{U} \perp\!\!\!\perp \mathbf{V}\ | \mathbf{S}$ if all paths between $\mathbf{U}$ and $\mathbf{V}$ go through $\mathbf{S}$

## Markov Random Fields

• Example (Bishop, p.384) $$\mathbf{A} \perp\!\!\!\perp \mathbf{B}\ | \mathbf{C}$$

## Markov Random Fields

• We can represent conditional independence in this way if the joint distribution is the product of functions of the cliques of the graph.
• A clique is a set of nodes that are all interconnected
• A maximal clique is a clique to which no nodes can be added and still remain a clique. We can write the joint distribution as a product of functions of maximal cliques.

## Markov Random Fields

• The joint probability is: $$P(\mathbf{X}) = \frac{1}{Z}\prod \limits_C \psi_C(\mathbf{x}_C)$$
• Where $\psi_C$ is a potential function for each (maximal) clique and $Z$ is a normalization constant $$Z =\sum \limits_{x}\prod \limits_C \psi_C(\mathbf{x}_C)$$
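For a tiny graph, $Z$ can be computed by brute force. A sketch assuming a 3-node chain $x_1 - x_2 - x_3$ (maximal cliques $\{x_1,x_2\}$ and $\{x_2,x_3\}$) with a made-up pairwise potential:

```python
import itertools

def psi(xi, xj):
    # Made-up pairwise potential favouring equal neighbours.
    return 2.0 if xi == xj else 1.0

# Brute-force normalization constant Z over all 2^3 states.
states = list(itertools.product((0, 1), repeat=3))
Z = sum(psi(x1, x2) * psi(x2, x3) for x1, x2, x3 in states)

def P(x1, x2, x3):
    """Joint as a normalized product of maximal-clique potentials."""
    return psi(x1, x2) * psi(x2, x3) / Z
```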

## Markov Random Fields

• Example, denoising image (Bishop, 388)

## Markov Random Fields

• Noise-free pixels $x_i$ and observed noisy pixels $y_i$: $$x_i, y_i \in \{-1,1\}$$

## Markov Random Fields

• "Energy" function with pairwise, data and bias terms: $$E(\mathbf{x},\mathbf{y})= -\beta \sum\limits_{\{i,j\}} x_i x_j \ -\eta \sum\limits_{i} x_i y_i \ + h \sum\limits_{i} x_i$$

• $$P(\mathbf{x},\mathbf{y})=\frac{1}{Z} e^{-E(\mathbf{x},\mathbf{y})}$$

## Markov Random Fields

• Find $\mathbf{x}$ that maximizes $\frac{1}{Z}e^{-E(\mathbf{x},\mathbf{y})}$
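A simple approximate way to do this is to greedily flip each pixel whenever the flip lowers the energy (iterated conditional modes). A sketch, where $\beta$, $\eta$, $h$ and the toy image are assumptions:

```python
# Assumed coefficients for the energy function.
beta, eta, h = 1.0, 2.0, 0.0

def neighbours(i, j, H, W):
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < H and 0 <= j + dj < W:
            yield i + di, j + dj

def local_energy(x, y, i, j, H, W):
    """Terms of E(x, y) that involve pixel x[i][j]."""
    s = x[i][j]
    e = h * s - eta * s * y[i][j]
    e -= beta * sum(s * x[a][b] for a, b in neighbours(i, j, H, W))
    return e

def icm(y, sweeps=5):
    """Greedy coordinate descent on E(x, y), starting from x = y."""
    H, W = len(y), len(y[0])
    x = [row[:] for row in y]
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                e_now = local_energy(x, y, i, j, H, W)
                x[i][j] = -x[i][j]                   # try the flip
                e_flip = local_energy(x, y, i, j, H, W)
                if e_now <= e_flip:                  # revert if not better
                    x[i][j] = -x[i][j]
    return x
```

ICM only finds a local minimum of the energy, but on a nearly-clean image it removes isolated flipped pixels.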

## Graphical Models

### Bayesian Networks vs Markov Random Fields

• These graph models represent dependencies differently
• Bayesian Networks are directed acyclic graphs of conditional dependencies
• Bayesian Networks can represent some dependencies that MRF cannot represent (e.g. "head-to-head")

## Graphical Models

### Bayesian Networks vs Markov Random Fields

• These graph models represent dependencies differently
• Markov Random Fields are undirected graphs
• Markov Random Fields can include cyclic dependencies that Bayesian Networks cannot represent

## Dynamic Bayesian Networks

• Dynamic Bayesian Networks relate variables over time steps.

### DBN and Markov Process

• One thing we can represent with a DBN is a Markov process.
• In a Markov process, the probability distribution of $x_t$ depends only on $x_{t-1}$.
• $$p(x_1,...,x_T) = p(x_1)\prod \limits_{t=2}^{T}p(x_t|x_{t-1})$$
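The factorization above, sketched for an assumed 2-state chain (the transition values are made up):

```python
# Assumed initial distribution and transition matrix.
pi = [0.5, 0.5]                       # p(x_1)
A = [[0.9, 0.1],                      # A[i][j] = p(x_t = j | x_{t-1} = i)
     [0.2, 0.8]]

def sequence_prob(xs):
    """p(x_1..x_T) = p(x_1) * prod_t p(x_t | x_{t-1})."""
    p = pi[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        p *= A[prev][cur]
    return p
```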

# Hidden Markov Models

## HMM

### Hidden Markov Models

• A Hidden Markov Model is a Markov chain of hidden variables each with an observation conditioned on its state.
• $$p(x_1,...,x_T,y_1,...,y_T) = p(x_1)\prod \limits_{t=2}^{T}p(x_t|x_{t-1})\prod \limits_{t=1}^{T}p(y_t|x_t)$$
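The joint factorization, sketched for an assumed 2-state, 2-symbol HMM (all parameter values are made up):

```python
# Assumed HMM parameters.
pi = [0.6, 0.4]                        # p(x_1)
A = [[0.7, 0.3], [0.4, 0.6]]           # A[i][j] = p(x_t = j | x_{t-1} = i)
B = [[0.9, 0.1], [0.2, 0.8]]           # B[j][k] = p(y_t = k | x_t = j)

def hmm_joint(xs, ys):
    """p(x_1..x_T, y_1..y_T) from the factorization above."""
    p = pi[xs[0]] * B[xs[0]][ys[0]]
    for t in range(1, len(xs)):
        p *= A[xs[t-1]][xs[t]] * B[xs[t]][ys[t]]
    return p
```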

## HMM

• We can also represent a HMM with a graph showing state transitions and emissions.
• (Figure: state transition and emission diagram; Tdunning, CC BY 3.0, Wikimedia)

## HMM

### Hidden Markov Model parameters

• To specify a HMM with $N$ states, assuming $P(x_t|x_{t-1})$, we need the $N\times N$ matrix: $$A_{i,j} = P(x_t= j | x_{t-1} = i)$$
• $N \times (N-1)$ independent parameters, because each row must sum to one
• If $y_t$ can take one of $K$ values, the $K \times N$ matrix: $$B_{k,j} = P(y_t = k | x_t = j)$$
• $N \times (K-1)$ independent parameters, for the same reason
• Finally, the $N$ probabilities for $x_1$: $$\pi_i = P(x_1 = i)$$
• $N-1$ independent parameters

## HMM

### Uses for Hidden Markov Models

• Temporal pattern recognition
• Voice recognition, sign language, gait recognition
• Pattern recognition in sequences
• Text tagging, sequence comparison in bioinformatics, genome annotation (finding genes)

## HMM

• Estimate the current state (filtering)
• With $A, B, \pi$, the belief state at $t$ after observing $y_{1:t}$: $$\alpha_t(x_t) =p(x_t,y_{1:t})$$
• Summing directly over all possible state sequences is exponentially expensive
• Use the forward algorithm instead:
• Since $y_t$ depends only on $x_t$, and $x_t$ only on $x_{t-1}$:
$\alpha_t(x_t) = p(y_t|x_t) \sum \limits_{x_{t-1}=1}^{N} p(x_t|x_{t-1})\alpha_{t-1}(x_{t-1})$
$\alpha_1(x_1=i) = p(y_1|x_1=i)\, \pi_i$
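The recursion above, sketched in code ($\pi$, $A$, $B$ are assumed toy parameters):

```python
# Assumed toy parameters.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]           # A[i][j] = p(x_t = j | x_{t-1} = i)
B = [[0.9, 0.1], [0.2, 0.8]]           # B[j][k] = p(y_t = k | x_t = j)

def forward(ys):
    """alpha[t][j] = p(x_t = j, y_1..y_t), built left to right."""
    N = len(pi)
    alpha = [[pi[j] * B[j][ys[0]] for j in range(N)]]
    for y in ys[1:]:
        prev = alpha[-1]
        alpha.append([B[j][y] * sum(A[i][j] * prev[i] for i in range(N))
                      for j in range(N)])
    return alpha

# The observation likelihood p(y_1..y_T) is the sum of the last column.
alpha = forward([0, 1, 0])
likelihood = sum(alpha[-1])
```

The cost is $O(N^2 T)$ instead of $O(N^T)$ for brute-force enumeration.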

## HMM

• Knowing $A, B, \pi$ we can also compute the most likely sequence of $x_1, ..., x_t$ given $y_1, ..., y_t$ (Viterbi algorithm)
• Two matrices of $N\times T$:
• Probability of the most likely path for each state $x_j$
• Previous state of the most likely path for each state $x_j$
• These matrices can be built in order, from previous state
• Then we trace back the states of $x_j$ used at each step to get the most likely path
• Applications of the Viterbi algorithm
• Speech synthesis and recognition
• Speaker identification (diarization)
• Wi-Fi 802.11 (Viterbi decoding)
• Bioinformatics (sequence analysis)
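The two-table construction and traceback described above can be sketched as follows (toy parameters are assumptions):

```python
# Assumed toy parameters.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]           # transitions
B = [[0.9, 0.1], [0.2, 0.8]]           # emissions

def viterbi(ys):
    """Most likely state sequence given observations ys."""
    N = len(pi)
    prob = [[pi[j] * B[j][ys[0]] for j in range(N)]]   # best-path probability
    back = [[0] * N]                                   # predecessor state
    for y in ys[1:]:
        p_row, b_row = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: prob[-1][i] * A[i][j])
            p_row.append(prob[-1][best_i] * A[best_i][j] * B[j][y])
            b_row.append(best_i)
        prob.append(p_row)
        back.append(b_row)
    # Trace back from the most likely final state.
    path = [max(range(N), key=lambda j: prob[-1][j])]
    for t in range(len(ys) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Note the only difference from the forward algorithm: a max over predecessors instead of a sum, plus the backpointer table.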

## HMM

### Note: forward algorithm and Viterbi algorithm

• Not the same thing!
• Forward algorithm:
• Probability of state at each time given current and previous observations
• Viterbi algorithm:
• Most likely sequence of states given a set of observations

## HMM

• To train the model and find $(A, B, \pi)$ we use the Baum-Welch algorithm, which is another example of Expectation-Maximization
• Start with a guess for $(A, B, \pi)$
• The forward pass: $\alpha_i(t) = p(x_t=i,y_{1:t}|A,B,\pi)$
$$\alpha_i(1) = \pi_i B_{i,y_1} \qquad \alpha_i(t) = B_{i,y_t} \sum \limits_{n=1}^{N} \alpha_n(t-1)A_{n,i}$$
• The backward pass: $\beta_i(t) = p(y_{t+1:T}|x_t=i,A,B,\pi)$
$$\beta_i(T) = 1 \qquad \beta_i(t) = \sum \limits_{n=1}^{N} \beta_n(t+1)A_{i,n} B_{n,y_{t+1}}$$
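The backward recursion, sketched with the same assumed toy parameters:

```python
# Assumed toy parameters.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def backward(ys):
    """beta[t][i] = p(y_{t+1}..y_T | x_t = i), built right to left."""
    N, T = len(pi), len(ys)
    beta = [[1.0] * N]                 # beta_i(T) = 1
    for t in range(T - 1, 0, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][n] * B[n][ys[t]] * nxt[n] for n in range(N))
                        for i in range(N)])
    return beta
```

A consistency check: $\sum_i \alpha_i(t)\beta_i(t)$ equals $p(y_{1:T})$ at every $t$; at $t=1$ this is $\sum_i \pi_i B_{i,y_1} \beta_i(1)$.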

## HMM

### Expectation step

• From these we can compute the posterior probabilities of the states and transitions given the observations, under the previous $(A, B, \pi)$:
• $$\gamma_i(t) = p(x_t = i|\mathbf{y}, A, B, \pi) = \frac{\alpha_i(t)\beta_i(t)}{\sum \limits_{n=1}^{N} \alpha_n(t)\beta_n(t)}$$
$$\xi_{ij}(t) = p(x_t=i,x_{t+1}=j|\mathbf{y},A,B,\pi) =\frac{\alpha_i(t)A_{i,j}\beta_j(t+1)B_{j,y_{t+1}}} {\sum \limits_{m=1}^{N}\sum \limits_{n=1}^{N} \alpha_m(t) A_{m,n} \beta_n(t+1) B_{n,y_{t+1}}}$$

## HMM

### Maximization step

• Now we update $\pi$, the expected fraction of each state at the beginning:
$$\pi_i = \gamma_i(1)$$
• $A$, the expected fraction of each transition w.r.t. the expected fraction of each state:
$$A_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_{i,j}(t)}{\sum_{t=1}^{T-1} \gamma_{i}(t)}$$
• $B$, the expected fraction of each emission w.r.t. the expected fraction of each state:
$$B_{i,k}= \frac{\sum_{t=1}^{T} 1_{y_t = k}\,\gamma_i(t)}{\sum_{t=1}^{T} \gamma_{i}(t)}$$
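A single E+M iteration combining the forward/backward passes with the update formulas above (the parameters and observation sequence used in the test are assumptions):

```python
def baum_welch_step(ys, pi, A, B):
    """One EM iteration: returns re-estimated (pi, A, B)."""
    N, T = len(pi), len(ys)
    # Forward pass
    alpha = [[pi[j] * B[j][ys[0]] for j in range(N)]]
    for t in range(1, T):
        alpha.append([B[j][ys[t]] * sum(A[i][j] * alpha[t-1][i]
                                        for i in range(N)) for j in range(N)])
    # Backward pass
    beta = [[0.0] * N for _ in range(T)]
    beta[T-1] = [1.0] * N
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][n] * B[n][ys[t+1]] * beta[t+1][n]
                       for n in range(N)) for i in range(N)]
    py = sum(alpha[T-1])               # observation likelihood p(y)
    # E-step: state and transition posteriors
    gamma = [[alpha[t][i] * beta[t][i] / py for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][ys[t+1]] * beta[t+1][j] / py
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: re-estimate parameters as expected fractions
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    K = len(B[0])
    new_B = [[sum(gamma[t][i] for t in range(T) if ys[t] == k) /
              sum(gamma[t][i] for t in range(T)) for k in range(K)]
             for i in range(N)]
    return new_pi, new_A, new_B

new_pi, new_A, new_B = baum_welch_step([0, 1, 0],
                                       [0.6, 0.4],
                                       [[0.7, 0.3], [0.4, 0.6]],
                                       [[0.9, 0.1], [0.2, 0.8]])
```

The updates always produce valid distributions: $\gamma$ and $\xi$ are posteriors, so each re-estimated row sums to one.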

## HMM

• Baum-Welch algorithm (Expectation-Maximization)
• Compute the joint probability of the hidden state and the emissions so far, at each time (forward pass).
• Compute the probability of the future emissions given the hidden state, at each time (backward pass).
• Compute the probability of the hidden state at each time, given all emissions.
• Compute the joint probability of consecutive hidden states at each time, given all emissions.
• Use these to find new parameter values
• (Does not guarantee optimum and may overfit)

## HMM

• HMMs are generative models (they model joint probabilities), so they can be used both to generate and to classify
• To classify, we can compute the probability of the most likely state sequence under each HMM and pick the best. Examples: Pfam, speech recognition, time-series analysis, ...
• To generate, we can sample from the joint distribution. E.g. speech synthesis, generating values for frequency and duration.

# Summary

## Summary

• Graphical models: BN and MRF
• Markov process
• Hidden Markov Model
• Viterbi algorithm: most likely sequence
• Baum-Welch algorithm: EM training of HMM