- Graphical Models
- Markov Processes
- Hidden Markov Models
- Baum-Welch and Viterbi algorithms

- Convenient representation of probabilistic relations
- Bayesian networks: directed acyclic graphs representing conditional probability dependencies
- Markov Random Fields: undirected graphs representing dependencies
- Dynamic Bayesian networks: Bayesian networks representing time series

- Variables, $a, b, c$, factorize joint probability $$P(a,b,c)= P(c|a,b)P(b|a)P(a)$$

- Different possible factorizations: $$P(a,b,c)= P(a|b,c)P(b|c)P(c) \qquad P(a,b,c)= P(c|a,b)P(b|a)P(a)$$

- We can use it to represent causal knowledge:

- The graph:

- Node, one variable:

- Arc representing conditional dependency:

- Parents and offspring:

- We can indicate an observed variable by shading the corresponding node

- Or parameters as special nodes

- The joint distribution is the product of the probability of each variable conditioned on its parents (e.g. Bishop, p.362)

$$p(x_1,...,x_7)=p(x_1)p(x_2)p(x_3)p(x_4|x_1,x_2,x_3)p(x_5|x_1,x_3)p(x_6|x_4)p(x_7|x_4,x_5)$$
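- In general, for $K$ variables, the joint distribution factorizes as $$p(x_1,...,x_K)=\prod \limits_{k=1}^{K}p(x_k|\mathrm{pa}_k)$$ where $\mathrm{pa}_k$ denotes the parents of $x_k$ (cf. Bishop, Sec. 8.1)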

- Two variables are conditionally independent given a third iff: $$P(a,b|c)= P(a|c)P(b|c)$$
- This tells us which variables are relevant for a variable or a set of variables given some information
- Also written as: $$a \perp\!\!\!\perp b \ | \ c$$

- "Tail to tail" on $c$; $a,b$ not independent
- $a$ and $b$ are not independent because both depend on $c$

$$\begin{aligned} P(a,b,c)&=P(a|c)P(b|c)P(c)\\ \\ P(a,b)&=\sum\limits_{c}P(a|c)P(b|c)P(c)\\ &\neq P(a)P(b) \end{aligned}$$

- $a,b$ conditionally independent given $c$
- Knowing $c$, $a$ and $b$ do not provide information about each other

$$\begin{aligned} P(a,b|c)&=\displaystyle\frac{P(a,b,c)}{P(c)}\\ &=\displaystyle\frac{P(a|c)P(b|c)P(c)}{P(c)}\\ &=P(a|c)P(b|c) \end{aligned}$$

- Naïve Bayes classifier $$P(c,x_1,...,x_n) = P(c)P(x_1|c)...P(x_n|c)$$
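A minimal sketch of how this factorization is used for classification; the class prior $P(c)$ and per-feature likelihoods $P(x_i|c)$ below are hypothetical values for a 2-class, 3-feature problem, and the posterior is obtained by normalizing the joint $P(c)\prod_i P(x_i|c)$.

```python
import numpy as np

# Hypothetical parameters for a 2-class, 3-feature Naive Bayes model.
prior = np.array([0.6, 0.4])              # P(c)
p_feat = np.array([[0.8, 0.1, 0.5],       # P(x_i = 1 | c = 0)
                   [0.3, 0.7, 0.9]])      # P(x_i = 1 | c = 1)

def predict(x):
    """Return P(c | x) for a binary feature vector x via the factorized joint."""
    x = np.asarray(x)
    likelihood = np.prod(np.where(x == 1, p_feat, 1 - p_feat), axis=1)
    joint = prior * likelihood            # P(c) * prod_i P(x_i | c)
    return joint / joint.sum()            # normalize to obtain the posterior

print(predict([1, 0, 1]))
```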

- "Head to tail" on $c$; $a,b$ not independent ($c$ conveys information)
- $a,b$ conditionally independent given $c$ $$ P(a,b|c)= \frac{P(a)P(c|a)P(b|c)}{P(c)} =\displaystyle\frac{P(c|a)P(a)}{P(c)}P(b|c) = P(a|c)P(b|c) $$
- A known $c$ "blocks" information flow from $a$ to $b$

- "Head to head" on $c$; $a,b$ independent
- Even though $c$ depends on both, with $c$ unobserved $a$ and $b$ carry no information about each other

$$\begin{aligned}
P(a,b,c)&=P(a)P(b)P(c|a,b) \\ \\
P(a,b)&=P(a)P(b)\sum\limits_{c}P(c|a,b) \\
&= P(a)P(b)
\end{aligned}$$

- $a,b$ conditionally dependent given $c$

$$\begin{aligned} P(a,b|c)&=\displaystyle\frac{P(a,b,c)}{P(c)}\\ &=\displaystyle\frac{P(a)P(b)P(c|a,b)}{P(c)}\\ &\neq P(a|c)P(b|c) \end{aligned}$$

- Fixing $c$ can affect the probability distributions of $a$ and $b$

- $a,b$ independent

- Cable or router problems are independent

- $a,b$ conditionally dependent given $c$

- If we observe that we have no access, then learning the state of one (e.g. the router) changes the probability of the other (the cable)
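A small numerical sketch of this "explaining away" effect, with hypothetical probabilities: $a$ = cable problem, $b$ = router problem, $c$ = no access, where no access occurs exactly when at least one problem is present.

```python
# Hypothetical priors for the two independent causes.
p_cable, p_router = 0.1, 0.1

def p_no_access(cable, router):
    # c depends on both parents: no access iff at least one problem is present.
    return 1.0 if (cable or router) else 0.0

def posterior_cable(router_known=None):
    """P(cable problem | no access [, router state]) by enumerating P(a)P(b)P(c|a,b)."""
    num = den = 0.0
    for cable in (0, 1):
        for router in (0, 1):
            if router_known is not None and router != router_known:
                continue
            p = ((p_cable if cable else 1 - p_cable)
                 * (p_router if router else 1 - p_router)
                 * p_no_access(cable, router))
            den += p
            num += p * cable
    return num / den

print(posterior_cable())                 # P(cable bad | no access)             ~ 0.53
print(posterior_cable(router_known=1))   # P(cable bad | no access, router bad) = 0.10
```

Observing that the router is faulty "explains away" the lack of access, dropping the probability of a cable problem back to its prior.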

- Dependence-separation (*d-separation*) rules:
- Sets $A$ and $B$ are *d-separated* by $C$ if every path between $A$ and $B$ contains at least one of:
- A node $d \in C$ that is "head-to-tail" or "tail-to-tail" on the path (at least one arrow comes out of $d$): knowing $d$ makes the adjacent nodes independent
- A "head-to-head" node $d$ such that neither $d$ nor any of its descendants is in $C$: knowing $d$ would make the adjacent nodes dependent

- Dependence-separation (*d-separation*) allows assessing which evidence is relevant to which factor by graph analysis.
- Example: $$blood\ sugar \rightarrow stomach \ acidity \rightarrow \ hunger$$
- If this model is correct, then the correlation between blood sugar levels and hunger should disappear if we account for stomach acidity levels.

- Dependence-separation (*d-separation*) allows assessing which evidence is relevant to which factor by graph analysis.
- Example: $$hunger \leftarrow \ blood\ sugar \rightarrow stomach \ acidity $$
- If this model is correct, then the correlation between blood sugar levels and hunger should not disappear if we account for stomach acidity levels.

- In directed graphical models, reading off conditional independence requires the d-separation rules (head-to-tail, tail-to-tail, head-to-head)
- Undirected graphs can represent conditional independence more directly

- Undirected graph $G=(\mathbf{N},\mathbf{A})$
- Random variables $\mathbf{X}=x_n \ , n\in \mathbf{N}$
- Non-adjacent variables are independent given all the rest: $$x_u \perp\!\!\!\perp x_v \ | \ X_{\mathbf{N}\backslash\{u,v\}} \ , \ (u,v) \not\in \mathbf{A}$$
- Each variable is independent of all the others given its neighbours ($cl(v)=\{v\}\cup ne(v)$): $$x_v \perp\!\!\!\perp X_{\mathbf{N}\backslash cl(v)} \ | \ X_{ne(v)}$$
- $\mathbf{U} \perp\!\!\!\perp \mathbf{V}\ | \mathbf{S}$ if all paths between $\mathbf{U}$ and $\mathbf{V}$ go through $\mathbf{S}$

- Example (Bishop, p.384) $$\mathbf{A} \perp\!\!\!\perp \mathbf{B}\ | \mathbf{C}$$

- We can represent conditional independence in this way if the joint distribution is the product of functions of the cliques of the graph.
- A *clique* is a set of nodes that are all interconnected.
- A *maximal clique* is a clique to which no node can be added such that the result is still a clique. We can write the joint distribution as a product of functions of the maximal cliques.

- The joint probability is: $$P(\mathbf{X}) = \frac{1}{Z}\prod \limits_C \psi_C(\mathbf{x}_C)$$
- Where $\psi_C$ is a potential function for each (maximal) clique and $Z$ is a normalization constant $$Z =\sum \limits_{x}\prod \limits_C \psi_C(\mathbf{x}_C)$$
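A toy sketch of this factorization, assuming a 3-node chain MRF over binary variables with two maximal (pairwise) cliques and hypothetical potential tables; $Z$ is computed by brute-force enumeration.

```python
import itertools
import numpy as np

# Hypothetical potentials for the maximal cliques (x1, x2) and (x2, x3);
# each table is indexed by the binary states of the two variables in the clique.
psi_12 = np.array([[2.0, 0.5],
                   [0.5, 2.0]])
psi_23 = np.array([[1.5, 1.0],
                   [1.0, 1.5]])

def unnormalized(x1, x2, x3):
    return psi_12[x1, x2] * psi_23[x2, x3]

# Normalization constant Z: sum of the product of potentials over all joint states.
Z = sum(unnormalized(*x) for x in itertools.product((0, 1), repeat=3))

def joint(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

# The joint now sums to 1 over all configurations.
print(joint(0, 0, 0), sum(joint(*x) for x in itertools.product((0, 1), repeat=3)))
```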

- Example: image denoising (Bishop, p.388)

$$x_i, y_i \in \{-1,1\}$$

- "Energy" function $E(\mathbf{x},\mathbf{y})=$
- $x_i,x_j$ pairs: $-\beta \sum x_i x_j$
- $x_i,y_i$ pairs: $-\eta \sum x_i y_i$
- $x$ bias: $ + h \sum x_i$

$$P(\mathbf{x},\mathbf{y})=\frac{1}{Z} e^{-E(\mathbf{x},\mathbf{y})}$$

- Find $\mathbf{x}$ that maximizes $\frac{1}{Z}e^{-E(\mathbf{x},\mathbf{y})}$
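One simple way to attempt this maximization is iterated conditional modes (ICM): sweep over the pixels and flip each $x_i$ to whichever of $\{-1,+1\}$ lowers the energy with the others held fixed. The sketch below assumes a 4-neighbour grid and hypothetical parameter values.

```python
import numpy as np

def icm_denoise(y, beta=1.0, eta=2.0, h=0.0, sweeps=5):
    """Approximate argmax_x p(x, y) = (1/Z) exp(-E(x, y)) by greedy,
    pixel-wise energy minimization (iterated conditional modes)."""
    x = y.copy()                     # initialize the latent image at the noisy one
    rows, cols = y.shape
    for _ in range(sweeps):
        for i in range(rows):
            for j in range(cols):
                # Sum of the 4-neighbour latent pixels around (i, j).
                nb = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < rows and 0 <= b < cols)
                # Choose x_ij in {-1, +1} that minimizes -x_ij*(beta*nb + eta*y_ij - h).
                x[i, j] = 1 if (beta * nb + eta * y[i, j] - h) > 0 else -1
    return x

# Tiny usage example with a hypothetical noisy binary image.
rng = np.random.default_rng(0)
clean = np.ones((8, 8), dtype=int)
noisy = np.where(rng.random((8, 8)) < 0.1, -clean, clean)
print(icm_denoise(noisy))
```

This greedy scheme only finds a local minimum of the energy.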

- These graph models represent dependencies differently
- Bayesian Networks are directed acyclical graphs of conditional dependencies
- Bayesian Networks can represent some dependencies that MRF cannot represent (e.g. "head-to-head")

- These graph models represent dependencies differently
- Markov Random Fields are undirected graphs
- Markov Random Fields can include cyclical dependencies that Bayesian Networks cannot represent

- Dynamic Bayesian Networks relate variables over time steps.

- One thing we can represent with a DBN is a Markov process.
- In a *Markov process*, the probability distribution of $x_t$ depends only on $x_{t-1}$.
$$p(x_1,...,x_T) = p(x_1)\prod \limits_{t=2}^{T}p(x_t|x_{t-1})$$
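A short sketch of sampling from such a process, assuming a discrete state space with a hypothetical transition matrix $A$ and initial distribution $\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state Markov chain: pi is p(x_1), A[i, j] = p(x_t = j | x_{t-1} = i).
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def sample_chain(T):
    x = [rng.choice(2, p=pi)]
    for _ in range(T - 1):
        x.append(rng.choice(2, p=A[x[-1]]))   # next state depends only on the current one
    return x

print(sample_chain(10))
```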

- A *Hidden Markov Model* is a Markov chain of hidden variables, each with an observation conditioned on its state.
$$p(x_1,...,x_T,y_1,...,y_T) = p(x_1)\prod \limits_{t=2}^{T}p(x_t|x_{t-1})\prod \limits_{t=1}^{T}p(y_t|x_t)$$

- We can also represent a HMM with a graph showing state transitions and emissions.

(Figure: HMM state transitions and emissions. Tdunning, CC BY 3.0, Wikimedia)

- To specify a HMM with $N$ states, assuming $P(x_t|x_{t-1})$, we need the $N\times N$ matrix: $$A_{i,j} = P(x_t= j | x_{t-1} = i)$$
- $N \times (N-1)$ independent parameters, because each row of probabilities must sum to 1
- If $y_t$ can take one of $K$ values, the $K \times N$ matrix: $$B_{k,j} = P(y_t = k | x_t = j)$$
- $N \times (K-1)$ independent parameters
- Finally, the $N$ probabilities for $x_1$: $$\pi_i = P(x_1 = i)$$
- $N-1$ independent parameters
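A small sketch of this parameterization with hypothetical values for $N = 2$ hidden states and $K = 3$ observation symbols; the asserts check that each distribution sums to one, which is what removes one free parameter per row or column.

```python
import numpy as np

N, K = 2, 3

# A[i, j] = P(x_t = j | x_{t-1} = i): N x N, rows sum to 1 (N*(N-1) free parameters).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
# B[k, j] = P(y_t = k | x_t = j): K x N, columns sum to 1 (N*(K-1) free parameters).
B = np.array([[0.5, 0.1],
              [0.4, 0.3],
              [0.1, 0.6]])
# pi[i] = P(x_1 = i): N values summing to 1 (N-1 free parameters).
pi = np.array([0.6, 0.4])

assert np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=0), 1)
assert np.isclose(pi.sum(), 1)
```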

- Temporal pattern recognition
- Voice recognition, sign language, gait recognition
- Pattern recognition in sequences
- Text tagging, sequence comparison in bioinformatics, genome annotation (finding genes)

- Estimate the current hidden state from the observations so far (filtering)
- With $A, B, \pi$, the *belief state* at $t$ after observing $y_{1:t}$: $$\alpha_t(x_t) =p(x_t,y_{1:t})$$
- Computing this by enumerating all possible state sequences is too expensive
- Use the *forward algorithm*: since $y_t$ only depends on $x_t$, and $x_t$ only on $x_{t-1}$:

$$\alpha_t(x_t) = p(y_t|x_t) \sum \limits_{x_{t-1}=1}^{N} p(x_t|x_{t-1})\,\alpha_{t-1}(x_{t-1})$$

$$\alpha_1(x_1 = i) = p(y_1|x_1=i)\, \pi_i$$
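A compact numpy sketch of this recursion, using the conventions above ($A$ is $N\times N$, $B$ is $K\times N$ with $B_{k,j}=P(y_t=k\mid x_t=j)$, $\pi$ of length $N$); the example matrices are the hypothetical ones from the parameterization sketch.

```python
import numpy as np

def forward(y, A, B, pi):
    """Return alpha[t, i] = p(x_t = i, y_1..y_t) for a sequence y of symbol indices."""
    T, N = len(y), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = B[y[0]] * pi                      # alpha_1(i) = p(y_1 | x_1 = i) * pi_i
    for t in range(1, T):
        alpha[t] = B[y[t]] * (alpha[t - 1] @ A)  # p(y_t | x_t) * sum_i alpha_{t-1}(i) A[i, j]
    return alpha

# Hypothetical 2-state, 3-symbol HMM and a short observation sequence.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])   # B[k, j] = P(y = k | x = j)
pi = np.array([0.6, 0.4])
alpha = forward([0, 2, 1], A, B, pi)
print(alpha[-1].sum())   # p(y_1..y_T): marginalizing the final belief state
```

In practice the $\alpha$ values are rescaled at each step to avoid numerical underflow on long sequences.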

- Knowing $A, B, \pi$ we can also compute the most likely sequence of $x_1, ..., x_t$ given $y_1, ..., y_t$ (*Viterbi algorithm*)
- Two matrices of size $N\times T$:
- Probability of the most likely path for each state $x_j$
- Previous state of the most likely path for each state $x_j$
- These matrices can be built column by column, from the previous time step
- Then we trace back the state used at each step to get the most likely path (see the code sketch after the list below)
- Applications of the Viterbi algorithm
- Speech synthesis and recognition
- Speaker identification (diarization)
- Wi-Fi 802.11 (Viterbi decoding)
- Bioinformatics (sequence analysis)
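A minimal numpy sketch of the Viterbi recursion and trace-back, using the same conventions as the forward sketch above; probabilities are kept in log space to avoid underflow.

```python
import numpy as np

def viterbi(y, A, B, pi):
    """Return the most likely hidden state sequence for a sequence y of symbol indices."""
    T, N = len(y), len(pi)
    logp = np.zeros((T, N))             # log-probability of the best path ending in each state
    prev = np.zeros((T, N), dtype=int)  # back-pointer: previous state on that best path
    logp[0] = np.log(pi) + np.log(B[y[0]])
    for t in range(1, T):
        scores = logp[t - 1][:, None] + np.log(A)   # scores[i, j]: best path into i, then i -> j
        prev[t] = scores.argmax(axis=0)
        logp[t] = scores.max(axis=0) + np.log(B[y[t]])
    # Trace back from the best final state.
    path = [int(logp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(prev[t][path[-1]]))
    return path[::-1]

# e.g. viterbi([0, 2, 1], A, B, pi) with the matrices from the forward sketch.
```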

- Not the same thing!
- Forward algorithm:
- Probability of state at each time given current and previous observations
- Viterbi algorithm:
- Most likely sequence of states given a set of observations

- To train the model and find $(A, B, \pi)$ we use the *Baum-Welch algorithm*, which is another example of Expectation-Maximization
- Start with a guess for $(A, B, \pi)$
- The forward pass: $\alpha_i(t) = p(x_t=i,y_{1:t}|A,B,\pi)$

- The backward pass: $\beta_i(t) = p(y_{t+1:T}|x_t=i,A,B,\pi)$

- From this we can compute how to maximize the estimated likelihood function, based on previous $(A, B, \pi)$: $$\gamma_i(t) = p(x_t = i|\mathbf{y}, A, B, \pi) = \frac{\alpha_i(t)\beta_i(t)}{\sum \limits_{n=1}^{N} \alpha_n(t)\beta_n(t)}$$

$$\xi_{ij}(t) = p(x_t=i,x_{t+1}=j|\mathbf{y},A,B,\pi) =\frac{\alpha_i(t)A_{i,j}\beta_j(t+1)B_{j,y_{t+1}}} {\sum \limits_{m=1}^{N}\sum \limits_{n=1}^{N} \alpha_m(t) A_{m,n} \beta_n(t+1) B_{n,y_{t+1}}}$$

- Now we update $\pi$, the fraction of each state at the beginning: $$\pi^*_i = \gamma_i(1)$$

- $A$, the fraction of each transition w.r.t. the fraction of each state: $$A^*_{i,j} = \frac{\sum \limits_{t=1}^{T-1}\xi_{ij}(t)}{\sum \limits_{t=1}^{T-1}\gamma_i(t)}$$

- $B$, the fraction of each emission w.r.t. the fraction of each state: $$B^*_{k,j} = \frac{\sum \limits_{t=1}^{T} 1[y_t = k]\,\gamma_j(t)}{\sum \limits_{t=1}^{T}\gamma_j(t)}$$

Baum-Welch algorithm (Expectation-Maximization):
- Compute joint probability of hidden state and emissions so far at each time.
- Compute probability of future emissions given hidden state at each time.
- Compute probability hidden state at each time given all emissions.
- Compute joint probability of consecutive hidden states at each time given all emissions.
- Use this to find new parameter values
- (Does not guarantee optimum and may overfit)
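A sketch of one Baum-Welch (EM) iteration, again under the conventions used in the sketches above; for clarity the $\alpha/\beta$ recursions are unscaled, so this is only suitable for short sequences.

```python
import numpy as np

def baum_welch_step(y, A, B, pi):
    """One Baum-Welch (EM) update from an observation sequence y of symbol indices."""
    T, N = len(y), len(pi)

    # E-step: forward (alpha) and backward (beta) passes.
    alpha, beta = np.zeros((T, N)), np.zeros((T, N))
    alpha[0] = B[y[0]] * pi
    for t in range(1, T):
        alpha[t] = B[y[t]] * (alpha[t - 1] @ A)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[y[t + 1]] * beta[t + 1])

    # State posteriors gamma_i(t) and transition posteriors xi_ij(t).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.array([alpha[t][:, None] * A * (B[y[t + 1]] * beta[t + 1])[None, :]
                   for t in range(T - 1)])
    xi /= xi.sum(axis=(1, 2), keepdims=True)

    # M-step: re-estimate pi, A, B from the expected counts.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[0]):
        B_new[k] = gamma[np.asarray(y) == k].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new, pi_new
```

Iterating this update until the likelihood stops improving gives the EM training loop; as noted above, it converges only to a local optimum.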

- HMMs are generative models (they model joint probabilities), so they can be used both to generate and to classify
- To classify, we can compute the probability of the most likely state sequence under each HMM. Examples: Pfam, speech recognition, time-series analysis, ...
- To generate, we can use the joint probabilities. E.g. speech synthesis, generating values for frequency and duration.

- Graphical models: BN and MRF
- Markov process
- Hidden Markov Model
- Viterbi algorithm: most likely sequence
- Baum-Welch algorithm: EM training of HMM

- Bishop, Sections 8-8.1.3; 8.2.1; 8.3-8.3.3; 13-13.2.2