Aprendizagem Automática

Graphical Models

Ludwig Krippahl

Graphical Models

Summary

  • Graphical Models
  • Markov Processes
  • Hidden Markov Models
    • Baum-Welch and Viterbi algorithms

Graphical Models

Probabilistic Graphical Models

  • Convenient representation of probabilistic relations
    • Bayesian networks: directed acyclic graphs representing conditional probability dependencies
    • Markov Random Fields: undirected graphs representing dependencies
    • Dynamic Bayesian networks: Bayesian networks representing time series

Graphical Models

Bayesian networks

  • The joint probability of variables $a, b, c$ can be factorized:
  • $$P(a,b,c)= P(c|a,b)P(b|a)P(a)$$

Graphical Models

Bayesian networks

  • Different possible factorizations:
  • $$P(a,b,c)= P(a|b,c)P(b|c)P(c) \qquad P(a,b,c)= P(c|a,b)P(b|a)P(a)$$

Graphical Models

Bayesian networks

  • We can use a Bayesian network to represent causal knowledge:

Bayesian networks

  • The graph:

Bayesian networks

  • Each node represents one variable:

Bayesian networks

  • An arc represents a conditional dependency:

Bayesian networks

  • Parents and children (offspring) of a node:

Bayesian networks

  • We can indicate an observed variable by shading the corresponding node

Bayesian networks

  • Or parameters as special nodes

Bayesian networks

$$P(\mathbf{x}) = \prod \limits_{n=1}^{N} P\left(x_n|\text{parents}(x_n)\right)$$
  • Product of the probabilities of all variables, each conditioned on the joint state of its parents (e.g. Bishop, p. 362)

$$p(x_1,...,x_7)=p(x_1)p(x_2)p(x_3)p(x_4|x_1,x_2,x_3)p(x_5|x_1,x_3)p(x_6|x_4)p(x_7|x_4,x_5)$$
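
A minimal sketch of this kind of factorization for the three-variable network $P(a,b,c)=P(a)P(b|a)P(c|a,b)$; the probability tables below are made up for illustration.

```python
# A minimal sketch of the factorization P(a,b,c) = P(a) P(b|a) P(c|a,b)
# for binary variables; the probability tables below are made up.
from itertools import product

P_a = {True: 0.3, False: 0.7}
P_b_given_a = {True: {True: 0.8, False: 0.2},              # P(b | a)
               False: {True: 0.1, False: 0.9}}
P_c_given_ab = {(True, True): {True: 0.9, False: 0.1},     # P(c | a, b)
                (True, False): {True: 0.5, False: 0.5},
                (False, True): {True: 0.4, False: 0.6},
                (False, False): {True: 0.05, False: 0.95}}

def joint(a, b, c):
    """Joint probability given by the factorization defined by the graph."""
    return P_a[a] * P_b_given_a[a][b] * P_c_given_ab[(a, b)][c]

# The factorized joint distribution sums to 1 over all assignments
print(sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3)))  # 1.0 (up to rounding)
```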

Bayesian networks

Conditional independence in BN

  • Two variables are conditionally independent given a third iff:
  • $$P(a,b|c)= P(a|c)P(b|c)$$
  • This tells us which variables are relevant for a variable or a set of variables given some information
  • Also written as: $$a \perp\!\!\!\perp b \ | \ c$$

Bayesian networks

Conditional independence in BN

  • "Tail to tail" on $c$; $a,b$ not independent

  • $$\begin{aligned} P(a,b,c)&=P(a|c)P(b|c)P(c)\\ \\ P(a,b)&=\sum\limits_{c}P(a|c)P(b|c)P(c)\\ &\neq P(a)P(b) \end{aligned}$$
    • $a$ and $b$ are not independent because both depend on $c$

Bayesian networks

Conditional independence in BN

  • $a,b$ conditionally independent given $c$

  • $$\begin{aligned} P(a,b|c)&=\displaystyle\frac{P(a,b,c)}{P(c)}\\ &=\displaystyle\frac{P(a|c)P(b|c)P(c)}{P(c)}\\ &=P(a|c)P(b|c) \end{aligned}$$
  • Knowing $c$, $a$ and $b$ do not provide information about each other

Bayesian networks

Conditional independence in BN

  • Naïve Bayes classifier $$P(c,x_1,...,x_n) = P(c)P(x_1|c)...P(x_n|c)$$
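
A minimal sketch of how this factorization is used for classification: compute $P(c)\prod_i P(x_i|c)$ for each class and normalize. The class names and probability tables are hypothetical.

```python
# A minimal sketch of classification with the naive Bayes factorization
# P(c, x_1, ..., x_n) = P(c) * prod_i P(x_i|c); class names and tables are made up.
priors = {"spam": 0.4, "ham": 0.6}           # P(c)
likelihoods = {                              # P(x_i = 1 | c) for three binary features
    "spam": [0.8, 0.6, 0.1],
    "ham":  [0.2, 0.3, 0.4],
}

def posterior(x):
    """P(c | x): compute the joint P(c, x) for each class and normalize."""
    joint = {}
    for c in priors:
        p = priors[c]
        for xi, p1 in zip(x, likelihoods[c]):
            p *= p1 if xi == 1 else 1.0 - p1
        joint[c] = p
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

print(posterior([1, 1, 0]))   # most of the probability goes to "spam"
```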

Bayesian networks

Conditional independence in BN

  • "Head to tail" on $c$; $a,b$ not independent ($c$ conveys information)
  • $a,b$ conditionally independent given $c$
  • $$ P(a,b|c)= \frac{P(a)P(c|a)P(b|c)}{P(c)} =\displaystyle\frac{P(c|a)P(a)}{P(c)}P(b|c) = P(a|c)P(b|c) $$
  • A known $c$ "blocks" information flow from $a$ to $b$

Bayesian networks

Conditional independence in BN

  • "Head to head" on $c$; $a,b$ independent

  • $$\begin{aligned} P(a,b,c)&=P(a)P(b)P(c|a,b) \\ \\ P(a,b)&=P(a)P(b)\sum\limits_{c}P(c|a,b) \\ &= P(a)P(b) \end{aligned}$$
  • $c$ being dependent implies no correlation between $a$ and $b$

Bayesian networks

Conditional independence in BN

  • $a,b$ conditionally dependent given $c$
    $$\begin{aligned} \\ P(a,b|c)&=\displaystyle\frac{P(a,b,c)}{P(c)}\\ &=\displaystyle\frac{P(a)P(b)P(c|a,b)}{P(c)}\\ &\neq P(a|c)P(b|c) \end{aligned}$$
  • Observing $c$ can make $a$ and $b$ dependent ("explaining away")
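
A quick numerical check of the head-to-head case, with made-up probability tables: $a$ and $b$ come out independent when $c$ is not observed, but dependent once we condition on $c$.

```python
# A numerical check of the head-to-head case, with made-up tables:
# P(a,b,c) = P(a) P(b) P(c|a,b). a and b are marginally independent,
# but become dependent once c is observed ("explaining away").
from itertools import product

P_a = {1: 0.3, 0: 0.7}
P_b = {1: 0.4, 0: 0.6}
P_c1_given_ab = {(1, 1): 0.99, (1, 0): 0.7, (0, 1): 0.7, (0, 0): 0.01}  # P(c=1|a,b)

def joint(a, b, c):
    pc1 = P_c1_given_ab[(a, b)]
    return P_a[a] * P_b[b] * (pc1 if c == 1 else 1.0 - pc1)

# Marginally, P(a=1,b=1) = P(a=1) P(b=1)
print(sum(joint(1, 1, c) for c in (0, 1)), P_a[1] * P_b[1])        # equal

# Given c=1, P(a=1,b=1|c=1) != P(a=1|c=1) P(b=1|c=1)
p_c1 = sum(joint(a, b, 1) for a, b in product((0, 1), repeat=2))
p_ab = joint(1, 1, 1) / p_c1
p_a = sum(joint(1, b, 1) for b in (0, 1)) / p_c1
p_b = sum(joint(a, 1, 1) for a in (0, 1)) / p_c1
print(p_ab, p_a * p_b)                                             # different
```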

Bayesian networks

Conditional independence in BN

  • $a,b$ independent
  • Cable or router problems are independent

Bayesian networks

Conditional independence in BN

  • $a,b$ conditionally dependent given $c$

  • But if we observe that there is no access, then the state of one changes the probability of the other

Bayesian networks

Conditional independence in BN

  • Dependence-separation (d-separation) rules:
  • Sets $A$ and $B$ are d-separated by $C$ if every path between $A$ and $B$ contains either:
    • A node $d \in C$ from which an arrow on the path comes out
      • "head-to-tail" or "tail-to-tail": knowing $d$ makes the adjacent nodes independent
    • A node $d$ that is "head-to-head" on the path, with neither $d$ nor any of its descendants in $C$
      • Knowing $d$ (or a descendant) would make the adjacent nodes dependent

Bayesian networks

Conditional independence in BN

  • Dependence-separation (d-separation) allows assessing which evidence is relevant to which factor by graph analysis.
  • Example: $$\text{blood sugar} \rightarrow \text{stomach acidity} \rightarrow \text{hunger}$$
  • If this model is correct, then the correlation between blood sugar levels and hunger should disappear if we account for stomach acidity levels.

Bayesian networks

Conditional independence in BN

  • Dependence-separation (d-separation) allows assessing which evidence is relevant to which factor by graph analysis.
  • Example: $$\text{hunger} \leftarrow \text{blood sugar} \rightarrow \text{stomach acidity}$$
  • If this model is correct, then the correlation between blood sugar levels and hunger should not disappear if we account for stomach acidity levels.

Markov Random Fields

  • In directed graphical models, dependence separation is more complex (head-to-tail, tail-to-tail, head-to-head)
  • An undirected graph can represent conditional independence more directly

Markov Random Fields

  • Undirected graph $G=(\mathbf{N},\mathbf{A})$
  • Random variables $\mathbf{X}=\{x_n\} \ , n\in \mathbf{N}$
  • Non-adjacent variables are independent given all the rest $$x_u \perp\!\!\!\perp x_v \ | X_{\mathbf{N}\backslash\{u,v\}} \ , \ \{u,v\} \not\in \mathbf{A}$$
  • Each variable is independent of the others given its neighbours $$x_v \perp\!\!\!\perp X_{\mathbf{N}\backslash cl(v)} \ | X_{ne(v)}$$
  • $\mathbf{U} \perp\!\!\!\perp \mathbf{V}\ | \mathbf{S}$ if all paths between $\mathbf{U}$ and $\mathbf{V}$ go through $\mathbf{S}$

Markov Random Fields

  • Example (Bishop, p.384) $$\mathbf{A} \perp\!\!\!\perp \mathbf{B}\ | \mathbf{C}$$

Markov Random Fields

  • We can represent conditional independence in this way if the joint distribution is the product of functions of the cliques of the graph.
  • A clique is a set of nodes that are all interconnected
  • A maximal clique is a clique to which no nodes can be added and still remain a clique. We can write the joint distribution as a product of functions of maximal cliques.

Markov Random Fields

  • The joint probability is: $$P(\mathbf{X}) = \frac{1}{Z}\prod \limits_C \psi_C(\mathbf{x}_C)$$
  • Where $\psi_C$ is a potential function for each (maximal) clique and $Z$ is a normalization constant $$Z =\sum \limits_{x}\prod \limits_C \psi_C(\mathbf{x}_C)$$
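
A minimal sketch of this construction for a tiny chain-shaped MRF ($a - b - c$), with a made-up pairwise potential; $Z$ is computed by brute-force enumeration.

```python
# A minimal sketch of an MRF joint distribution as a normalized product of clique
# potentials, for a made-up chain a - b - c with maximal cliques {a,b} and {b,c}.
from itertools import product
import math

values = (-1, 1)                      # binary variables

def psi(u, v, coupling=1.0):
    """Pairwise potential that favours equal neighbouring values."""
    return math.exp(coupling * u * v)

def unnormalized(a, b, c):
    return psi(a, b) * psi(b, c)      # product over the maximal cliques

# Normalization constant Z by brute-force enumeration (feasible only for tiny graphs)
Z = sum(unnormalized(a, b, c) for a, b, c in product(values, repeat=3))

def P(a, b, c):
    return unnormalized(a, b, c) / Z

print(P(1, 1, 1), P(1, -1, 1))        # aligned configurations are more probable
print(sum(P(a, b, c) for a, b, c in product(values, repeat=3)))  # 1.0 (up to rounding)
```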

Markov Random Fields

  • Example: image denoising (Bishop, p. 388)

Markov Random Fields


  • Observed noisy pixels $y_i$ and unknown noise-free pixels $x_i$: $$x_i, y_i \in \{-1,1\}$$

Markov Random Fields

  • "Energy" function $E(\mathbf{x},\mathbf{y})=$
  • $x_i,x_j$ pairs: $-\beta \sum x_i x_j$
  • $x_i,y_i$ pairs: $-\eta \sum x_i y_i$
  • $x$ bias: $ + h \sum x_i$


  • $$P(\mathbf{x},\mathbf{y})=\frac{1}{Z} e^{-E(\mathbf{x},\mathbf{y})}$$

Markov Random Fields

  • Find $\mathbf{x}$ that maximizes $\frac{1}{Z}e^{-E(\mathbf{x},\mathbf{y})}$
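
Exact maximization is intractable in general; the sketch below uses a simple greedy coordinate-wise search (in the spirit of iterated conditional modes), flipping one pixel at a time to lower the energy above. The image and the parameter values are made up.

```python
# A sketch of greedy coordinate-wise minimization of the energy above (in the
# spirit of iterated conditional modes): flip each pixel to the value that lowers
# E(x, y), sweeping the image a few times. Image and parameters are made up.
import numpy as np

rng = np.random.default_rng(0)
clean = np.ones((20, 20), dtype=int)
clean[5:15, 5:15] = -1                                        # simple square pattern
y = np.where(rng.random(clean.shape) < 0.1, -clean, clean)    # flip ~10% of pixels

beta, eta, h = 1.0, 2.0, 0.0
x = y.copy()                                                  # start from the noisy image

def neighbour_sum(x, i, j):
    s = 0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < x.shape[0] and 0 <= nj < x.shape[1]:
            s += x[ni, nj]
    return s

for _ in range(5):                                            # a few sweeps over all pixels
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # terms of E involving x[i,j] are -x[i,j]*(beta*sum_neighbours + eta*y - h),
            # so the energy is lowest when x[i,j] takes the sign of that expression
            x[i, j] = 1 if beta * neighbour_sum(x, i, j) + eta * y[i, j] - h > 0 else -1

print("wrong pixels before:", int((y != clean).sum()), "after:", int((x != clean).sum()))
```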

Graphical Models

Bayesian Networks vs Markov Random Fields

  • These graph models represent dependencies differently
  • Bayesian Networks are directed acyclic graphs of conditional dependencies
    • Bayesian Networks can represent some dependencies that MRF cannot represent (e.g. "head-to-head")

Graphical Models

Bayesian Networks vs Markov Random Fields

  • These graph models represent dependencies differently
  • Markov Random Fields are undirected graphs
    • Markov Random Fields can include cyclic dependencies that Bayesian Networks cannot represent

Dynamic Bayesian Networks

  • Dynamic Bayesian Networks relate variables over time steps.

DBN and Markov Process

  • One thing we can represent with a DBN is a Markov process.
  • In a Markov process, the probability distribution of $x_t$ depends only on $x_{t-1}$.
  • $$p(x_1,...,x_T) = p(x_1)\prod \limits_{t=2}^{T}p(x_t|x_{t-1})$$
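
A minimal sketch of sampling a sequence from such a Markov process, with a made-up 2-state transition matrix.

```python
# A minimal sketch of sampling a sequence from a first-order Markov process,
# with a made-up 2-state transition matrix.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])            # P(x_1)
A = np.array([[0.9, 0.1],            # A[i, j] = P(x_t = j | x_{t-1} = i)
              [0.2, 0.8]])

T = 20
x = [rng.choice(2, p=pi)]            # sample x_1 from pi
for t in range(1, T):
    x.append(rng.choice(2, p=A[x[-1]]))   # sample x_t from the row of x_{t-1}
print(x)
```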

Aprendizagem Automática

Hidden Markov Models

HMM

Hidden Markov Models

  • A Hidden Markov Model is a Markov chain of hidden variables each with an observation conditioned on its state.
  • $$p(x_1,...,x_T,y_1,...,y_T) = p(x_1)\prod \limits_{t=2}^{T}p(x_t|x_{t-1})\prod \limits_{t=1}^{T}p(y_t|x_t)$$
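
A minimal sketch that evaluates this joint probability for given state and observation sequences; the 2-state, 2-symbol parameters are made up.

```python
# A minimal sketch evaluating the HMM joint probability above for given state and
# observation sequences; the 2-state, 2-symbol parameters are made up.
import numpy as np

pi = np.array([0.6, 0.4])            # P(x_1 = i)
A = np.array([[0.7, 0.3],            # A[i, j] = P(x_t = j | x_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],            # B[i, k] = P(y_t = k | x_t = i)
              [0.2, 0.8]])

def joint_prob(x, y):
    """p(x_1..T, y_1..T) = p(x_1) p(y_1|x_1) * prod_t p(x_t|x_{t-1}) p(y_t|x_t)."""
    p = pi[x[0]] * B[x[0], y[0]]
    for t in range(1, len(x)):
        p *= A[x[t - 1], x[t]] * B[x[t], y[t]]
    return p

print(joint_prob(x=[0, 0, 1, 1], y=[0, 0, 1, 1]))
```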

HMM

  • We can also represent a HMM with a graph showing state transitions and emissions.
  • Tdunning CC BY 3.0, Wikimedia

HMM

Hidden Markov Model parameters

  • To specify a HMM with $N$ states, assuming $P(x_t|x_{t-1})$ is the same for all $t$, we need the $N\times N$ matrix: $$A_{i,j} = P(x_t= j | x_{t-1} = i)$$
    • $N \times (N-1)$ independent parameters, because they are probabilities
  • If $y_t$ can take one of $K$ values, the $N \times K$ matrix: $$B_{j,k} = P(y_t = k | x_t = j)$$
    • $N \times (K-1)$ independent parameters
  • Finally, the $N$ probabilities for $x_1$: $$\pi_i = P(x_1 = i)$$
    • $N-1$ independent parameters

HMM

Uses for Hidden Markov Models

  • Temporal pattern recognition
    • Voice recognition, sign language, gait recognition
  • Pattern recognition in sequences
    • Text tagging, sequence comparison in bioinformatics, genome annotation (finding genes)

HMM

  • Estimate the current state from the observations so far (filtering)
    • With $A, B, \pi$, the belief state at $t$ after observing $y_{1:t}$: $$\alpha_t(x_t) =p(x_t,y_{1:t})$$
    • Summing over all possible state sequences is too expensive
    • Use the forward algorithm:
  • Since $y_t$ depends only on $x_t$, and $x_t$ only on $x_{t-1}$:
    $\alpha_t(x_t) = p(y_t|x_t) \sum \limits_{x_{t-1}=1}^{N} p(x_t|x_{t-1})\alpha_{t-1}(x_{t-1})$
    $\alpha_1(x_1=i) = p(y_1|x_1=i) \pi_i$
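
A minimal sketch of the forward algorithm implementing these recursions; the parameters and the observation sequence are made up.

```python
# A minimal sketch of the forward algorithm using the recursions above;
# parameters (pi, A, B) and the observation sequence are made up.
import numpy as np

pi = np.array([0.6, 0.4])            # P(x_1 = i)
A = np.array([[0.7, 0.3],            # A[i, j] = P(x_t = j | x_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],            # B[i, k] = P(y_t = k | x_t = i)
              [0.2, 0.8]])

def forward(y, pi, A, B):
    """alpha[t, i] = p(x_t = i, y_1..t)."""
    T, N = len(y), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, y[0]]                        # initialization
    for t in range(1, T):
        alpha[t] = B[:, y[t]] * (alpha[t - 1] @ A)    # recursion over t
    return alpha

alpha = forward([0, 0, 1, 1], pi, A, B)
print(alpha[-1].sum())               # p(y_1..T), the likelihood of the observations
print(alpha[-1] / alpha[-1].sum())   # filtered belief p(x_T | y_1..T)
```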

HMM

  • Knowing $A, B, \pi$ we can also compute the most likely sequence $x_1, ..., x_t$ given $y_1, ..., y_t$ (Viterbi algorithm; a sketch follows after the applications below)
    • Two matrices of $N\times T$:
      • Probability of the most likely path ending in each state $x_j$
      • Previous state on the most likely path ending in each state $x_j$
    • These matrices are filled in order, each time step from the previous one
    • Then we trace back the state chosen at each step to get the most likely path
  • Applications of the Viterbi algorithm
    • Speech synthesis and recognition
    • Speaker identification (diarization)
    • Wi-Fi 802.11 (Viterbi decoding)
    • Bioinformatics (sequence analysis)
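
As mentioned above, a minimal sketch of the Viterbi algorithm: fill the two $N\times T$ tables, then trace back. The parameters and observations are made up.

```python
# A minimal sketch of the Viterbi algorithm: fill the two N x T tables (best path
# probability and previous state), then trace back. Parameters and observations
# are made up.
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],            # A[i, j] = P(x_t = j | x_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],            # B[i, k] = P(y_t = k | x_t = i)
              [0.2, 0.8]])

def viterbi(y, pi, A, B):
    T, N = len(y), len(pi)
    best = np.zeros((T, N))              # probability of the best path ending in each state
    prev = np.zeros((T, N), dtype=int)   # previous state on that path
    best[0] = pi * B[:, y[0]]
    for t in range(1, T):
        scores = best[t - 1][:, None] * A        # scores[i, j]: come from i, move to j
        prev[t] = scores.argmax(axis=0)
        best[t] = scores.max(axis=0) * B[:, y[t]]
    path = [int(best[-1].argmax())]              # trace back from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(prev[t, path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1, 1, 0], pi, A, B))        # [0, 0, 1, 1, 0]
```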

HMM

Note: forward algorithm and Viterbi algorithm

  • Not the same thing!
  • Forward algorithm:
    • Probability of state at each time given current and previous observations
  • Viterbi algorithm:
    • Most likely sequence of states given a set of observations

HMM

  • To train the model and find $(A, B, \pi)$ we use the Baum-Welch algorithm, which is another example of Expectation-Maximization
    • Start with a guess for $(A, B, \pi)$
    • The forward pass: $\alpha_i(t) = p(x_t=i,y_{1:t}|A,B,\pi)$
$$ \alpha_i(1) = \pi_i B_{i,y_1} \qquad \alpha_i(t) = B_{i,y_t} \sum \limits_{n=1}^{N} \alpha_n(t-1)A_{n,i}$$
    • The backward pass: $\beta_i(t) = p(y_{t+1:T}|x_t=i,A,B,\pi)$
$$\beta_i(T) = 1 \qquad \beta_i(t) = \sum \limits_{n=1}^{N} \beta_n(t+1)A_{i,n} B_{n,y_{t+1}}$$

HMM

Expectation step

  • From these we can compute the state and transition posteriors used to maximize the estimated likelihood, based on the previous $(A, B, \pi)$:
  • $$\gamma_i(t) = p(x_t = i|\mathbf{y}, A, B, \pi) = \frac{\alpha_i(t)\beta_i(t)}{\sum \limits_{n=1}^{N} \alpha_n(t)\beta_n(t)}$$
    $$\xi_{ij}(t) = p(x_t=i,x_{t+1}=j|\mathbf{y},A,B,\pi) =\frac{\alpha_i(t)A_{i,j}\beta_j(t+1)B_{j,y_{t+1}}} {\sum \limits_{m=1}^{N}\sum \limits_{n=1}^{N} \alpha_m(t) A_{m,n} \beta_n(t+1) B_{n,y_{t+1}}}$$

HMM

Maximization step

  • Now we update $\pi$, fraction of each state at beginning:
$$\pi_i = \gamma_i(1) $$
  • $A$, fraction of each transition w.r.t. fraction of each state:
$$A_{i,j} = \frac{\sum_{t=1}^{T-1} \xi_{i,j}(t)}{\sum_{t=1}^{T-1} \gamma_{i}(t)}$$
  • $B$, fraction of each emission w.r.t. fraction of each state:
$$B_{i,k}= \frac{\sum_{t=1}^{T} 1_{y_t = k}\,\gamma_i(t)}{\sum_{t=1}^{T} \gamma_{i}(t)}$$
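
A minimal sketch of one Baum-Welch iteration following the formulas above (forward and backward passes, then the E and M steps); the initial parameters and the observation sequence are made up.

```python
# A minimal sketch of one Baum-Welch (EM) iteration following the formulas above;
# the initial parameters and the observation sequence are made up.
import numpy as np

y = np.array([0, 0, 1, 1, 0, 1])
N, K, T = 2, 2, len(y)

pi = np.array([0.5, 0.5])
A = np.array([[0.6, 0.4],            # A[i, j] = P(x_t = j | x_{t-1} = i)
              [0.5, 0.5]])
B = np.array([[0.7, 0.3],            # B[i, k] = P(y_t = k | x_t = i)
              [0.4, 0.6]])

# Forward pass: alpha[t, i] = p(x_t = i, y_1..t)
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, y[0]]
for t in range(1, T):
    alpha[t] = B[:, y[t]] * (alpha[t - 1] @ A)

# Backward pass: beta[t, i] = p(y_t+1..T | x_t = i)
beta = np.ones((T, N))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (beta[t + 1] * B[:, y[t + 1]])

# E-step: state posteriors gamma and transition posteriors xi
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)                 # gamma[t, i] = p(x_t = i | y)
xi = np.zeros((T - 1, N, N))
for t in range(T - 1):
    xi[t] = alpha[t][:, None] * A * (beta[t + 1] * B[:, y[t + 1]])[None, :]
    xi[t] /= xi[t].sum()                                  # xi[t, i, j] = p(x_t=i, x_t+1=j | y)

# M-step: re-estimate the parameters
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = np.array([[gamma[y == k, i].sum() for k in range(K)] for i in range(N)])
B_new /= gamma.sum(axis=0)[:, None]

print(pi_new, A_new, B_new, sep="\n")
```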

HMM

  • Baum-Welch algorithm (Expectation-Maximization)
    • Compute joint probability of hidden state and emissions so far at each time.
    • Compute probability of future emissions given hidden state at each time.
    • Compute the probability of each hidden state at each time given all emissions.
    • Compute the joint probability of consecutive hidden states at each time given all emissions.
    • Use these to find new parameter values
  • (Does not guarantee a global optimum and may overfit)

HMM

  • HMMs are generative models (they model joint probabilities), so they can be used both to generate and to classify
  • To classify, we can compute the probability of the most likely state sequence under each HMM. Examples: Pfam, speech recognition, time-series analysis, ...
  • To generate, we can sample from the joint probabilities. E.g. speech synthesis, generating values for frequency and duration.

Aprendizagem Automática

Summary

Summary

  • Graphical models: BN and MRF
  • Markov process
  • Hidden Markov Model
  • Viterbi algorithm: most likely sequence
  • Baum-Welch algorithm: EM training of HMM

Further reading

  • Bishop, Sections 8-8.1.3; 8.2.1; 8.3-8.3.3; 13-13.2.2