Aprendizagem Automática

Probabilistic clustering

Ludwig Krippahl

Probabilistic clustering

Summary

  • Fuzzy sets and clustering
  • Fuzzy c-means
  • Probabilistic Clustering: mixture models
  • Expectation-Maximization, more formally
  • External indexes: Rand index
  • Assignment 2

Aprendizagem Automática

Fuzzy sets

Fuzzy sets

  • In conventional set theory, elements either belong or don't belong to a set
  • In fuzzy set theory, each element has a membership value $u_S(\mathbf{x}) \in [0,1]$ indicating its degree of membership in the set $S$
  • More formally, a fuzzy set $S$ is a set of ordered pairs
  • $$S = \{(\mathbf{x},u_S(\mathbf{x}))|\mathbf{x} \in X \}$$
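
As an illustration (my own example, not from the slides), a membership function for the imprecise property "tall" could be a simple ramp between two arbitrary height thresholds:

    # Illustrative membership function for the fuzzy set "tall" (heights in cm).
    # The 160 cm and 190 cm thresholds are arbitrary choices for this example.
    def u_tall(height_cm):
        if height_cm <= 160:
            return 0.0                        # definitely not tall
        if height_cm >= 190:
            return 1.0                        # definitely tall
        return (height_cm - 160) / 30.0       # partial membership in between

    print(u_tall(150), u_tall(175), u_tall(195))  # 0.0 0.5 1.0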

Fuzzy sets

  • Fuzzy sets allow modelling uncertainty
  • Linguistic and conceptual uncertainty (old, large, important, ...)
    Source: Wikimedia, CC BY-SA 3.0, fullofstars

Fuzzy sets

  • Fuzzy sets allow modelling uncertainty
  • Linguistic and conceptual uncertainty (old, large, important, ...)
  • Informational uncertainty (credit, security risks, ...)
  • Note: fuzzy membership is not probability. It's a measure of similarity to some imprecise properties

Fuzzy c-partition

  • $\mathbf{U}(\mathbf{X})$ is a fuzzy c-partition of $X$ if:

$$0 \leq u_k(\mathbf{x}_n) \leq 1 \qquad \forall k, n$$
$$\sum \limits_{k=1}^{c} u_k(\mathbf{x}_n) = 1 \qquad \forall n$$
$$0\leq \sum \limits_{n=1}^{N} u_k(\mathbf{x}_n) \leq N \qquad \forall k$$

Aprendizagem Automática

Fuzzy c-Means

Fuzzy c-Means

  • From set $X$ of $N$ unlabelled data points
  • Return the $c \times N$ membership matrix with
    $u_k(\mathbf{x}_n)$, defining a fuzzy $c$-partition of $X$
  • Return the $\{C_1, ..., C_c\}$ centroids
  • Minimizing the squared error function: $$J_m(X,C) = \sum\limits_{k=1}^{c} \sum\limits_{n=1}^{N} u_k(\mathbf{x}_n)^m \|\mathbf{x}_n-\mathbf{c}_k\|^2 \qquad m \geq 1$$

Fuzzy c-Means

  • Minimizing the squared error function: $$J_m(X,C) = \sum\limits_{k=1}^{c} \sum\limits_{n=1}^{N} u_k(\mathbf{x}_n)^m \|\mathbf{x}_n-\mathbf{c}_k\|^2 \qquad m \geq 1$$
  • Subject to: $$\sum \limits_{k=1}^{c} u_k(\mathbf{x}_n) = 1 \qquad \forall n$$
  • and where $m$ is the degree of fuzzification; typically, $m=2$

Fuzzy c-Means

  • The derivatives w.r.t. $u_k(\mathbf{x}_n)$ are zero at: $$u_k(\mathbf{x}_n)= \frac{\left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_k\|^2}\right)^{\frac{1}{m-1}}} {\sum \limits_{j=1}^{c} \left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_j\|^2}\right)^{\frac{1}{m-1}}}$$

Fuzzy c-Means

  • The derivatives w.r.t. $c_k$ are zero at:
  • $$c_k = \frac { \sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m\mathbf{x}_n} {\sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m}$$
  • That is, each centroid $c_k$ is the weighted mean of the example vectors using the membership values.

Fuzzy c-Means algorithm

  • Random initial centroids $\{C_1, ..., C_c\}$
  • Compute $u$:
  • $$u_k(\mathbf{x}_n)= \frac{\left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_k\|^2}\right)^{\frac{1}{m-1}}} {\sum \limits_{j=1}^{c} \left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_j\|^2}\right)^{\frac{1}{m-1}}}$$
  • Recompute $c_k$:
  • $$c_k = \frac { \sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m\mathbf{x}_n} {\sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m}$$
  • Repeat until $t$ iterations or change $<\epsilon$ (a sketch of this loop follows below)
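
A minimal NumPy sketch of this loop (an illustration following the formulas above, not the course's reference implementation; the initialization and stopping test are simplified choices):

    import numpy as np

    def fuzzy_c_means(X, c, m=2, max_iter=100, eps=1e-5, seed=None):
        """Fuzzy c-means: X is (N, d); returns centroids (c, d) and memberships U (c, N)."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), c, replace=False)]   # random initial centroids
        for _ in range(max_iter):
            # squared distances from every centroid to every point, shape (c, N)
            d2 = ((X[None, :, :] - centroids[:, None, :]) ** 2).sum(axis=2) + 1e-12
            # membership update: u_k(x_n) proportional to (1 / ||x_n - c_k||^2)^(1/(m-1))
            inv = (1.0 / d2) ** (1.0 / (m - 1))
            U = inv / inv.sum(axis=0, keepdims=True)
            # centroid update: weighted mean of the points, with weights u^m
            W = U ** m
            new_centroids = (W @ X) / W.sum(axis=1, keepdims=True)
            if np.linalg.norm(new_centroids - centroids) < eps:
                centroids = new_centroids
                break
            centroids = new_centroids
        return centroids, U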

Fuzzy c-Means algorithm

Fuzzy c-Means algorithm

Defuzzification

  • If we want a crisp clustering, we need to convert the fuzzy memberships into values in $\{0,1\}$, for example by:
  • Maximum membership: assign each point to the cluster in which it has the highest membership
  • Nearest centroid: assign each point to the cluster with the closest centroid
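
For instance, a tiny sketch of maximum-membership defuzzification (the membership matrix below is invented, purely for illustration):

    import numpy as np

    # U: c x N membership matrix (two clusters, three points, invented values)
    U = np.array([[0.8, 0.3, 0.5],
                  [0.2, 0.7, 0.5]])

    # Maximum membership: assign each point to the cluster where its membership is largest
    labels = np.argmax(U, axis=0)
    print(labels)   # [0 1 0] (ties go to the first cluster)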

Aprendizagem Automática

Probabilistic Clustering

Probabilistic Clustering

  • Suppose we have a probabilistic model for the distribution of examples $X$: $$P(X,Y) = P(X|Y) P(Y)$$
  • This would not only provide a way of clustering $X$ over $Y$ but also a way to generate examples from that distribution.
  • The problem is that we do not know $Y$

Probabilistic Clustering

  • Recall, from EM:
  • Set $X$ of observed data
  • Set $Z$ of missing data or latent variables
  • Unknown parameters $\theta$
  • Likelihood function $L(\theta;X,Z) = p(X,Z|\theta)$

Probabilistic Clustering

  • Let us assume Gaussian distributions. In 1D: $$\mathcal{N}(x|\mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{- \frac{(x-\mu)^2}{2\sigma^2}}$$
    Source: Wikimedia, public domain

Probabilistic Clustering

  • We can create a mixture of gaussians by adding the different distributions with weights

    Source: Wikimedia, public domain

Probabilistic Clustering

  • For multivariate Gaussian distributions:
  • $$\mathcal{N}(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} e^{- \frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$$
    • Where $\mu$ is the vector of means, $\Sigma$ the covariance matrix and $D$ the dimension of $x$
  • A mixture model of $K$ Gaussians is:
  • $$p(x) = \sum \limits_{k=1}^{K} \pi_k \mathcal{N}(x|\mu_k, \Sigma_k)$$
  • where $\pi_k$ is the mixing coefficient, such that
  • $$\sum \limits_{k=1}^{K} \pi_k = 1 \qquad 0 \leq \pi_k \leq 1 \ \forall k$$
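
A small sketch of evaluating such a mixture density with SciPy (the means, covariances and mixing coefficients below are invented, purely for illustration):

    import numpy as np
    from scipy.stats import multivariate_normal

    # Invented 2D mixture with K = 2 components; the mixing coefficients sum to 1
    pis = [0.6, 0.4]
    mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

    def mixture_pdf(x):
        # p(x) = sum_k pi_k N(x | mu_k, Sigma_k)
        return sum(p * multivariate_normal.pdf(x, mean=m, cov=s)
                   for p, m, s in zip(pis, mus, sigmas))

    print(mixture_pdf(np.array([1.0, 1.0])))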

Probabilistic Clustering

  • If, for each $\mathbf{x}$, we create a variable $\mathbf{z}$ such that: $$z_k \in \{0,1\} \qquad \sum_{k}z_k = 1$$
  • And we define the distribution:
  • $$p(\mathbf{x},\mathbf{z}) = p(\mathbf{z}) p(\mathbf{x}|\mathbf{z})$$
    • ($z_k$ are the latent variables assigning each example to each cluster)
  • The marginal probability $p(z_k =1 ) = \pi_k$
  • And $p(\mathbf{x}|z_k =1 ) = \mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)$
  • Which brings us back to:
  • $$p(\mathbf{x}) = \sum_z p(\mathbf{z}) p(\mathbf{x}|\mathbf{z}) = \sum \limits_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)$$
  • but now explicitly with the latent variables

Probabilistic Clustering

  • Using Bayes' rule and the previous equations, we get the posterior probability that component $k$ is responsible for $\mathbf{x}_n$ (we'll call it $\gamma(z_{nk})$):
  • $$p(z_k=1|\mathbf{x}) = \frac{p(z_k=1) p(\mathbf{x}|z_k=1)} {\sum \limits_{j=1}^{K} p(z_j=1) p(\mathbf{x}|z_j=1)} = \frac{\pi_k \mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)} {\sum \limits_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}|\mu_j, \Sigma_j)}=\gamma(z_{nk}) $$
  • Since we want to maximize the log-likelihood of our parameters:
  • $$\ln p(\mathbf{X}|\pi, \mu, \Sigma) = \sum \limits_{n=1}^{N}\ln \left(\sum \limits_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k) \right)$$
  • We want to find the maximum, where the derivative is 0

Probabilistic Clustering

  • Setting to zero the derivative w.r.t. $\mathbf{\mu}_k$:
  • $$0 = - \sum \limits_{n=1}^{N} \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)} {\sum \limits_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_n|\mu_j, \Sigma_j)} \Sigma_k^{-1}(\mathbf{x}_n - \mathbf{\mu}_k)$$ Multiplying by $\Sigma_k$ and solving for $\mu_k$: $$\mu_k = \frac{1}{N_k} \sum \limits_{n=1}^{N} \gamma(z_{nk})\mathbf{x}_n$$
  • Where $N_k$ is the effective number of points in component $k$:
  • $$N_k = \sum \limits_{n=1}^{N}\gamma(z_{nk})$$

Probabilistic Clustering

  • Setting to zero the derivative w.r.t. $\mathbf{\mu}_k$:
  • $$\mu_k = \frac{1}{N_k} \sum \limits_{n=1}^{N} \gamma(z_{nk})\mathbf{x}_n$$
  • Doing the same for $\Sigma_k$: $$\Sigma_k = \frac{1}{N_k} \sum \limits_{n=1}^{N} \gamma(z_{nk})(\mathbf{x}_n-\mu_k) (\mathbf{x}_n-\mu_k)^T$$
  • That is, the covariance of the data around $\mu_k$, weighted by the responsibilities $\gamma(z_{nk})$
  • Finally, the $\pi_k$:
  • $$\pi_k = \frac{N_k}{N}$$

Probabilistic Clustering

  • Now we do EM. First, expectation, computing the posterior probabilities $\gamma(z_{nk})$ from an initial guess of the parameters $\pi, \mu, \Sigma$:
  • $$\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)} {\sum \limits_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}|\mu_j, \Sigma_j)}$$
  • Then we use the posterior $\gamma(z_{nk})$ in the likelihood function and maximize w.r.t. $\pi, \mu, \Sigma$:
  • $$\pi_k = \frac{N_k}{N} \qquad \mu_k = \frac{1}{N_k} \sum \limits_{n=1}^{N} \gamma(z_{nk})\mathbf{x}_n \qquad \Sigma_k = \frac{1}{N_k} \sum \limits_{n=1}^{N} \gamma(z_{nk})(\mathbf{x}_n-\mu_k)(\mathbf{x}_n-\mu_k)^T$$
  • And repeat until convergence (a minimal sketch of one iteration follows below)...
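
A minimal NumPy/SciPy sketch of one such iteration, following the equations above (the array shapes and function name are my own choices, not a reference implementation):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, pis, mus, sigmas):
        """One EM iteration. X: (N, d); pis: (K,); mus: (K, d); sigmas: (K, d, d)."""
        N, K = len(X), len(pis)
        # E-step: responsibilities gamma(z_nk), shape (N, K)
        gamma = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
                                 for k in range(K)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities
        Nk = gamma.sum(axis=0)               # effective number of points per component
        new_pis = Nk / N
        new_mus = (gamma.T @ X) / Nk[:, None]
        new_sigmas = np.empty((K, X.shape[1], X.shape[1]))
        for k in range(K):
            diff = X - new_mus[k]
            new_sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        return new_pis, new_mus, new_sigmas

Iterating em_step from an initial guess until the log-likelihood stops improving gives the estimates of $\pi$, $\mu$ and $\Sigma$.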

Probabilistic Clustering

  • Example, 2 Gaussian components, after 1 pass:

Probabilistic Clustering

  • Example, 2 Gaussian components, after 2 passes:

Probabilistic Clustering

  • Example, 2 Gaussian components, after 3 passes:

Probabilistic Clustering

  • 2 Gaussian components are not a good fit for these data

Probabilistic Clustering

  • Example, 3 Gaussian components, after 1 pass:

Probabilistic Clustering

  • Example, 3 Gaussian components, after 2 passes:

Probabilistic Clustering

  • Example, 3 Gaussian components, after 3 passes:

Aprendizagem Automática

Probabilistic Clustering

  • Gaussian mixtures with Scikit-Learn:
    
    class sklearn.mixture.GaussianMixture:
    # (replaces the older sklearn.mixture.GMM class, which is no longer available)
    #arguments
    n_components=1
    covariance_type='full' #constraints on covariance: 'full', 'tied', 'diag' or 'spherical'
    max_iter=100
    
    #attributes
    weights_
    means_
    covariances_
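
A brief usage sketch with this scikit-learn API (the data here is random, just to show the calls):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.default_rng(0).normal(size=(300, 2))   # toy data: 300 points in 2D

    gmm = GaussianMixture(n_components=3, covariance_type='full', max_iter=100)
    labels = gmm.fit_predict(X)       # hard cluster assignments
    resp = gmm.predict_proba(X)       # soft assignments: the responsibilities gamma(z_nk)
    print(gmm.weights_, gmm.means_.shape, gmm.covariances_.shape)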
    

Probabilistic Clustering

Rand index

Rand index

  • In supervised learning, we saw the confusion matrix
                          Example Class
    Prediction            Class 1             Class 0
    Class 1               True Positive       False Positive
    Class 0               False Negative      True Negative

  • In unsupervised learning we cannot match examples with classes but we can consider all $N(N-1)/2$ pairs of points:
    • True Positive: a pair from the same group placed in the same cluster
    • True Negative: a pair from different groups placed in different clusters
    • False Positive: a pair from different groups placed in the same cluster
    • False Negative: a pair from the same group placed in different clusters

Rand index

For all $N(N-1)/2$ pairs of examples:

                          Same Group          Different Group
    Same Cluster          True Positive       False Positive
    Different Cluster     False Negative      True Negative

  • Using this analogy, we can compute the Rand index, which is analogous to accuracy, and other scores:

$$ precision = \frac{TP}{TP + FP} \qquad recall = \frac{TP}{TP + FN}$$
$$Rand = \frac{TP + TN}{N(N-1)/2} \qquad F_1 = 2 \times \frac{precision \times recall}{precision + recall}$$
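
A sketch of computing these pair-counting scores directly (brute force over all pairs, fine for small $N$; the function name and toy labels are my own):

    from itertools import combinations

    def pair_counting_scores(groups, clusters):
        """Rand, precision, recall and F1 over all N(N-1)/2 pairs of examples."""
        tp = tn = fp = fn = 0
        for i, j in combinations(range(len(groups)), 2):
            same_group = groups[i] == groups[j]
            same_cluster = clusters[i] == clusters[j]
            if same_group and same_cluster:
                tp += 1
            elif same_group and not same_cluster:
                fn += 1
            elif not same_group and same_cluster:
                fp += 1
            else:
                tn += 1
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        rand = (tp + tn) / (tp + tn + fp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return rand, precision, recall, f1

    # toy example: true groups vs. predicted clusters for six points
    print(pair_counting_scores([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]))

scikit-learn also provides sklearn.metrics.adjusted_rand_score for the Adjusted Rand index used in the assignment.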

Aprendizagem Automática

Assignment 2

Assignment 2

  • Clustering of seismic activity

Assignment 2

  • External indexes based on tectonic fault data

Assignment 2

  • Seismic events assigned to nearby relevant fault line

Assignment 2

  • Read the data file (use pandas?) and convert (lat,lon) to (x,y,z)
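
A possible sketch of this conversion, assuming a spherical Earth with radius 6371 km (the radius, file name and column names are assumptions, not the assignment specification):

    import numpy as np
    import pandas as pd

    RADIUS_KM = 6371.0   # assumed mean Earth radius (spherical approximation)

    def latlon_to_xyz(df, lat_col='latitude', lon_col='longitude'):
        """Convert latitude/longitude columns (in degrees) to 3D Cartesian coordinates in km."""
        lat = np.radians(df[lat_col].values)
        lon = np.radians(df[lon_col].values)
        x = RADIUS_KM * np.cos(lat) * np.cos(lon)
        y = RADIUS_KM * np.cos(lat) * np.sin(lon)
        z = RADIUS_KM * np.sin(lat)
        return np.column_stack([x, y, z])

    # hypothetical usage: data = pd.read_csv('events.csv'); X = latlon_to_xyz(data)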

Assignment 2

  • Read the data file (use pandas?) and convert (lat,lon) to (x,y,z)
  • Compare K-Means, DBSCAN and Gaussian Mixture models
  • Use silhouette score as internal index
  • Use Rand, Precision, Recall, F1 and Adjusted Rand as external indexes
  • Explore the main parameters ($k$, $\epsilon$ and the number of components)
  • Also implement the $\epsilon$ selection method in DBSCAN, described in the paper (requires understanding the method...)
  • The report is the most important part. Think about and discuss the three algorithms and the different scores, taking into account the relation between the distribution of earthquakes and the fault lines.

Probabilistic Clustering

Summary

Probabilistic Clustering

Summary

  • Fuzzy sets: model uncertainty
  • Fuzzy c-means: clustering with membership values
  • Probabilistic clustering with gaussian mixtures
  • EM algorithm: estimate posterior probabilities resulting from latent variables, maximize likelihood of parameters, repeat
  • Rand index and Assignment 2

Further reading