# Probabilistic clustering

## Probabilistic clustering

### Summary

• Fuzzy sets and clustering
• Fuzzy c-means
• Probabilistic Clustering: mixture models
• Expectation-Maximization, more formally
• External indexes: Rand index
• Assignment 2

# Fuzzy sets

## Fuzzy sets

• In conventional set theory, elements either belong or don't belong to a set
• In fuzzy set theory, each element has a membership value $u_S(\mathbf{x}) \in [0,1]$ indicating its degree of membership in the set $S$
• More formally, a fuzzy set $S$ is a set of ordered pairs
• $$S = \{(\mathbf{x},u_S(\mathbf{x}))|\mathbf{x} \in X \}$$

## Fuzzy sets

• Fuzzy sets allow modelling uncertainty
• Linguistic and conceptual uncertainty (old, large, important, ...)
Source: Wikimedia, CC BY-SA 3.0, fullofstars

## Fuzzy sets

• Fuzzy sets allow modelling uncertainty
• Linguistic and conceptual uncertainty (old, large, important, ...)
• Informational uncertainty (credit, security risks, ...)
• Note: fuzzy membership is not probability. It is a measure of the degree to which an element matches some imprecisely defined property
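For example, a minimal hand-made membership function for the fuzzy linguistic term "old" (the breakpoints 30 and 70 are arbitrary choices for illustration, not from the lecture):

```python
def u_old(age):
    """Degree of membership of age in the fuzzy set "old"."""
    if age <= 30:
        return 0.0
    if age >= 70:
        return 1.0
    return (age - 30) / 40  # linear ramp between 30 and 70

# u_old(25) -> 0.0, u_old(50) -> 0.5, u_old(80) -> 1.0
```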

## Fuzzy c-partition

• $\mathbf{U}(\mathbf{X})$ is a fuzzy c-partition of $X$ if:

$$0 \leq u_k(\mathbf{x}_n) \leq 1 \qquad \forall k, n$$
$$\sum \limits_{k=1}^{c} u_k(\mathbf{x}_n) = 1 \qquad \forall n$$
$$0\leq \sum \limits_{n=1}^{N} u_k(\mathbf{x}_n) \leq N \qquad \forall k$$
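As a quick check, a small NumPy sketch (the helper name is my own) verifying that a $c \times N$ membership matrix satisfies these conditions:

```python
import numpy as np

def is_fuzzy_c_partition(U):
    """U: c x N membership matrix, with U[k, n] = u_k(x_n)."""
    in_range = np.all((U >= 0) & (U <= 1))         # 0 <= u_k(x_n) <= 1
    sums_to_one = np.allclose(U.sum(axis=0), 1.0)  # sum_k u_k(x_n) = 1
    return in_range and sums_to_one                # row sums <= N then follow

U = np.array([[1.0, 0.3, 0.5],
              [0.0, 0.7, 0.5]])  # a fuzzy 2-partition of 3 points
print(is_fuzzy_c_partition(U))   # True
```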

# Fuzzy c-Means

## Fuzzy c-Means

• From a set $X$ of $N$ unlabelled data points
• Return the $c \times N$ membership matrix of $u_k(\mathbf{x}_n)$ values, defining a fuzzy $c$-partition of $X$
• Return the centroids $\{\mathbf{c}_1, ..., \mathbf{c}_c\}$
• Minimizing the squared error function: $$J_m(X,C) = \sum\limits_{k=1}^{c} \sum\limits_{n=1}^{N} u_k(\mathbf{x}_n)^m \|\mathbf{x}_n-\mathbf{c}_k\|^2 \qquad m > 1$$

## Fuzzy c-Means

• Minimizing the squared error function: $$J_m(X,C) = \sum\limits_{k=1}^{c} \sum\limits_{n=1}^{N} u_k(\mathbf{x}_n)^m \|\mathbf{x}_n-\mathbf{c}_k\|^2 \qquad m > 1$$
• Subject to: $$\sum \limits_{k=1}^{c} u_k(\mathbf{x}_n) = 1 \qquad \forall n$$
• and where $m$ is the degree of fuzzification; typically, $m=2$

## Fuzzy c-Means

• The derivatives w.r.t. $u_k(\mathbf{x}_n)$ are zero at: $$u_k(\mathbf{x}_n)= \frac{\left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_k\|^2}\right)^{\frac{1}{m-1}}} {\sum \limits_{j=1}^{c} \left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_j\|^2}\right)^{\frac{1}{m-1}}}$$
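• For the typical choice $m=2$, this simplifies to inverse squared-distance weighting: $$u_k(\mathbf{x}_n)= \frac{1/\|\mathbf{x}_n-\mathbf{c}_k\|^2} {\sum \limits_{j=1}^{c} 1/\|\mathbf{x}_n-\mathbf{c}_j\|^2}$$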

## Fuzzy c-Means

• The derivatives w.r.t. $c_k$ are zero at:
• $$c_k = \frac { \sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m\mathbf{x}_n} {\sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m}$$
• That is, each centroid $\mathbf{c}_k$ is the weighted mean of the example vectors, with weights $u_k(\mathbf{x}_n)^m$

## Fuzzy c-Means algorithm

• Random initial centroids $\{C_1, ..., C_c\}$
• Compute $u$:
• $$u_k(\mathbf{x}_n)= \frac{\left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_k\|^2}\right)^{\frac{1}{m-1}}} {\sum \limits_{j=1}^{c} \left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_j\|^2}\right)^{\frac{1}{m-1}}}$$
• Recompute $c_k$:
• $$c_k = \frac { \sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m\mathbf{x}_n} {\sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m}$$
• Repeat until $t$ iterations are reached or the change is $<\epsilon$
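A minimal NumPy sketch of this loop (function and variable names are my own; assumes `X` is an $N \times d$ array):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2, max_iter=100, eps=1e-5, seed=None):
    rng = np.random.default_rng(seed)
    # random initial centroids: c distinct points drawn from X
    centroids = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(max_iter):
        # squared distances, shape (c, N); floor avoids division by zero
        d2 = ((X[None, :, :] - centroids[:, None, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # membership update: u_k(x_n) proportional to (1/d2)^(1/(m-1))
        u = (1.0 / d2) ** (1.0 / (m - 1))
        u /= u.sum(axis=0, keepdims=True)
        # centroid update: weighted mean with weights u^m
        w = u ** m
        new_centroids = (w @ X) / w.sum(axis=1, keepdims=True)
        if np.abs(new_centroids - centroids).max() < eps:
            centroids = new_centroids
            break
        centroids = new_centroids
    return u, centroids
```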

## Fuzzy c-Means algorithm

### Defuzzification

• If we want a crisp clustering, we need to convert the fuzzy memberships into $\{0,1\}$ assignments, e.g. by:
• Maximum membership: assign each point to the cluster where its membership is highest
• Nearest centroid: assign each point to the cluster of its closest centroid
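Both rules are one-liners on the outputs of the sketch above (`u` and `centroids` as returned by `fuzzy_c_means`):

```python
# maximum membership: the cluster with the largest u_k(x_n)
labels_max = u.argmax(axis=0)

# nearest centroid: the cluster whose centroid is closest to each point
d2 = ((X[None, :, :] - centroids[:, None, :]) ** 2).sum(axis=2)
labels_nearest = d2.argmin(axis=0)
```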

# Probabilistic Clustering

## Probabilistic Clustering

• Suppose we have a probabilistic model for the distribution of examples $X$: $$P(X,Y) = P(X|Y) P(Y)$$
• This would not only provide a way of clustering $X$ over $Y$ but also a way to generate examples from that distribution.
• The problem is that we do not know $Y$

## Probabilistic Clustering

• Recall, from EM:
• Set $X$ of observed data
• Set $Z$ of missing data or latent variables
• Unknown parameters $\theta$
• Likelihood function $L(\theta;X,Z) = p(X,Z|\theta)$

## Probabilistic Clustering

• Let us assume Gaussian distributions. In 1D: $$\mathcal{N}(x|\mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{- \frac{(x-\mu)^2}{2\sigma^2}}$$
Source: Wikimedia, public domain

## Probabilistic Clustering

• We can create a mixture of Gaussians by summing the different distributions, each multiplied by a weight

Source: Wikimedia, public domain

## Probabilistic Clustering

• For multivariate Gaussian distributions:
• $$\mathcal{N}(\mathbf{x}|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} e^{- \frac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)}$$
• Where $\mu$ is the vector of means, $\Sigma$ the covariance matrix and $d$ the dimension of $\mathbf{x}$
• A mixture model of $K$ Gaussians is:
• $$p(x) = \sum \limits_{k=1}^{K} \pi_k \mathcal{N}(x|\mu_k, \Sigma_k)$$
• where $\pi_k$ is the mixing coefficient, such that
• $$\sum \limits_{k=1}^{K} \pi_k = 1 \qquad 0 \leq \pi_k \leq 1 \ \forall k$$
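As an illustration, a sketch evaluating such a mixture density with SciPy (the parameters below are made up for the example):

```python
import numpy as np
from scipy.stats import multivariate_normal

# a 2D mixture of K=2 Gaussians; mixing coefficients sum to 1
pis = [0.6, 0.4]
mus = [np.zeros(2), np.array([3.0, 3.0])]
sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def mixture_pdf(x):
    # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

print(mixture_pdf([1.0, 1.0]))
```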

## Probabilistic Clustering

• If, for each $\mathbf{x}$, we introduce a variable $\mathbf{z}$ such that: $$z_k \in \{0,1\} \qquad \sum_{k}z_k = 1$$
• And we define the distribution:
• $$p(\mathbf{x},\mathbf{z}) = p(\mathbf{z}) p(\mathbf{x}|\mathbf{z})$$
• ($z_k$ are the latent variables assigning each example to each cluster)
• The marginal probability $p(z_k =1 ) = \pi_k$
• And $p(\mathbf{x}|z_k =1 ) = \mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)$
• Which brings us back to:
• $$p(\mathbf{x}) = \sum_z p(\mathbf{z}) p(\mathbf{x}|\mathbf{z}) = \sum \limits_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)$$
• but now explicitly with the latent variables

## Probabilistic Clustering

• Using Bayes' rule and the previous equations, we get the posterior probability that component $k$ is responsible for $\mathbf{x}_n$, which we will call $\gamma(z_{nk})$:
• $$p(z_k=1|\mathbf{x}_n) = \frac{p(z_k=1) p(\mathbf{x}_n|z_k=1)} {\sum \limits_{j=1}^{K} p(z_j=1) p(\mathbf{x}_n|z_j=1)} = \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)} {\sum \limits_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_n|\mu_j, \Sigma_j)}=\gamma(z_{nk})$$
• Since we want to maximize the log-likelihood of our parameters:
• $$\ln p(\mathbf{X |\pi, \mu, \Sigma}) = \sum \limits_{n=1}^{N}\ln \left(\sum \limits_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k) \right)$$
• We want to find the maximum, where the derivative is 0

## Probabilistic Clustering

• Setting to zero the derivative w.r.t. $\mathbf{\mu}_k$:
• $$0 = - \sum \limits_{n=1}^{N} \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)} {\sum \limits_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_n|\mu_j, \Sigma_j)} \Sigma_k^{-1}(\mathbf{x}_n - \mathbf{\mu}_k)$$ $$\mu_k = \frac{1}{N_k} \sum \limits_{n=1}^{N} \gamma(z_{nk})\mathbf{x}_n$$
• Where $N_k$ is the effective number of points in component $k$:
• $$N_k = \sum \limits_{n=1}^{N}\gamma(z_{nk})$$

## Probabilistic Clustering

• Setting to zero the derivative w.r.t. $\mathbf{\mu}_k$:
• $$\mu_k = \frac{1}{N_k} \sum \limits_{n=1}^{N} \gamma(z_{nk})\mathbf{x}_n$$
• Doing the same for $\Sigma_k$: $$\Sigma_k = \frac{1}{N_k} \sum \limits_{n=1}^{N} \gamma(z_{nk})(\mathbf{x}_n-\mu_k) (\mathbf{x}_n-\mu_k)^T$$
• Which is the covariance of the examples around $\mu_k$, weighted by the $\gamma(z_{nk})$
• Finally, the $\pi_k$:
• $$\pi_k = \frac{N_k}{N}$$

## Probabilistic Clustering

• Now we do EM. First, expectation, computing the posterior probabilities $\gamma(z_{nk})$ from an initial guess of the parameters $\pi, \mu, \Sigma$:
• $$\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)} {\sum \limits_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_n|\mu_j, \Sigma_j)}$$
• Then we use the posterior $\gamma(z_{nk})$ in the likelihood function and maximize w.r.t. $\pi, \mu, \Sigma$:
• $$\pi_k = \frac{N_k}{N} \qquad \mu_k = \frac{1}{N_k} \sum \limits_{n=1}^{N} \gamma(z_{nk})\mathbf{x}_n \qquad \Sigma_k = \frac{1}{N_k} \sum \limits_{n=1}^{N} \gamma(z_{nk})(\mathbf{x}_n-\mu_k)(\mathbf{x}_n-\mu_k)^T$$
• And repeat until convergence...
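A minimal NumPy/SciPy sketch of the whole loop (names are my own; a fixed number of iterations stands in for a proper convergence test):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # initial guess: uniform weights, random means, identity covariances
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.stack([np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk), shape (N, K)
        gamma = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: effective counts N_k, then pi_k, mu_k, Sigma_k
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma, gamma
```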

## Probabilistic Clustering

• Example: 2 Gaussian components, after 1, 2 and 3 passes (figures omitted)
• 2 Gaussian components are not a good fit for these data

## Probabilistic Clustering

• Example: 3 Gaussian components, after 1, 2 and 3 passes (figures omitted)

## Probabilistic Clustering

• Gaussian mixtures with Scikit-Learn:

```python
# sklearn.mixture.GMM was renamed GaussianMixture in scikit-learn 0.18
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=1,          # number of mixture components
    covariance_type='diag',  # constraints on the covariance matrices
    max_iter=100,            # maximum number of EM iterations (was n_iter)
)
gmm.fit(X)  # X: (N, d) array of examples

# fitted attributes
gmm.weights_       # mixing coefficients pi_k
gmm.means_         # component means mu_k
gmm.covariances_   # component covariances Sigma_k (was covars_)
```
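For clustering, `predict` returns the maximum-responsibility component for each example and `predict_proba` returns the responsibilities $\gamma(z_{nk})$ themselves (a short sketch continuing the example above):

```python
labels = gmm.predict(X)       # hard assignment: argmax_k of gamma(z_nk)
gamma = gmm.predict_proba(X)  # responsibilities, shape (N, n_components)
```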


# Rand index

## Rand index

• In supervised learning, we saw the confusion matrix:

| Prediction | Example Class 1 | Example Class 0 |
|---|---|---|
| Class 1 | True Positive | False Positive |
| Class 0 | False Negative | True Negative |

• In unsupervised learning we cannot match examples with classes but we can consider all $N(N-1)/2$ pairs of points:
• True Positive: a pair from the same group placed in the same cluster
• True Negative: a pair from different groups placed in different clusters
• False Positive: a pair from different groups placed in the same cluster
• False Negative: a pair from the same group placed in different clusters

## Rand index

For all $N(N-1)/2$ pairs of examples:

| | Same Group | Different Group |
|---|---|---|
| Same Cluster | True Positive | False Positive |
| Different Cluster | False Negative | True Negative |

• Using this analogy, we can compute the Rand index, which is analogous to accuracy, and other scores:

$$precision = \frac{TP}{TP + FP} \qquad recall = \frac{TP}{TP + FN}$$
$$Rand = \frac{TP + TN}{N(N-1)/2} \qquad F_1 = 2 \times \frac{precision \times recall}{precision + recall}$$
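A brute-force sketch of the pair counting (fine for small $N$; the function name is my own):

```python
from itertools import combinations

def pair_confusion(groups, clusters):
    """Count TP, TN, FP, FN over all N(N-1)/2 pairs of points."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(groups)), 2):
        same_group = groups[i] == groups[j]
        same_cluster = clusters[i] == clusters[j]
        if same_group and same_cluster:
            tp += 1
        elif same_group:        # same group, different clusters
            fn += 1
        elif same_cluster:      # different groups, same cluster
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn

# toy example: 4 points, 2 true groups, one point misplaced
tp, tn, fp, fn = pair_confusion([0, 0, 1, 1], [0, 1, 1, 1])
rand = (tp + tn) / (tp + tn + fp + fn)  # (1 + 2) / 6 = 0.5
```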

# Assignment 2

## Assignment 2

• Clustering of seismic activity

## Assignment 2

• External indexes based on tectonic fault data

## Assignment 2

• Seismic events assigned to nearby relevant fault line

## Assignment 2

• Read the data file (use pandas?) and convert (lat,lon) to (x,y,z)
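One way to do the conversion, assuming a spherical Earth and coordinates in degrees (file and column names below are placeholders, check the assignment data):

```python
import numpy as np
import pandas as pd

RADIUS = 6371.0  # mean Earth radius in km

df = pd.read_csv('earthquakes.csv')  # placeholder file name
lat = np.radians(df['latitude'])
lon = np.radians(df['longitude'])

# spherical to Cartesian coordinates
x = RADIUS * np.cos(lat) * np.cos(lon)
y = RADIUS * np.cos(lat) * np.sin(lon)
z = RADIUS * np.sin(lat)
X = np.column_stack([x, y, z])  # (N, 3) array for the clustering algorithms
```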

## Assignment 2

• Read the data file (use pandas?) and convert (lat,lon) to (x,y,z)
• Compare K-Means, DBSCAN and Gaussian Mixture models
• Use the silhouette score as internal index
• Use Rand, Precision, Recall, F1 and Adjusted Rand as external indexes (see the sketch after this list)
• Explore the main parameters ($k$, $\epsilon$ and the number of components)
• Also implement the $\epsilon$ selection method for DBSCAN described in the paper (this requires understanding the method...)
• The report is the most important part. Think about and discuss the three algorithms and the different scores, taking into account the relation between the distribution of earthquakes and the fault lines.
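For the indexes, scikit-learn provides the silhouette and adjusted Rand scores directly (a sketch; `X`, `labels` and `true_labels` are assumed from the previous steps, and the pairwise Rand, precision, recall and F1 can reuse the pair counting from the Rand index section):

```python
from sklearn.metrics import silhouette_score, adjusted_rand_score

sil = silhouette_score(X, labels)               # internal index
ari = adjusted_rand_score(true_labels, labels)  # external index
```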

# Summary

## Probabilistic Clustering

### Summary

• Fuzzy sets: model uncertainty
• Fuzzy c-means: clustering with membership values
• Probabilistic clustering with Gaussian mixtures
• EM algorithm: estimate the posterior probabilities of the latent variables, maximize the likelihood of the parameters, repeat
• Rand index and Assignment 2