Aprendizagem Automática

Clustering and Manifold Learning

Ludwig Krippahl

Clustering and Manifold Learning

Summary

  • Fuzzy sets and clustering
  • Fuzzy c-means
  • Manifold learning
    • tSNE and Isomap
  • External indexes: Rand index
  • Assignment 2

Aprendizagem Automática

Fuzzy sets

Fuzzy sets

  • In conventional set theory, elements either belong or don't belong to a set
  • In fuzzy set theory, each element has a $u_S(\mathbf{x}) \in [0,1]$ indicating the degree of membership to set $S$
  • More formally, a fuzzy set $S$ is a set of ordered pairs
  • $$S = \{(\mathbf{x},u_S(\mathbf{x}))|\mathbf{x} \in X \}$$

Fuzzy sets

  • Fuzzy sets allow modelling uncertainty
  • Linguistic and conceptual uncertainty (old, large, important, ...)

Wikimedia, CC BY-SA 3.0 fullofstars

Fuzzy sets

  • Fuzzy sets allow modelling uncertainty
  • Linguistic and conceptual uncertainty (old, large, important, ...)
  • Informational uncertainty (credit, security risks, ...)
  • Note: fuzzy membership is not probability. It's a measure of similarity to some imprecise properties

Fuzzy c-partition

  •  $\mathbf{U}(\mathbf{X})$ is a fuzzy c-partition of $X$ if:


  • $$0 \leq u_k(\mathbf{x}_n) \leq 1 \qquad \forall k, n$$

    $$\sum \limits_{k=1}^{c} u_k(\mathbf{x}_n) = 1 \qquad \forall n$$

    $$0\leq \sum \limits_{n=1}^{N} u_k(\mathbf{x}_n) \leq N \qquad \forall k$$

Aprendizagem Automática

Fuzzy c-Means

Fuzzy c-Means

  • From set $X$ of $N$ unlabelled data points
  • Return the $c \times N$ membership matrix with $u_k(\mathbf{x}_n)$, defining a fuzzy $c$-partition of $X$
  • Return the $\{C_1, ..., C_c\}$ centroids
  • Minimizing the squared error function:
  • $$J_m(X,C) = \sum\limits_{k=1}^{c} \sum\limits_{n=1}^{N} u_k(\mathbf{x}_n)^m\|\mathbf{x}_n-\mathbf{c}_k\|^2 \qquad m \geq 1$$

Fuzzy c-Means

  • Minimizing the squared error function:
  • $$J_m(X,C) = \sum\limits_{k=1}^{c} \sum\limits_{n=1}^{N} u_k(\mathbf{x}_n)^m\|\mathbf{x}_n-\mathbf{c}_k\|^2 \qquad m \geq 1$$
  • Subject to:
  • $$\sum \limits_{k=1}^{c} u_k(\mathbf{x}_n) = 1 \qquad \forall n$$
  • and where $m$ is the degree of fuzzification ; tipically, $m=2$

Fuzzy c-Means

  • The derivatives w.r.t. $u_k(\mathbf{x}_n)$ are zero at:
  • $$u_k(\mathbf{x}_n)=\frac{\left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_k\|^2}\right)^{\frac{2}{m-1}}} {\sum \limits_{j=1}^{c} \left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_j\|^2}\right)^{\frac{2}{m-1}}}$$
  • The derivatives w.r.t. $c_k$ are zero at:
  • $$c_k = \frac { \sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m\mathbf{x}_n}{\sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m}$$
  • That is, each centroid $c_k$ is the weighted mean of the example vectors using the membership values.

Fuzzy c-Means algorithm

  • Random initial centroids $\{C_1, ..., C_c\}$
  • Compute $u$:
  • $$u_k(\mathbf{x}_n)=\frac{\left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_k\|^2}\right)^{\frac{2}{m-1}}}{\sum \limits_{j=1}^{c} \left(\frac{1}{\|\mathbf{x}_n-\mathbf{c}_j\|^2}\right)^{\frac{2}{m-1}}}$$
  • Recompute $c_k$:
  • $$c_k = \frac { \sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m\mathbf{x}_n}{\sum \limits_{n=1}^{N}u_k(\mathbf{x}_n)^m}$$
  • Repeat until $t$ iterations or change $<\epsilon$

Fuzzy c-Means algorithm

  • The result is a clustering similar to k-means, but with membership values:

Source: "Simulated Annealing - Advances, Applications and Hybridizations" Ed. Marcos de Sales Guerra Tsuzuki, CC BY 3.0

Fuzzy c-Means algorithm

Defuzzification

  • If we want a crisp clustering, we'll need to convert the fuzzy membership function into $\{0,1\}$
    • Maximum membership
    • Nearest centroid

Aprendizagem Automática

Manifold Learning

Manifold Learning

Manifold

  • A set of points such that the neighbourhood of each is homomorphic to an euclidean space
    • Example: the surface of a sphere

Manifold Learning

  • Data may cover a lower dimension manifold of the space

Manifold Learning

  • Learn lower dimension embeddings of data manifold

t-SNE

t-distributed stochastic neighbor embedding

  • Probability $p_{j|i}$ of point $x_i$ choosing $x_j$ as neighbour with Gaussian distribution:
  • $$p_{j|i} = \frac{exp(-||x_i - x_j||^2/2\sigma_i^2)}{\sum\limits_{i\neq j}exp(-||x_i - x_j||^2/2\sigma_i^2)} \qquad (p_{i|i} = 0)$$
  • From here, joint probability $p_{ji} = (p_{j|i}+p_{i|j})/2$
  • Joint probability of the low-dimensional counterparts $y_i$ and $y_j$ uses a t-Student distribution with 1 degree of freedom:
  • $$q_{j|i} = \frac{(1+||y_i - y_j||)^{-1}}{\sum\limits_{i\neq j}(1+||y_i - y_j||)^{-1}} \qquad (q_{i|i} = 0)$$

t-SNE

t-distributed stochastic neighbor embedding

  • Compute the values of $\mathbf{y}$ that minimize the Kullback–Leibler divergence of $q_{ij}$ with respect to $p_{ij}$:

  • $$KL(P||Q) = \sum\limits_{i\neq j} p_{ij}\log \frac{p_{ij}}{q_{ij}}$$
  • This makes the neighbourhood distribution of $\mathbf{y}$ similar to $\mathbf{x}$

t-SNE

t-distributed stochastic neighbor embedding

  • From 3D to 2D and 1D

Isomap

Isomap algorithm

  • For each point, list neighbours (e.g. K-nearest neighbours)
  • Create a graph connecting each point to its neighbours
    • Edge is the euclidean distance in the original space
  • Compute pairwise distances between points from paths in the neighbourhood graphs
    • Euclidean distance but only along the neighbourhoods
    • Thus, distance along the manifold
  • Finally, compute low-dimensional embedding that respects distances
    • Multidimensional scaling: compute coordinates that approximate distances.

Isomap

  • From 3D to 2D and 1D

Manifold Learning

Capture the data manifold in lower dimensions

Manifold Learning

Capture the data manifold in lower dimensions

  • This generally cannot be done with a linear transformation
    • PCA just chooses a straight line along largest variance
    • Cannot follow the manifold
  • Comparison: PCA, t-SNE, Isomap

Manifold Learning

Nonlinear dimensionality reduction

  • Other examples:
    • Self-organizing maps
    • Kernel PCA
    • Autoencoders
  • In Scikit-learn:
    • Isomap
    • t-SNE
    • Locally Linear Embedding
    • Spectral Embedding
    • Multi-dimensional Scaling

Clustering and Manifold Learning

Rand index

Rand index

  • In supervised learning, we saw the confusion matrix
  • Example Class
    PredictionClass 1Class 0
    Class 1True PositiveFalse Positive
    Class 0False NegativeTrue Negative

  • In unsupervised learning we cannot match examples with classes but we can consider all $N(N-1)/2$ pairs of points:
    • True Positive: a pair from the same group placed in the same cluster
    • True Negative: a pair from different groups placed in different clusters
    • False Positive: a pair from different groups placed in the same cluster
    • False Negative: a pair from the same group placed in different clusters

Rand index

For all $N(N-1)/2$ pairs of examples
Same GroupDifferent Group
Same ClusterTrue PositiveFalse Positive
Different ClusterFalse NegativeTrue Negative

  • Using this analogy, we can compute the Rand index, which is analogous to accuracy, and other scores:

  • $$ precision = \frac{TP}{TP + FP} \qquad recall = \frac{TP}{TP + FN}$$
    $$Rand = \frac{TP + TN}{N(N-1)/2} \qquad F_1 = 2 \times \frac{precision \times recall}{precision + recall}$$

Rand index

Adjusted rand index

  • The Rand index can be artificially high by mere chance
    • Evem if clusters are assigned at random in many cases pairs from different groups will fall into different clusters
  • The adjusted Rand index solves this problem by correcting the index with the index value expected by chance:
  • $$ARI = \frac{RI - E(RI)}{1-E(RI)}$$
  • Note: while the Rand index must be between 0 and 1, the ARI can be negative.
  • In sklearn:

sklearn.metrics.adjusted_rand_score
		

Aprendizagem Automática

Assignment 2

Assignment 2

Clustering of cell images (Staphylococcus aureus)

  • Images from fluorescence microscopy
  • Automated segmentation (images include segmentation errors)

Assignment 2

  • Data set of 563 images of 50x50 pixels each, in images folder
  • Each image has a black background and the segmented region centered
  • Some images are labelled (1, 2 and 3) with the cell cycle phase

  • Labels are in file labels.txt (image ID, label)
    • A label of 0 indicates the image was not labelled

Assignment 2

Objective: help the biologists

  • Propose a procedure to aid the biologists
    • Reject segmentation errors
    • Select cells of some type
    • Classify according to the cell cycle phase
    • ...
  • Clustering similar cells can help deal with groups instead of individual examples

Assignment 2

Different from assignment 1

  • Assignment 1 was mostly implementation and following the recipes
  • Assignment 2 is more realistic: you are the data scientists
    • You must figure out what to do with the data
    • (The biologists know nothing about clustering)
  • Part of the assignment is scripted:
    • Use PCA, t-SNE and Isomap to extract 18 features from the images
      • (6 components each)
    • Test at least 2 clustering algorithms (K-Means and DBSCAN)
    • Find $\epsilon$ parameter with method from DBSCAN paper.
    • Implement and measure cluster quality metrics (rand index, ARI, Silhouette, etc)
  • The rest is up to you:
    • Select features, optimize clustering, evaluate results and decide how to use them

Assignment 2

Practical details

  • Read the assignment page carefully
    • And check it occasionally, as I may update it for clarifications
  • Download zip file with:
    • Images in the images folder
    • Labels (in labels.txt)
    • Auxiliary functions for loading data and generating HTML reports (tp2_aux.py)
    • Report example using the manual labels (example_labels.html)
  • Implement the code, examine the results
  • Answer the questions (TP2.txt)

Assignment 2

Optional: implement bisecting k-means

Clustering and Manifold Learning

Summary

Clustering and Manifold Learning

Summary

  • Fuzzy sets: model uncertainty
  • Fuzzy c-means: clustering with membership values
  • Manifold learning
  • Rand index and Assignment 2

Further reading

  • Manifold learning on Scikit-Learn
    • https://scikit-learn.org/stable/modules/manifold.html