Aprendizagem Automática

Hierarchical clustering

Ludwig Krippahl

Hierarchical clustering

Summary

  • Hierarchical Clustering
    • Agglomerative Clustering
    • Divisive Clustering
  • Clustering Features

Aprendizagem Automática

Hierarchical clustering

Hierarchical clustering

  • Grouping groups of groups (...)

Source: Wikipedia

Hierarchical clustering

  • Clustering gene activity patterns
  • Source: Wikipedia

Hierarchical clustering

  • Can be represented as a dendrogram
  • Source: Wikipedia

Hierarchical clustering

  • Need to measure how alike examples are:
    • Proximity: Generic term for "likeness"
    • Similarity: Measure of how alike, generally $\in [0,1]$
    • Dissimilarity: Measure of difference.
    • Distance is a special case of dissimilarity, satisfying:
$$d(x,y)\geq 0 \ , \ d(x,y)=d(y,x) \ , \ d(x,z)\leq d(x,y) + d(y,z)$$
  • Some measures of distance between examples:
    • Euclidean: $\|x-y\|_2 = \sqrt{\sum\limits_{d}(x_d-y_d)^2}$
    • Squared Euclidean: $\|x-y\|_2^2 = \sum\limits_{d}(x_d-y_d)^2$
    • Manhattan: $\|x-y\|_1 = \sum\limits_{d}|x_d-y_d|$

Hierarchical clustering

  • Mahalanobis: Normalized by variance $$\sqrt{(x-y)^T Cov^{-1} (x-y)}$$
  • Hamming: number of positions at which two equal-length strings differ $$d(x,y) = \sum \limits_i \mathbb{1}\left[x_i \neq y_i\right]$$
  • Levenshtein: minimum number of edits (insertions, substitutions, deletions) needed to turn one string into the other
  • (and many problem-dependent measures)
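
A minimal sketch of these distance measures with NumPy/SciPy; the vectors x and y, the toy data used to estimate the covariance, and the example strings are made up for illustration:

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # ||x - y||_2
sq_euclidean = np.sum((x - y) ** 2)         # ||x - y||_2^2
manhattan = np.sum(np.abs(x - y))           # ||x - y||_1

# Mahalanobis: needs the inverse covariance estimated from some data (toy data here)
data = np.random.RandomState(0).randn(100, 3)
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
mahalanobis = distance.mahalanobis(x, y, inv_cov)

# Hamming: count positions where two equal-length strings differ
a, b = "karolin", "kathrin"
hamming = sum(ca != cb for ca, cb in zip(a, b))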

Also need to compare groups (linkage)

  • (Some measures assume Euclidean distance between examples)

Hierarchical clustering

  • Single linkage: $dist(C_j,C_k) = \min\limits_{x \in C_j,\, y \in C_k} dist(x,y)$

Hierarchical clustering

  • Complete linkage: $dist(C_j,C_k) = \max\limits_{x \in C_j,\, y \in C_k} dist(x,y)$

Hierarchical clustering

  • Centroid linkage: $dist(C_j,C_k) = dist\left(\frac{\sum_{x \in C_j} x}{|C_j|} , \frac{\sum_{y \in C_k} y}{|C_k|}\right)$

Hierarchical clustering

More examples of linkage

  • Average linkage $$dist(C_j,C_k) = \operatorname*{mean}_{x \in C_j,\, y \in C_k} dist(x,y)$$
  • Median linkage $$dist(C_j,C_k) = \operatorname*{median}_{x \in C_j,\, y \in C_k} dist(x,y)$$
  • Ward linkage: minimize SSE $$\sum \limits_{n=1}^{N} \sum \limits_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2$$
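
As a rough sketch, the different linkage values between two small clusters can be computed directly from the pairwise Euclidean distance matrix; the cluster arrays Cj and Ck below are made-up examples:

import numpy as np
from scipy.spatial.distance import cdist

Cj = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
Ck = np.array([[4.0, 4.0], [5.0, 4.0]])

D = cdist(Cj, Ck)    # all pairwise Euclidean distances between the two clusters

single   = D.min()   # closest pair
complete = D.max()   # farthest pair
average  = D.mean()  # mean over all pairs
median   = np.median(D)  # median over all pairs (definition used above)
centroid = np.linalg.norm(Cj.mean(axis=0) - Ck.mean(axis=0))  # distance between centroids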

Hierarchical clustering

  • Some way to compare examples: distance, similarity, etc...
  • Some way to compare clusters (linkage): single, complete, etc...

Advantages:

  • No need to assume number of clusters
  • Hierarchical organization may correspond to some aspect of the data (e.g. phylogeny)

Disadvantages:

  • Single pass: greedy local decisions cannot be undone and may be wrong
  • Hierarchical organization may be confusing or reflect idiosyncrasies of the clustering algorithm

Hierarchical clustering

Clustering algorithms

  • Agglomerative clustering (bottom-up)
    • Start with singleton clusters, join best two (linkage), repeat until all joined
    • Generally $O(n^3)$, but can be faster with connectivity constraints
  • Divisive clustering (top-down)
    • Start with single cluster, pick cluster to split, repeat until all examples separated or level $k$ reached
    • Generally $O(2^n)$ for exhaustive search, and needs an additional clustering algorithm for splitting.
    • But can be better if we only want a few levels of clustering from the top.

Aprendizagem Automática

Agglomerative Clustering

Agglomerative Clustering

  • Start with singleton clusters

Agglomerative Clustering

  • Join closest (linkage function), repeat

Agglomerative Clustering

  • The result is represented in a dendrogram
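
A minimal sketch of building and plotting such a dendrogram with SciPy; the data X, the linkage method and the metric are illustrative choices, not the slides' example:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.RandomState(42)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 5])  # two separated blobs

# each row of Z records one merge: the two clusters joined, their distance, the new size
Z = linkage(X, method='average', metric='euclidean')
dendrogram(Z)
plt.show()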

Agglomerative Clustering

  • The result is a hierarchy of clusters
  • But we may want a partitional clustering
  • The solution is to select a level on the dendrogram
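
A sketch of cutting the hierarchy into a flat (partitional) clustering with SciPy's fcluster; the data and the cut levels are arbitrary choices for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 5, rng.randn(20, 2) + [0, 5]])

Z = linkage(X, method='ward')

# cut the dendrogram so that at most 3 clusters remain
labels = fcluster(Z, t=3, criterion='maxclust')

# alternatively, cut at a fixed distance threshold
labels_by_height = fcluster(Z, t=10.0, criterion='distance')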

Agglomerative Clustering

  • Two clusters

Agglomerative Clustering

  • Three clusters

Agglomerative Clustering

  • Four clusters

Agglomerative Clustering

  • Five clusters

Agglomerative Clustering

  • Connectivity constraints
    • Agglomerative clustering is generally $O(n^3)$, not good for large datasets
    • Also, we may not want clustering to aggregate solely by distance

Agglomerative Clustering

    • We can prevent this by providing some structure via connectivity constraints
    • Connectivity constraints define the graph of connections between examples
    • Only clusters with connected examples can be joined
    • Forces clustering to respect structure and can greatly speedup computation
    • With Scikit-Learn, we can use the nearest neighbours graph:
      • (kneighbors_graph returns an $N \times N$ sparse matrix with 1 for connected pairs)
    
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.neighbors import kneighbors_graph
    
    # connectivity graph: each example linked to its 10 nearest neighbours
    connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
    # Ward agglomerative clustering restricted to the connectivity graph
    ward = AgglomerativeClustering(n_clusters=6, connectivity=connectivity,
                                   linkage='ward').fit(X)
    

    Based on this Scikit-Learn tutorial

Agglomerative Clustering

  • Without constraints clusters reach out over space:

Agglomerative Clustering

  • Constraints speed up and guide clustering

Agglomerative Clustering

  • Using AC with Scikit-Learn:

class sklearn.cluster.AgglomerativeClustering:
#arguments
n_clusters=2,         #number of clusters
affinity='euclidean', #distance between examples
connectivity=None,    #connectivity constraints
linkage='ward'        #'ward', 'complete', 'average'

#attributes
labels_               # array [n_samples]
children_             # array, shape (n_nodes-1, 2)
  • Three linkage options available in Scikit-Learn
    • Complete linkage $dist(C_j,C_k) = \max\limits_{x \in C_j,\, y \in C_k} dist(x,y)$
    • Average linkage $dist(C_j,C_k) = \operatorname*{mean}_{x \in C_j,\, y \in C_k} dist(x,y)$
    • Ward linkage: minimize SSE, $\sum \limits_{n=1}^{N} \sum \limits_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2$

A.C. and Linkage

  • Complete linkage tends to favour larger clusters

A.C. and Linkage

  • Average linkage partially mitigates that

A.C. and Linkage

  • Ward linkage is generally the best option, but only works with Euclidean distances
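
A minimal sketch comparing the three linkage options on the same data; the toy blobs, the number of clusters and the use of cluster sizes as a rough diagnostic are assumptions for illustration:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=0)

for linkage in ('complete', 'average', 'ward'):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X).labels_
    print(linkage, np.bincount(labels))  # how balanced are the resulting cluster sizes?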

Aprendizagem Automática

Divisive Clustering

Divisive Clustering

  • Bisecting k-Means algorithm:
    • Start with a single cluster for all examples
    • Select one cluster (largest, lowest score, ...)
    • Split cluster with k-means ($k=2$)
    • Repeat until desired number of clusters
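
A hedged sketch of this bisecting procedure built on scikit-learn's KMeans, splitting the largest cluster at each step; the function bisecting_kmeans and the toy data are illustrative, not a library routine:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def bisecting_kmeans(X, n_clusters):
    # start with a single cluster holding all examples (stored as index arrays)
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # select the largest cluster to split (other criteria: lowest score, ...)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # split it in two with k-means (k=2)
        split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[split == 0])
        clusters.append(members[split == 1])
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
labels = bisecting_kmeans(X, 4)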

Divisive Clustering

  • Splitting the largest cluster

Divisive Clustering

  • Resulting hierarchy:

Aprendizagem Automática

Divisive Clustering

  • Exhaustive search is $O(2^n)$
  • Top-down clustering requires an auxiliary clustering algorithm at each step to split a cluster (e.g. k-means)
  • However, it may be a good option if we want few large clusters and the auxiliary clustering algorithm is fast

Aprendizagem Automática

Clustering Features

Clustering Features

  • Clustering can also be used for dimensionality reduction
  • Clustering features lets us group similar features together, average each group, and extract a smaller set of new features
  • E.g. a matrix of examples and features:

Clustering Features

  • Transposing the matrix, we treat features as examples

Clustering Features

  • Clustering will group similar features together

  • Then agglomerate into smaller set of features

Clustering Features

  • Example: handwritten digits data set:

Clustering Features

  • Example: handwritten digits data set
  • Each digit is represented with $8 \times 8 = 64$ features
  • To reduce the dimensionality, we transpose the data: the 1797 examples of 64 features become 64 "examples" of 1797 features
  • Then we cluster the 64 features into 16 clusters of similar features
    • Connectivity is restricted to adjacent pixels, so features are only grouped with others from the same region of the image

Clustering Features

  • Original data:

Clustering Features


import numpy as np
from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images                          # 1797 images of 8x8 pixels
X = np.reshape(images, (len(images), -1))       # 1797 examples x 64 features

# connectivity graph linking each pixel to its neighbours in the 8x8 grid
connectivity = grid_to_graph(images[0].shape[0], images[0].shape[1])

# group the 64 pixel features into 16 clusters, respecting the grid connectivity
agglo = cluster.FeatureAgglomeration(connectivity=connectivity,
                                     n_clusters=16)
agglo.fit(X)
X_reduced = agglo.transform(X)                  # 1797 x 16: each new feature averages a cluster
X_restored = agglo.inverse_transform(X_reduced) # back to 1797 x 64 for visualization

Clustering Features

  • Feature clusters, with connectivity restricted to adjacent pixels

Clustering Features

  • Original data:

Clustering Features

  • Restored data (same size, repeated averages):

Clustering Features

  • Reduced data ($4 \times 4 = 16$ features):

Hierarchical Clustering

Summary

Hierarchical Clustering

Summary

  • Nested clusters
  • Measures: examples and clusters (linkage)
  • Bottom-up: Agglomerative Clustering
  • Top-down: divisive (bisecting k-means)
  • Effects of different linkage options
  • Feature agglomeration with hierarchical clustering

Further reading

  • Alpaydin, 7.7
  • Optional: Scikit-learn documentation on clustering:
http://scikit-learn.org/stable/modules/clustering.html