# Hierarchical clustering

## Hierarchical clustering

### Summary

• Hierarchical Clustering
• Agglomerative Clustering
• Divisive Clustering
• Clustering Features

## Hierarchical clustering

• Grouping groups of groups (...) (figure; source: Wikipedia)

## Hierarchical clustering

• Clustering gene activity patterns (figure; source: Wikipedia)

## Hierarchical clustering

• Can be represented as a dendrogram (figure; source: Wikipedia)

## Hierarchical clustering

• Need to measure how alike examples are:
• Proximity: Generic term for "likeness"
• Similarity: Measure of how alike, generally $\in [0,1]$
• Dissimilarity: Measure of difference.
• Distance is a special case of dissimilarity:
$$d(x,y)\geq 0 \ , \ d(x,y)=d(y,x) \ , \ d(x,z)\leq d(x,y) + d(y,z)$$
• Some measures of distance between examples (sketched below):
• Euclidean: $\|x-y\|_2 = \sqrt{\sum\limits_{d}(x_d-y_d)^2}$
• Squared Euclidean: $\|x-y\|_2^2 = \sum\limits_{d}(x_d-y_d)^2$
• Manhattan: $\|x-y\|_1 = \sum\limits_{d}|x_d-y_d|$
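A minimal sketch of these three distances with NumPy; the vectors `x` and `y` are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # same as np.linalg.norm(x - y)
squared_euclidean = np.sum((x - y) ** 2)
manhattan = np.sum(np.abs(x - y))          # same as np.linalg.norm(x - y, ord=1)
```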

## Hierarchical clustering

• Mahalanobis: normalized by the covariance $$\sqrt{(x-y)^T Cov^{-1} (x-y)}$$
• Hamming: number of positions where strings differ $$d(x,y) = \sum \limits_i [x_i \neq y_i]$$
• Levenshtein: minimum number of edits: insertions, substitutions, deletions
• (many problem-dependent measures; the string distances are sketched below)
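A sketch of the two string distances; `hamming` and `levenshtein` are hypothetical helpers, the latter implementing the standard dynamic-programming recurrence:

```python
def hamming(x, y):
    """Number of positions where two equal-length strings differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def levenshtein(x, y):
    """Minimum number of insertions, substitutions and deletions."""
    prev = list(range(len(y) + 1))            # distances from "" to y[:j]
    for i, a in enumerate(x, 1):
        curr = [i]                            # distance from x[:i] to ""
        for j, b in enumerate(y, 1):
            curr.append(min(prev[j] + 1,             # deletion
                            curr[j - 1] + 1,         # insertion
                            prev[j - 1] + (a != b))) # substitution
        prev = curr
    return prev[-1]

hamming("karolin", "kathrin")     # 3
levenshtein("kitten", "sitting")  # 3
```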

### Also need to compare groups (linkage)

• (Some measures assume euclidean distance between examples)

## Hierarchical clustering

• Single linkage: $dist(C_j,C_k) = min\left(dist(x \in C_j, y \in C_k)\right)$

## Hierarchical clustering

• Complete linkage: $dist(C_j,C_k) = max\left(dist(x \in C_j, y \in C_k)\right)$

## Hierarchical clustering

• Centroid linkage: $dist(C_j,C_k) = dist\left(\frac{\sum_{x \in C_j} x}{|C_j|} , \frac{\sum_{y \in C_k} y}{|C_k|}\right)$

## Hierarchical clustering

### More examples of linkage

• Average linkage $$dist(C_j,C_k) = mean\left(dist(x \in C_j, y \in C_k)\right)$$
• Median linkage $$dist(C_j,C_k) = median\left(dist(x \in C_j, y \in C_k)\right)$$
• Ward linkage: minimize SSE $$\sum \limits_{n=1}^{N} \sum \limits_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2$$
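A sketch of these linkages using SciPy's `cdist` for the pairwise distances; the clusters `Cj` and `Ck` are illustrative:

```python
import numpy as np
from scipy.spatial.distance import cdist

Cj = np.array([[0.0, 0.0], [0.0, 1.0]])
Ck = np.array([[3.0, 0.0], [4.0, 1.0], [4.0, 0.0]])

D = cdist(Cj, Ck)       # all pairwise distances, shape (|Cj|, |Ck|)
single   = D.min()      # closest pair of examples
complete = D.max()      # farthest pair of examples
average  = D.mean()     # mean over all pairs
median   = np.median(D) # median over all pairs
centroid = np.linalg.norm(Cj.mean(axis=0) - Ck.mean(axis=0))
```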

## Hierarchical clustering

• Some way to compare examples: distance, similarity, etc...
• Some way to compare clusters (linkage): single, complete, etc...

### Advantages:

• No need to assume number of clusters
• Hierarchical organization may correspond to some aspect of the data (e.g. phylogeny)

### Disadvantages:

• Single pass: greedy local decisions cannot be undone and may be wrong
• Hierarchical organization may be confusing or reflect idiosyncrasies of the clustering algorithm

## Hierarchical clustering

### Clustering algorithms

• Agglomerative clustering (bottom-up)
• Start with singleton clusters, join best two (linkage), repeat until all joined (see the sketch after this list)
• Generally $O(n^3)$, but can be better with linkage constraints
• Divisive clustering (top-down)
• Start with single cluster, pick cluster to split, repeat until all examples separated or level $k$ reached
• Generally $O(2^n)$ for exhaustive search, and needs additional clustering algorithm for splitting.
• But can be better if we only want a few levels of clustering from the top.
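A minimal sketch of the agglomerative loop with single linkage, assuming a small array `X`; the nested search for the best pair is what makes the naive version $O(n^3)$:

```python
import numpy as np
from scipy.spatial.distance import cdist

def agglomerate(X, n_clusters):
    """Naive single-linkage agglomerative clustering (O(n^3))."""
    clusters = [[i] for i in range(len(X))]       # start with singletons
    while len(clusters) > n_clusters:
        best_j, best_k, best_d = 0, 1, np.inf
        for j in range(len(clusters)):            # find the closest pair
            for k in range(j + 1, len(clusters)):
                d = cdist(X[clusters[j]], X[clusters[k]]).min()
                if d < best_d:
                    best_j, best_k, best_d = j, k, d
        clusters[best_j] += clusters.pop(best_k)  # join the two closest
    return clusters
```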

# Agglomerative Clustering

## Agglomerative Clustering

• Start with singleton clusters

## Agglomerative Clustering

• Join the closest clusters (linkage function), repeat

## Agglomerative Clustering

• Result represented in a dendrogram

## Agglomerative Clustering

• The result is a hierarchy of clusters
• But we may want a partitional clustering
• The solution is to select a level on the dendrogram, as in the sketch below
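With SciPy this selection is a cut of the linkage matrix; a sketch, with illustrative random data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(20, 2)                        # illustrative data
Z = linkage(X, method='ward')                    # full merge hierarchy
labels = fcluster(Z, t=3, criterion='maxclust')  # level yielding 3 clusters
dendrogram(Z)                                    # plot the hierarchy
plt.show()
```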

## Agglomerative Clustering

• Cutting the dendrogram at different levels yields partitions with two, three, four, or five clusters (sequence of figures)

## Agglomerative Clustering

• Connectivity constraints
• Agglomerative clustering is generally $O(n^3)$, not good for large datasets
• Also, we may not want clustering to aggregate solely by distance

## Agglomerative Clustering

• We can prevent this by providing some structure via connectivity constraints
• Connectivity constraints define the graph of connections between examples
• Only clusters with connected examples can be joined
• Forces clustering to respect structure and can greatly speedup computation
• With Scikit-Learn, we can use the nearest neighbours graph:
• (kneighbors_graph returns a sparse $N \times N$ matrix with 1 for connected pairs)

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# X: array of shape (n_samples, n_features)
# sparse N x N graph linking each example to its 10 nearest neighbours
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

# only clusters connected in the graph can be merged
ward = AgglomerativeClustering(n_clusters=6, connectivity=connectivity,
                               linkage='ward').fit(X)
```

Based on this Scikit-Learn tutorial.

## Agglomerative Clustering

• Without constraints, clusters can reach out over space (figure)

## Agglomerative Clustering

• Constraints speed up and guide the clustering (figure)

## Agglomerative Clustering

• Using AC with Scikit-Learn:

```python
class sklearn.cluster.AgglomerativeClustering:
    # arguments
    n_clusters=2,          # number of clusters
    affinity='euclidean',  # distance between examples
    connectivity=None,     # connectivity constraints
    linkage='ward'         # 'ward', 'complete', 'average'

    # attributes
    labels_                # array [n_samples]
    children_              # array, shape (n_nodes-1, 2)
```
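A quick look at these attributes on toy data (values illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0], [0.1], [5.0], [5.1]])
model = AgglomerativeClustering(n_clusters=2, linkage='average').fit(X)
model.labels_    # cluster of each example, e.g. array([0, 0, 1, 1])
model.children_  # the two nodes merged at each of the n_nodes-1 steps
```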

• Three linkage options available in Scikit-Learn
• Complete linkage $dist(C_j,C_k) = max\left(dist(x \in C_j, y \in C_k)\right)$
• Average linkage $dist(C_j,C_k) = mean\left(dist(x \in C_j, y \in C_k)\right)$
• Ward linkage: minimize SSE, $\sum \limits_{n=1}^{N} \sum \limits_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2$

## A.C. and Linkage

• Complete linkage tends to favour larger clusters (figure)

## A.C. and Linkage

• Average linkage partially solves that (figure)

## A.C. and Linkage

• Ward linkage is generally best, but requires Euclidean distance (figure)

# Divisive Clustering

## Divisive Clustering

• Bisecting k-Means algorithm (sketched below):
• Start with a single cluster for all examples
• Select one cluster (largest, lowest score, ...)
• Split cluster with k-means ($k=2$)
• Repeat until desired number of clusters
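A sketch of the algorithm above, using Scikit-Learn's KMeans for the binary splits and always splitting the largest cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters):
    """Top-down clustering: repeatedly split the largest cluster with k=2."""
    clusters = [np.arange(len(X))]        # single cluster with all examples
    while len(clusters) < n_clusters:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        split = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        clusters.append(idx[split == 0])  # replace it by its two halves
        clusters.append(idx[split == 1])
    return clusters
```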

## Divisive Clustering

• Splitting on the largest cluster (figure)

## Divisive Clustering

• Resulting hierarchy (figure)

## Divisive Clustering

• Exhaustive search is $O(2^n)$
• Top-down clustering requires an auxiliary clustering algorithm (e.g. k-means) to split a cluster at each step
• However, it may be a good option if we want few large clusters and the auxiliary clustering algorithm is fast

# Clustering Features

## Clustering Features

• Clustering can also be used for dimensionality reduction
• Clustering features allows us to group similar features, average them, and extract new features
• E.g. a matrix of examples and features (figure)

## Clustering Features

• Transposing the matrix, features become the examples (figure)

## Clustering Features

• Clustering will group similar features together

• Then agglomerate them into a smaller set of features

## Clustering Features

• Example: handwritten digits data set (figure)

## Clustering Features

• Example: handwritten digits data set
• Each digit is represented with $8 \times 8 = 64$ features
• To reduce, we convert the 1797 examples of 64 features into 64 examples of 1797 features
• Then we cluster the 64 into 16 clusters of similar features
• But we restrict connectivity to adjacent pixels, so merged features come from the same region of the image

## Clustering Features

• Original data (figure)

## Clustering Features


```python
import numpy as np
from sklearn import datasets, cluster
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images                            # shape (1797, 8, 8)
X = np.reshape(images, (len(images), -1))         # shape (1797, 64)

# connectivity over the 8x8 pixel grid: only adjacent pixels are linked
connectivity = grid_to_graph(*images[0].shape)
agglo = cluster.FeatureAgglomeration(connectivity=connectivity,
                                     n_clusters=16)
agglo.fit(X)
X_reduced = agglo.transform(X)                    # shape (1797, 16)
X_restored = agglo.inverse_transform(X_reduced)   # back to shape (1797, 64)
```


## Clustering Features

• Feature clusters, connectivity restricted to adjacent pixels (figure)

## Clustering Features

• Original data (figure)

## Clustering Features

• Restored data: same size, with each cluster average repeated (figure)

## Clustering Features

• Reduced data: $4 \times 4 = 16$ features (figure)

# Summary

## Hierarchical Clustering

### Summary

• Nested clusters
• Measures: examples and clusters (linkage)
• Bottom-up: Agglomerative Clustering
• Top-down: divisive (bisecting k-means)
• Effects of different linkage options
• Feature agglomeration with hierarchical clustering

### Further reading

• Alpaydin, 7.7
• Optional: Scikit-learn documentation on clustering:
http://scikit-learn.org/stable/modules/clustering.html