# Clustering: beyond prototypes

### Summary

• Problems with prototype clustering
• Affinity propagation clustering
• Density-based clustering
• Clustering validation

# Prototypes

## Prototypes (k-means)

• K-means is an example of prototype-based clustering
• Good for globular clusters with similar dispersion (assuming $k$ is correct)
• Not good for non-globular clusters or clusters with differing variances
• Note: this may still be useful for applications like vector quantization

## Prototypes

### K-means (and k-medoids): fixed number of prototypes

• This may be useful in some applications
• For example, for vector quantization, or when we know how many clusters we want
• But we must take these limitations into account when deciding whether to apply k-means or other prototype-based clustering algorithms
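
The shape limitation above can be demonstrated with scikit-learn. A minimal sketch (dataset generators and parameter values are illustrative choices, not from the slides):

```python
# K-means recovers globular blobs but splits non-globular "moons".
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Globular clusters with similar dispersion: k-means recovers them well.
X_blobs, y_blobs = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                              cluster_std=0.5, random_state=0)
blob_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_blobs)

# Non-globular clusters: k-means cuts across the two moons.
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Agreement with the true generating labels (1.0 = perfect recovery).
print("blobs ARI:", adjusted_rand_score(y_blobs, blob_labels))
print("moons ARI:", adjusted_rand_score(y_moons, moon_labels))
```

The adjusted Rand index stays near 1 for the blobs and drops well below 1 for the moons, even though $k$ is correct in both cases.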

## Prototypes: Affinity Propagation

### Definitions

• Similarity matrix $s_{i,k}$
  • $s_{k,k}$: propensity of $k$ for being a prototype
• Responsibility $r_{i,k}$, "sent" from point $i$ to candidate $k$
  • Evidence for the suitability of $k$ as a prototype for $i$
• Availability $a_{i,k}$, "sent" from candidate $k$ to point $i$
  • How available $k$ is to represent $i$ as a prototype

## Prototypes: Affinity Propagation

### Affinity Propagation algorithm

• Initialise $\mathbf{R}$ and $\mathbf{A}$ to zero
• Update $r_{i,k}$
• $$r_{i,k} \leftarrow s_{i,k} - \underset{k'\neq k} {\mathrm{max}}\left( a_{i,k'}+s_{i,k'}\right)$$
• On the first iteration, $r_{i,k}$ is just the similarity of $k$ relative to the other candidates
• On later iterations, the availability $a_{i,k'}$ of competing candidates becomes negative as points are assigned to other prototypes, discounting those candidates
• Responsibility update: candidate prototypes compete for points

## Prototypes: Affinity Propagation

### Affinity Propagation algorithm

• Next, we update availability:
• $$a_{i,k} \leftarrow \mathrm{min} \left(0, r_{k,k} + \sum \limits_{i' \not\in \{i,k\}} \mathrm{max} (0, r_{i',k})\right), \quad i \neq k$$
• Self-responsibility plus the sum of positive responsibilities $k$ receives from other points
• $$a_{k,k} \leftarrow \sum \limits_{i' \neq k} \mathrm{max} \left(0, r_{i',k}\right)$$
• Self-availability: evidence that $k$ is a prototype

## Prototypes: Affinity Propagation

### Affinity Propagation algorithm

• Identifying prototypes and clusters:
• For each point $i$, compute
• $$k' = \underset{k} {\mathrm{arg \ max}} \ a_{i,k}+r_{i,k}$$
• If $k' = i$ then point $i$ is a prototype
• If $k' \neq i$ then point $i$ belongs to prototype $k'$
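
The three steps above can be sketched directly in NumPy. This is a minimal illustration of the message-passing updates, not a production implementation (it uses a fixed iteration count; the damping term, as in scikit-learn, avoids oscillations):

```python
import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.5):
    """Run the responsibility/availability updates on a similarity matrix S."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r[i, k]
    A = np.zeros((n, n))  # availabilities  a[i, k]
    idx = np.arange(n)
    for _ in range(n_iter):
        # r[i,k] <- s[i,k] - max_{k' != k} (a[i,k'] + s[i,k'])
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[idx, best]
        AS[idx, best] = -np.inf          # exclude k' = k for the best column
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[idx, best] = S[idx, best] - second
        R = damping * R + (1 - damping) * Rnew
        # a[i,k] <- min(0, r[k,k] + sum_{i' not in {i,k}} max(0, r[i',k]))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, 0)
        col = Rp.sum(axis=0)             # sum over i' != k of max(0, r[i',k])
        Anew = np.minimum(0, R.diagonal() + col[None, :] - Rp)
        # a[k,k] <- sum_{i' != k} max(0, r[i',k])
        np.fill_diagonal(Anew, col)
        A = damping * A + (1 - damping) * Anew
    # Each point's prototype: argmax_k (a[i,k] + r[i,k])
    return np.argmax(A + R, axis=1)
```

The returned array gives, for each point $i$, the index $k'$ of its prototype; points with $k' = i$ are the exemplars.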

## Prototypes: Affinity Propagation

• Automatic $k$, but still based on prototypes

## Prototypes: Affinity Propagation

### Affinity Propagation with Scikit-Learn


```
class sklearn.cluster.AffinityPropagation:
    damping=0.5           # to avoid oscillations when updating
    max_iter=200,         # stop criteria
    convergence_iter=15,  # stop criteria
    preference=None       # sets s(k,k)
    affinity='euclidean'  # or 'precomputed'

# Fitted attributes:
.cluster_centers_indices_
.labels_
.affinity_matrix_
.n_iter_
```
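
A usage sketch, assuming three well-separated blobs (the `preference` value and data are illustrative; with the negative-squared-distance affinity, a more negative preference yields fewer exemplars):

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Three tight, well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.5, random_state=0)

ap = AffinityPropagation(damping=0.5, max_iter=200, preference=-50,
                         random_state=0).fit(X)

# Number of clusters is chosen automatically from the data.
print("clusters:", len(ap.cluster_centers_indices_))
print("first labels:", ap.labels_[:5])
```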


# DBSCAN

## DBSCAN

• Density-based spatial clustering of applications with noise
• Neighbourhood $N_\epsilon$: all points within distance $\epsilon$
• $p$ is a core point if $N_\epsilon(p)$ contains at least $\mathrm{minPts}$ points (those points are directly reachable from $p$)
• $q$ is reachable from $p$ if it is directly reachable from $p$, or directly reachable from a core point $p'$ that is itself reachable from $p$

### DBSCAN algorithm

• Visit each point $p$
• If $|N_\epsilon(p)| < \mathrm{minPts}$, mark $p$ as (presumed) noise
• Else, create a cluster and, for each $q \in N_\epsilon(p)$, add $q$ to the cluster
• If $|N_\epsilon(q)| \geq \mathrm{minPts}$, expand the cluster with $q$'s neighbourhood (merging clusters if needed)
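
The algorithm above can be sketched in a few lines of Python (brute-force neighbourhood search, so only suitable for small data; `-1` marks noise, as in scikit-learn):

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=5):
    n = len(X)
    # Pairwise distances and epsilon-neighbourhoods (brute force).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)   # -1 = noise / not yet assigned
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbours[p]) < min_pts:
            continue          # already clustered, or not a core point
        # Start a new cluster and expand it from p's neighbourhood.
        labels[p] = cluster
        queue = list(neighbours[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbours[q]) >= min_pts:  # q is core: keep expanding
                    queue.extend(neighbours[q])
            # border points are assigned but not expanded
        cluster += 1
    return labels
```

Expanding each cluster through its core points implements the "merge" step: everything reachable from the starting core point ends up with the same label.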

## DBSCAN

• Density-based clustering solves cluster shape problems
• And can handle noise
• Note: core points are not prototypes

## DBSCAN

### DBSCAN with Scikit-Learn


```
class sklearn.cluster.DBSCAN:
    eps=0.5,            # epsilon for neighbourhood
    min_samples=5,      # minPts
    metric='euclidean'  # can be a callable or 'precomputed'

# Fitted attributes:
.core_sample_indices_
.labels_            # -1 for noise
```
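
A usage sketch on non-globular data, where DBSCAN succeeds and k-means would not (the two-ring dataset and the `eps`/`min_samples` values are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles

# Two concentric rings: non-globular clusters.
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks noise, so exclude it when counting clusters.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points:", (db.labels_ == -1).sum())
```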


# Clustering validation

## Clustering validation

• In supervised learning we can measure error
• Generally, for clustering we can only assess how "good" clusters are
• What is "good"? Depends on what we want to do.
• Different contexts:
• Determine if the data actually has structure
• Determine the number of clusters
• Evaluate how well computed clusters fit data structure (without external information)
• Evaluate how well computed clusters fit known classification
• Compare sets of clusters

## Clustering validation

### Unsupervised clustering validation

• Measure how good clusters are without referring to external data (internal indices):
• Cluster cohesion
• Cluster separation
• Sum of Squares Error
• Silhouette score

## Clustering validation

### Unsupervised clustering validation

• Cohesion of cluster $i$:
• $$cohesion(C_i) = \sum \limits_{x \in C_i} \sum \limits_{y \in C_i} S(x,y)$$
• Separation between clusters $i,j$:
• $$separation(C_i, C_j) = \sum \limits_{x \in C_i} \sum \limits_{y \in C_j} S(x,y)$$

### For prototype-based clusterings

• Cohesion of cluster $i$: $$cohesion(C_i) = \sum \limits_{x \in C_i} S(x,\mathbf{c}_i)$$
• Separation between clusters $i,j$: $$separation(C_i, C_j) = S(\mathbf{c}_i,\mathbf{c}_j)$$
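
The prototype-based versions above can be computed directly. A minimal sketch, using negative squared Euclidean distance as the similarity $S$ (an illustrative choice; any similarity works):

```python
import numpy as np

def similarity(x, y):
    # Illustrative similarity: negative squared Euclidean distance.
    return -np.sum((x - y) ** 2)

def cohesion(cluster, centroid):
    # Sum of similarities between each point and its prototype.
    return sum(similarity(x, centroid) for x in cluster)

def separation(centroid_i, centroid_j):
    # Similarity between the two prototypes.
    return similarity(centroid_i, centroid_j)

C1 = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2]])
C2 = np.array([[5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
c1, c2 = C1.mean(axis=0), C2.mean(axis=0)

print("cohesion C1:", cohesion(C1, c1))   # near zero: tight cluster
print("separation:", separation(c1, c2))  # large negative: far apart
```

With this sign convention, good clusterings have cohesion near zero (high within-cluster similarity) and strongly negative separation.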

## Clustering validation

### Clustering validation (unsupervised)

• Sum of squares error, for prototype-based clusterings:
• $$SSE = \sum \limits_{k} \sum \limits_{x \in C_k} dist(x,\mathbf{c}_k)^2$$
• Silhouette Score:
• $a(i)$: average distance of $i$ to all other points in the same cluster
• $b(i)$: average distance of $i$ to all points in the nearest other cluster (the one with lowest average distance)
• $$s(i) = \frac{b(i)-a(i)}{\mathrm{max} \left(a\left(i\right),b\left(i\right)\right)}$$
• $-1 \leq s(i) \leq 1$, with $s(i)$ larger when $a(i)\ll b(i)$
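
The definition above translates directly into code. A minimal sketch assuming Euclidean distance and no singleton clusters:

```python
import numpy as np

def silhouette(X, labels):
    n = len(X)
    # Pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        # a(i): mean distance to the other points of i's own cluster
        a = D[i, same & (np.arange(n) != i)].mean()
        # b(i): lowest mean distance to the points of another cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()   # overall score: mean of per-point silhouettes
```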

## Clustering validation

### Clustering validation (unsupervised)

• Silhouette Score with Scikit-Learn

```python
from sklearn.metrics import silhouette_score
#...
print(silhouette_score(data, labels))
```


## Clustering validation

• Example silhouette scores for different clusterings (figures omitted): 0.733, 0.628, 0.439, 0.639, 0.336
• Note: the silhouette score favours compact clusters

## Clustering validation

• External index: supervised, compare clusters to some known structure
• Internal index: unsupervised, internal indicators of cohesion, separation, etc
• Relative index: Compare different clusterings (usually with internal indices)

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes (1988)

# Summary

• Prototype-based: k-means, AP
• Density-based: DBSCAN
• Cluster validation, unsupervised
• (Cluster validation, supervised)

### References

• Affinity Propagation: Frey, B. J., and Dueck, D. "Clustering by passing messages between data points." Science 315 (2007): 972-976.
• DBSCAN: Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD'96, 1996.
• Scikit-learn documentation on clustering