Aprendizagem Automática (Machine Learning)

Clustering: beyond prototypes

Ludwig Krippahl

Clustering: beyond prototypes

Summary

  • Problems with prototype clustering
  • Affinity propagation clustering
  • Density-based clustering
  • Clustering validation

Clustering: beyond prototypes

Prototypes

Prototypes

  • K-means is an example of prototype-based clustering

Prototypes (k-means)

  • Good for globular clusters with similar dispersion (assuming k is correct)
  • Not good for non-globular clusters, or clusters with differing variances
  • Note: this may still be useful for applications like vector quantization

Prototypes

K-means (and k-medoids) work with a fixed number of prototypes

  • This may be useful in some applications
    • For example, for vector quantization or when we know how many clusters we want
  • But we must take this into consideration when deciding where to apply k-means and other prototype-based clustering algorithms

Prototypes: Affinity Propagation

Definitions:

  • Similarity matrix $s_{i,k}$
    • $s_{k,k}$ (the preference): propensity of $k$ for being a prototype
  • Responsibility, $r_{i,k}$, "sent" from $i$ to $k$
    • Evidence for suitability of $k$ as prototype for $i$
  • Availability $a_{i,k}$, "sent" from candidate $k$ to point $i$
    • How much $k$ is available to represent $i$ as a prototype
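For concreteness, here is a minimal NumPy sketch of a common choice of similarity: Frey and Dueck use the negative squared Euclidean distance, with the preference $s_{k,k}$ set to the median of the input similarities (also the scikit-learn default) to obtain a moderate number of clusters.

import numpy as np

def similarity_matrix(X):
    # Negative squared Euclidean distance between all pairs of points
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = -d2
    # Preference s(k,k): median of the off-diagonal similarities
    off_diag = S[~np.eye(len(X), dtype=bool)]
    np.fill_diagonal(S, np.median(off_diag))
    return S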

Prototypes: Affinity Propagation

Affinity Propagation algorithm

  • Initialise $\mathbf{R}$ and $\mathbf{A}$ to zero
  • Update $r_{i,k}$
  • $$r_{i,k} \leftarrow s_{i,k} - \underset{k'\neq k} {\mathrm{max}}\left( a_{i,k'}+s_{i,k'}\right)$$
  • In the first iteration, $r_{i,k}$ is just the similarity of $k$ relative to the other candidates
  • In later iterations, the availability $a_{i,k'}$ of candidates assigned to other prototypes becomes negative, discounting them
  • Responsibility update: candidate prototypes compete for points

Prototypes: Affinity Propagation

Affinity Propagation algorithm

  • Next, we update availability:
  • $$a_{i,k} \leftarrow \mathrm{min} \left(0, r_{k,k} + \sum \limits_{i' \not\in \{i,k\}} \mathrm{max} (0, r_{i',k})\right), \quad i \neq k $$
  • Self responsibility plus sum of positive responsibilities $k$ receives from other points
  • $$a_{k,k} \leftarrow \sum \limits_{i' \neq k} \mathrm{max} \left(0, r_{i',k}\right) $$
  • Self availability, evidence that $k$ is prototype

Prototypes: Affinity Propagation

Affinity Propagation algorithm

  • Identifying prototypes and clusters:
    • For each point $i$, compute
    • $$k' = \underset{k} {\mathrm{arg \ max}} \ a_{i,k}+r_{i,k}$$
    • If $k' = i$ then point $i$ is a prototype
    • If $k' \neq i$ then point $i$ belongs to prototype $k'$
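Putting the three updates together, here is a compact NumPy sketch of the algorithm (with damping, as used in practice to avoid oscillations; the convergence test is omitted for brevity):

import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200):
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities
    A = np.zeros((n, n))  # availabilities
    rows = np.arange(n)
    for _ in range(max_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        AS = A + S
        top = AS.argmax(axis=1)
        first = AS[rows, top]
        AS[rows, top] = -np.inf          # mask the max to find the runner-up
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[rows, top] = S[rows, top] - second
        R = damping * R + (1 - damping) * Rnew
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())   # keep r(k,k) itself in the sums
        Anew = Rp.sum(axis=0)[None, :] - Rp  # drop each point's own term
        diag = Anew.diagonal().copy()        # a(k,k): sum of positive r(i',k)
        Anew = np.minimum(0, Anew)
        np.fill_diagonal(Anew, diag)
        A = damping * A + (1 - damping) * Anew
    # Each point's prototype: argmax_k a(i,k) + r(i,k)
    return (A + R).argmax(axis=1)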

Prototypes: Affinity Propagation

  • Automatic $k$, but still based on prototypes

Prototypes: Affinity Propagation

Affinity Propagation with Scikit-Learn


class sklearn.cluster.AffinityPropagation:
   damping=0.5,          # to avoid oscillations when updating
   max_iter=200,         # stopping criteria
   convergence_iter=15,  # (iterations without change)
   preference=None,      # sets s(k,k)
   affinity='euclidean'  # or 'precomputed'

.cluster_centers_indices_
.labels_
.affinity_matrix_
.n_iter_
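A minimal usage sketch on made-up blob data (the random_state argument assumes a recent scikit-learn version):

import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy data: three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in (0, 5, 10)])

ap = AffinityPropagation(damping=0.5, random_state=0).fit(X)
print(ap.cluster_centers_indices_)  # indices of the prototype points
print(ap.labels_)                   # cluster of each point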

Clustering: beyond prototypes

DBSCAN

DBSCAN

  • Density-based spatial clustering of applications with noise
  • Neighbourhood $N_\epsilon$: all points within distance $\epsilon$
  • $p$ is a core point if $N_\epsilon(p)$ contains at least $\mathrm{minPts}$ points (those points are directly reachable from core point $p$)
  • $q$ is reachable from $p$ if it is directly reachable from $p$, or directly reachable from some core point $p'$ that is itself reachable from $p$

DBSCAN algorithm

  • Visit each point $p$ not yet processed
  • If $|N_\epsilon(p)| < \mathrm{minPts}$ then mark $p$ as (presumed) noise
  • Else, create a new cluster and add each $q \in N_\epsilon(p)$ to it
  • If $|N_\epsilon(q)| \geq \mathrm{minPts}$, $q$ is also a core point: expand the cluster with $N_\epsilon(q)$, merging as needed (see the sketch below)
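These steps translate almost directly into code; a minimal (quadratic-time) sketch using a precomputed distance matrix:

import numpy as np

def dbscan(X, eps=0.5, min_pts=5):
    # Pairwise Euclidean distances (quadratic; fine for a sketch)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    n = len(X)
    labels = np.full(n, -2)   # -2 = not yet processed
    cluster = -1
    for p in range(n):
        if labels[p] != -2:
            continue
        neigh = np.flatnonzero(dist[p] <= eps)
        if len(neigh) < min_pts:
            labels[p] = -1    # presumed noise (may become a border point)
            continue
        cluster += 1          # p is a core point: start a new cluster
        labels[p] = cluster
        queue = list(neigh)
        while queue:
            q = queue.pop()
            if labels[q] == -1:            # claim border points marked as noise
                labels[q] = cluster
            if labels[q] != -2:
                continue
            labels[q] = cluster
            q_neigh = np.flatnonzero(dist[q] <= eps)
            if len(q_neigh) >= min_pts:    # q is also a core point: expand
                queue.extend(q_neigh)
    return labels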

DBSCAN

  • Density-based clustering solves the cluster shape problems
  • And can handle noise
  • Note: core points are not prototypes

DBSCAN

DBSCAN with Scikit-Learn


class sklearn.cluster.DBSCAN:
   eps=0.5,            # epsilon for the neighbourhood
   min_samples=5,      # minPts
   metric='euclidean'  # can be callable or 'precomputed'

.core_sample_indices_
.labels_              # -1 for noise
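A minimal usage sketch, on made-up data with added uniform noise:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(3, 0.3, size=(50, 2)),
               rng.uniform(-1, 4, size=(10, 2))])   # background noise

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(np.unique(db.labels_))         # cluster ids; -1 marks noise
print(len(db.core_sample_indices_))  # number of core points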

Clustering: beyond prototypes

Clustering validation

Clustering validation

  • In supervised learning we can measure error
  • Generally, for clustering we can only assess how "good" clusters are
    • What is "good"? Depends on what we want to do.
  • Different contexts:
    • Determine if the data actually has structure
    • Determine the number of clusters
    • Evaluate how well computed clusters fit data structure (without external information)
    • Evaluate how well computed clusters fit known classification
    • Compare sets of clusters

Clustering validation

Unsupervised clustering validation

  • Measure how good clusters are without referring to external data (internal indices):
    • Cluster cohesion
    • Cluster separation
    • Sum of Squares Error
    • Silhouette score

Clustering validation

Unsupervised clustering validation

  • Cohesion of cluster $i$:
  • $$cohesion(C_i) = \sum \limits_{x \in C_i} \sum \limits_{y \in C_i} S(x,y)$$
  • Separation between clusters $i,j$:
  • $$separation(C_i, C_j) = \sum \limits_{x \in C_i} \sum \limits_{y \in C_j} S(x,y)$$

(For prototype-based clusterings):

  • Cohesion of cluster $i$: $$cohesion(C_i) = \sum \limits_{x \in C_i} S(x,\mathbf{c}_i)$$
  • Separation between clusters $i,j$: $$separation(C_i, C_j) = S(\mathbf{c}_i,\mathbf{c}_j)$$
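As an illustration, both definitions can be computed directly from a precomputed similarity matrix and a labels array (a minimal sketch; the function names are ours):

import numpy as np

def cohesion(S, labels, i):
    # Sum of pairwise similarities within cluster i
    m = labels == i
    return S[np.ix_(m, m)].sum()

def separation(S, labels, i, j):
    # Sum of similarities between clusters i and j
    return S[np.ix_(labels == i, labels == j)].sum()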

Clustering validation

Clustering validation (unsupervised)

  • Sum of squares error, for prototype-based clusterings:
  • $$SSE = \sum \limits_{k} \sum \limits_{x \in C_k} dist(x,\mathbf{c}_k)^2$$
  • Silhouette Score:
    • $a(i)$: average distance of $i$ to all points in same cluster
    • $b(i)$: average distance of $i$ to all points in the nearest other cluster (the one with lowest average distance)
    $$s(i) = \frac{b(i)-a(i)}{\mathrm{max} \left(a\left(i\right),b\left(i\right)\right)}$$
  • $-1 \leq s(i) \leq 1$, with $s(i)$ greater if $a(i)\ll b(i)$
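A direct translation of this definition, given a precomputed distance matrix (a minimal sketch; note that $a(i)$ excludes the point itself, as in the scikit-learn implementation):

import numpy as np

def silhouette_point(dist, labels, i):
    same = labels == labels[i]
    same[i] = False                      # a(i) excludes the point itself
    a = dist[i, same].mean()
    b = min(dist[i, labels == c].mean()  # nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)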

Clustering validation

Clustering validation (unsupervised)

  • Silhouette Score with Scikit-Learn

from sklearn.metrics import silhouette_score
#...
print(silhouette_score(data, labels))

Clustering validation

  • Silhouette scores for the example clusterings: 0.733, 0.628, 0.439, 0.639 and 0.336
    • (the silhouette score is better for compact clusters)

Clustering validation

  • External index: supervised, compare clusters to some known structure
  • Internal index: unsupervised, internal indicators of cohesion, separation, etc.
  • Relative index: Compare different clusterings (usually with internal indices)

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes (1988)

Clustering: beyond prototypes

Summary

  • Prototype-based: k-means, AP
  • Density-based: DBSCAN
  • Cluster validation, unsupervised
  • (Cluster validation, supervised)

Further reading

    • Affinity Propagation: Frey, B. J. and Dueck, D. "Clustering by passing messages between data points." Science 315 (2007).
    • DBSCAN: Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD'96 (1996).
    • Scikit-learn documentation on clustering