# Introduction to Clustering

## Introduction to Clustering

### Summary

• Introduction to clustering
• K-means and k-medoids
• A first (informal) look at Expectation-Maximization

# Clustering

## Clustering

• General idea: Group similar examples together
• Min difference (or max similarity) within clusters
• Max difference (or min similarity) between clusters
• Clustering requires some measure of similarity or difference

### Why clustering?

• To help understand data and relations
• Meaningful grouping is part of how we understand things
• Clustering helps make sense of complex or abundant data (WWW searches, biology, climate)
• Clustering can help identify meaningful groups
• To reduce data to relevant examples

## Clustering

• Clustering to understand data: Darwin's "tree of life". Source: Wikimedia Commons, public domain.

## Clustering

• Clustering to understand data: Identifying "Ghost Cities". Source: "Ghost Cities" Analysis [...], Chi et al., http://arxiv.org/abs/1510.08505

## Clustering

### Why clustering?

• To reduce (summarize) data:
• Cluster prototypes abstract properties from groups and can replace all the points in a group, reducing the data set
• To improve performance of data analysis or classification (often quadratic in the number of examples)
• Summarizing, compressing, finding neighbours, image segmentation, etc.

## Defining clusters

• Are there two clusters here? Or three? Four? Five?

## Defining clusters

• Several things we need to decide about the clustering (the set of clusters)
• Partitional clustering: Data is divided into groups at the same level
• Hierarchical clustering: Clusters are nested within larger clusters, in a tree

## Defining clusters

• Partitional clustering (illustration)
• Hierarchical clustering (illustration)

## Defining clusters

• Clustering, example membership
• Exclusive clustering: Each example belongs to only one cluster
• Overlapping clustering: Examples may belong to more than one cluster
• Fuzzy clustering: Each example belongs to clusters with weights $w_k \in [0,1]$
• Probabilistic clustering: As fuzzy, but $\sum \limits_{k=1}^{K} w_k = 1$
• Clustering, coverage of examples
• Complete clustering: All examples are assigned to a cluster (or clusters)
• Partial clustering: Some examples unassigned (e.g. noise, irrelevant data, etc)

## Types of clusters

• Well-separated (data dependent): Each example is more similar to all examples in its cluster than to examples in other clusters
• Prototype-based clustering: Each example is closer to its cluster's prototype than to other prototypes
• Contiguity-based clustering: Each example is closer to some example in its cluster than to any example outside
• Density-based clustering: Clusters are high-density regions (discards noise)

## Clustering

• How do we compare examples?
• (What similarity measures?)
• How are examples related to clusters?
• (All in clusters? Some? Partially or fully?)
• How do we group examples together?
• (Prototypes? Density? Affinity?)
• How meaningful are the clusters?
• (Statistically significant? Make sense?)

# K-MEANS

## K-means

• Find best clustering of $k$ clusters
• Define clusters by proximity to the mean of the cluster
• The number of centroids and clusters is predefined ($k$)
• Partitional, exclusive, complete and prototype-based

### K-means (Lloyd's) algorithm

• Start with random centroids
• Assign each example to closest centroid
• Update centroids to mean of respective cluster
• Recompute clusters, repeat until convergence
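The four steps above can be put together in a minimal, self-contained NumPy sketch (the function name and the fixed iteration cap are illustrative, and it assumes no cluster becomes empty during the updates):

```python
import numpy as np

def kmeans(data, k, max_iters=100, seed=0):
    """Lloyd's algorithm sketch with Forgy initialization."""
    rng = np.random.default_rng(seed)
    # Start with random centroids: coordinates of k random examples
    centroids = data[rng.choice(data.shape[0], k, replace=False), :].copy()
    for _ in range(max_iters):
        # Assign each example to its closest centroid
        dists = ((data[:, None, :] - centroids[None, :, :])**2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update centroids to the mean of their cluster
        # (assumes every cluster keeps at least one example)
        new_centroids = np.array([data[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Repeat until convergence (centroids stop moving)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

The double loop structure (assign, then update) is the alternating optimization we will revisit when relating k-means to EM.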

## K-means initialization

• Forgy: start with the coordinates of a random set of examples
• Random Partition: assign examples to clusters at random, then compute the means

## K-means algorithm

• Illustration: start from the original data, initialize (Forgy) and compute clusters, update centroids to the new means, recompute clusters, and repeat until convergence

## K-means algorithm

• Attribute points to centroids

def closest_centroids(data, centroids):
    # Return, for each example, the index of the closest centroid
    ys = np.zeros(data.shape[0], dtype=int)
    for ix in range(data.shape[0]):
        dists = np.sum((centroids - data[ix, :])**2, axis=1)
        ys[ix] = np.argmin(dists)
    return ys

• Move centroids to mean points

def adjust_centroids(data, centroids):
    # Move each centroid (in place) to the mean of its assigned examples
    ys = closest_centroids(data, centroids)
    for ix in range(centroids.shape[0]):
        centroids[ix, :] = np.mean(data[ys == ix, :], axis=0)


## K-means algorithm

• Initialization (Forgy method)

def forgy(data, k):
    # Use the coordinates of k random examples as initial centroids
    ixs = np.arange(data.shape[0])
    np.random.shuffle(ixs)
    return data[ixs[:k], :].copy()

• Initialization (Random Partition)

def rand_part(data, k):
    # Assign each example to a random cluster, then use the cluster means
    ys = np.random.randint(0, k, data.shape[0])
    centroids = np.zeros((k, data.shape[1]))
    for ix in range(k):
        centroids[ix, :] = np.mean(data[ys == ix, :], axis=0)
    return centroids, ys


## K-means algorithm

• Finding the best prototypes for the clusters
• Distance measure: typically Euclidean, but Minkowski is more general:
• $$D_{x,x'} = \sqrt[p]{\sum_d{|x_d-x_d'|^p}}$$
• Conceptually similar to a SOM (without the topology)
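The Minkowski distance above can be written directly in NumPy (a minimal sketch; the helper name is illustrative):

```python
import numpy as np

def minkowski(x, xp, p=2):
    # D_{x,x'} = (sum_d |x_d - x'_d|^p)^(1/p)
    # p=2 gives the Euclidean distance, p=1 the Manhattan distance
    return np.sum(np.abs(x - xp)**p)**(1.0 / p)
```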
### K-medoids

• Variant of k-means where prototypes are data points (medoids)
• Needs only a dissimilarity matrix
• Assign each example to cluster of most similar medoid
• For each medoid and example, test whether the example would make a better medoid
• Stop when no improvement possible
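The procedure above, operating only on a dissimilarity matrix, can be sketched with a naive swap search (the function name and the deterministic initialization are illustrative assumptions):

```python
import numpy as np

def k_medoids(D, k, max_iters=100):
    """K-medoids sketch: D is an (n, n) dissimilarity matrix."""
    n = D.shape[0]
    medoids = list(range(k))  # simple deterministic initialization

    def cost(meds):
        # Each example contributes its dissimilarity to the closest medoid
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iters):
        improved = False
        # For each medoid and non-medoid example, test the swap
        for i in range(k):
            for j in range(n):
                if j in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = j
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
        if not improved:
            break  # stop when no swap improves the total cost
    # Assign each example to the cluster of its most similar medoid
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```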

# EM algorithm

## Expectation-Maximization algorithm

### Background

• Set $X$ of observed data
• Set $Z$ of missing data or latent variables
• Unknown parameters $\theta$
• Likelihood function $L(\theta;X,Z) = p(X,Z|\theta)$

### (Contrast with supervised learning)

• In supervised learning we have all the data when training

## Expectation-Maximization algorithm

• Example: $\{X,Z\}$ is the complete data
• But we only have $X$, the incomplete data

• We cannot compute $L(\theta;X,Z) = p(X,Z|\theta)$ directly because $Z$ is unknown
• But we can estimate the posterior probability $p(Z|X,\theta^{old})$

### EM algorithm

• Expectation step: compute the expected values of $Z$ to get a likelihood function:
• $$\mathcal{Q}(\theta,\theta^{old}) = \mathrm{E}_{Z|X,\theta^{old}}\ln p(X,Z|\theta)$$
• Maximization step: obtain $\theta^{new}$ maximizing $\mathcal{Q}(\theta,\theta^{old})$
• $$\theta^{new} = \underset{\theta} {\mathrm{arg \ max}} \ \mathrm{E}_{Z|X,\theta^{old}}\ln p(X,Z|\theta)$$
• Repeat $\theta^{old}\leftarrow \theta^{new}$ until convergence
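As an informal illustration of the E and M steps, here is a minimal EM sketch for a 1D mixture of two Gaussians (anticipating Gaussian mixtures; the initialization and function name are illustrative assumptions):

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a 1D two-component Gaussian mixture (illustrative sketch).
    theta = (means, variances, mixing weights); Z = component of each point."""
    mu = np.array([x.min(), x.max()])       # crude initialization
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior p(Z | X, theta_old), i.e. the responsibilities
        dens = (pi * np.exp(-(x[:, None] - mu)**2 / (2 * var))
                / np.sqrt(2 * np.pi * var))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: theta_new maximizes the expected complete log-likelihood
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk
        pi = nk / x.shape[0]
    return mu, var, pi
```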

## Expectation-Maximization algorithm

### EM and k-means

• EM as an alternating optimization over different variables
• Define the distortion measure $$J = \sum \limits_{n=1}^{N} \sum \limits_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2$$ where $r_{nk}$ is 1 if $x_n$ belongs to $\mu_k$, else 0
• Expected log-likelihood tends towards: $$\mathrm{E}_Z\left(\ln p(X,Z|\pmb{\mu})\right) \rightarrow - \frac{1}{2} \sum \limits_{n=1}^{N} \sum \limits_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2 + C$$
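Given a hard assignment, the distortion $J$ is direct to compute (a sketch; the helper name is illustrative):

```python
import numpy as np

def distortion(data, centroids, labels):
    # J = sum_n sum_k r_nk ||x_n - mu_k||^2, with r_nk one-hot from labels
    return np.sum((data - centroids[labels])**2)
```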

## Expectation-Maximization algorithm

### EM and k-means

• EM is (almost) the algorithm used for k-means
• Estimate $r_{nk}$, cluster assignment, from current centroids
• Estimation of the likelihood function, $-\frac{1}{2}J+C$
• The means maximize the estimated likelihood function
• (We'll demonstrate this more formally after covering Gaussian mixtures)

# K-means Example

## K-means color quantization

• Vector quantization is a common technique in signal compression
• We'll use k-means to quantize the color values of an image to reduce the color space

## K-means color quantization

• This image has only red and green (no blue)
• So we can plot all pixels green vs red
• Load and create pixel array (red, green)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from skimage.io import imsave,imread

rosa = imread("Rosa.png")
w,h,d = rosa.shape
cols = np.reshape(rosa/255.0, (w * h, d))
image_array = cols[:,:2]

• Plot the pixel colors (red, green)

plt.figure(figsize=(12,8))
plt.xlabel('Red')
plt.ylabel('Green')
plt.scatter(image_array[:, 0], image_array[:, 1], color=cols.tolist(), s=10)
plt.axis([0,1,0,1])
plt.savefig('L17-rosa-plot.png', dpi=200,bbox_inches='tight')


## K-means color quantization

• Plot of all pixels green vs red
• Now we apply k-means with 64 centroids
• Compute k-means with 64 colors (6 bits)

n_colors = 64
kmeans = KMeans(n_clusters=n_colors).fit(image_array)
labels = kmeans.predict(image_array)
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x',
            color='k', s=200, linewidths=5)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x',
            color='w', s=150, linewidths=2)
plt.savefig('L17-rosa-plot-cs-' + str(n_colors) + '.png',
            dpi=200, bbox_inches='tight')


## K-means color quantization

• Convert each pixel color to its closest centroid and save

c_cols = np.zeros(cols.shape)
for ix in range(cols.shape[0]):
    c_cols[ix, :2] = centroids[labels[ix]]
# ...plot
plt.scatter(image_array[:, 0], image_array[:, 1], color=c_cols.tolist(), s=10)
# ...
imsave('rosa-comp' + str(n_colors) + '.png', np.reshape(c_cols, (w, h, 3)))


## K-means color quantization

• Compress from 24 bits to 6 bits
• Compress more: with 8 centroids, converting pixel colors to centroids compresses from 24 bits to 3 bits

# Summary

## Clustering

### Summary

• Clustering: grouping data by similarity
• Hierarchical vs Partitional, Complete vs Partial
• Exclusive, Overlapping, Fuzzy or Probabilistic
• Types of clusters: Prototype, Contiguity, Density
• K-means: prototype, partitional
• EM algorithm: estimate, maximize, repeat

### Further reading

• Bishop, Section 9.1 (and 9.3, intro, for EM)
• Alpaydin, Section 7.3
• Marsland, Chapter 9 (also relates k-means to SOM)