Assignment 2

Dates and rules.

This assignment is for the same groups registered for the first assignment.

The deadline for submitting the assignment is December 8, at 23:59 (plus 24 hours of tolerance for solving submission problems). You must submit a zip file containing, at least, these two files, named exactly as specified:

This is your report, in pdf format
This is a Python 3.x script that can be used to run your code for this assignment.

You can include other files in the zip file, if necessary (e.g. other python modules if you wish to separate your code into several files)

Your report may be in English or Portuguese.

Please check this page regularly for updates of clarifications. Last update: 2018-11-12.


The goal of this assignment is to examine the performance of three different clustering algorithms in the problem of clustering seismic events. The data set is a csv file with all seismic events of magnitude at least 6.5 in the last 100 years, which you can download here: tp2_data.csv. This data set was obtained from the USGS catalog. A new column, fault, was added to the original data indicating the fault line between tectonic plates closest to each seismic event. Each fault line is identified by an integer greater or equal to 0. An integer value of -1 indicates that none of the relevant fault lines was sufficiently close to the seismic event. Only those fault lines with at least 30 seismic events in the data set were considered relevant. The fault line data was also obtained from the USGS site.

Although not necessary for the assignment, for those who wish to examine the fault line information, a list of all the faults, along with their identifiers, is in file fault_names.csv and file faults.csv has a list of coordinates outlining each fault line.

The image below shows the fault lines. The thicker lines in different colours are the relevant fault lines used in the data set.

The following image shows the seismic events coloured according to the associated fault line.

And this image shows the seismic events coloured according to one of the clustering algorithms

For this assignment, you will parametrize and compare three clustering algorithms: K-Means, DBSCAN and Gaussian Mixture Models. Examine the performance of each algorithm by varying the main parameter of each one (number of clusters, neighbourhood distance ε and number of gaussian components, respectively) and using an internal index, the silhouette score, and external indexes computed from the fault line information: the Rand index, Precision, Recall, the F1 measure and the adjusted Rand index. Note that the adjusted Rand index can be computed using the adjusted_rand_score function and the silhouette score using silhouette_score, both from sklearn.metrics.

In addition, for the DBSCAN algorithm, read the original paper on this algorithm and implement the parameter selection method described:

A density-based algorithm for discovering clusters in large spatial databases with noise (1996). Martin Ester , Hans-Peter Kriegel , Jörg Sander , Xiaowei Xu. The paper also available here.

Compare the results using this method with the ε value you chose from the analysis of the diffferent indexes. As suggested by the authors, use a value of 4 for the minimum number of neighbouring points required for a point to be considered a core point. To implement this method you will need to understand it, as described in the paper. This is part of the assignment.

The main focus of this assignment should be the discussion of the advantages and disadvantages of each algorithm for this dataset, informed by an analysis of their behaviour with different parameters and the different scores used. For the DBSCAN algorithm, it is also important to discuss the adequacy of the method proposed by the authors for obtaining the value of ε in this particular case of clustering seismic events.

Guidelines for the implementation

You will have to load the longitude and latitude values from the data file and transform the coordinates into Earth-centered, Earth-fixed coordinates (x,y,z) in the following manner:

    x = RADIUS * cos(latitude * π/180) * cos(longitude * π/180)
    y = RADIUS * cos(latitude * π/180) * sin(longitude * π/180)
    z = RADIUS * sin(latitude * π/180)

where RADIUS is the average radius of the Earth (6371 km)

This is necessary because longitude and latitude coordinates cannot be used to measure distances between the seismic events. After this transformation, you should have all events distributed on the surface of a sphere, each with three coordinates (x,y,z):

You must also retain the last column of the table, fault, which indicates the assignment of each seismic event to one of the relevant fault lines between tectonic plates. A value of -1 in this column indicates the event was not assigned to any of the relevant fault lines.

When testing different parameters for your clustering algorithms, note that the silhouette and adjusted Rand scores can only be computed if you have at least 2 clusters. If all data is placed in the same cluster the program will raise an exception.

The method for selecting the ε parameter recommended by the authors of DBSCAN consists in plotting the sorted distance of each point to its fourth-nearest neighbour and setting ε to the distance corresponding to the “elbow” in the plot. You can compute this plot using the KNeighborsClassifier to fit your data and then obtaining the distances matrix to the k-nearest neighbour with the kneighbors method. This classifier requires an Y value for the classes, but you can use a vector filled with 0 or 1, since you will not be using the classifier as such, only its kneighbors method in order to obtain the distance to the fourth neighbour.

Numpy arrays have a sort method for sorting in place. This sorts the array from smallest to largest values, but you can invert any array by simply slicing it this way: a = a[::-1].

To plot your clusters, you can use the function below as a starting point. Feel free to adapt it, but you need the image Mollweide_projection_SW.jpg (surce: Wikipedia) in your work folder for this function to work.

def plot_classes(labels,lon,lat, alpha=0.5, edge = 'k'):
    """Plot seismic events using Mollweide projection.
    Arguments are the cluster labels and the longitude and latitude
    vectors of the events"""
    img = imread("Mollweide_projection_SW.jpg")        
    x = lon/180*np.pi
    y = lat/180*np.pi
    ax = plt.subplot(111, projection="mollweide")
    print(ax.get_xlim(), ax.get_ylim())
    t = ax.transData.transform(np.vstack((x,y)).T)
    clims = np.array([(-np.pi,0),(np.pi,0),(0,-np.pi/2),(0,np.pi/2)])
    lims = ax.transData.transform(clims)
    x = t[:,0]
    y= t[:,1]
    nots = np.zeros(len(labels)).astype(bool)
    diffs = np.unique(labels)    
    ix = 0   
    for lab in diffs[diffs>=0]:        
        mask = labels==lab
        nots = np.logical_or(nots,mask)        
        plt.plot(x[mask], y[mask],'o', markersize=4, mew=1,zorder=1,alpha=alpha, markeredgecolor=edge)
        ix = ix+1                    
    mask = np.logical_not(nots)    
    if np.sum(mask)>0:
        plt.plot(x[mask], y[mask], '.', markersize=1, mew=1,markerfacecolor='w', markeredgecolor=edge)

The output will be something like the images shown above. These images can be useful to help you assess each clustering algorithm, especially when comparing with the fault lines assignment. Note that you can also plot the fault lines assignment of the seismic events by using the fault information in place of the cluster labels.

Guidelines for the report

The report will be the most important part of this assignment. In the report you must:

  • Explain any preprocessing of your data. Should you standardize the data? Is it necessary to shuffle the data? Describe and justify any such processing decisions you make or justify why you did not preprocess your data.
  • Explain the three clustering algorithms, focusing on how they differ and how their characteristics may be relevant for this clustering problem.
  • Discuss how adequate or inadequate each different score used is for this clustering problem.
  • Justify your selection of the best parameter value for each clustering algorithm relating the scores, the data and the algorithm.
  • Explain your implementation of the method for selecting the ε parameter in DBSCAN.
  • Show the different graphs for the parameter optimization, on all algorithms.
  • Discuss the advantages or disadvantages of using each of these algorithms in this data set and recommend one for clustering these data, justifying your choice.

The main criteria for grading your report will be an assessment of your understanding of the clustering algorithms and the indexes used to evaluate the parameters, and their relevance for this data set.