Comparing Clusters (061)

Comparing Clusters illustrates how the same dataset can form different clusters when run through various algorithms.

All the blocks in the Bronx are clustered by four different algorithms: k-means, Gaussian mixture modeling (GMM), agglomerative clustering, and affinity propagation. Clicking on a block in one representation isolates the block’s corresponding cluster in each of the other mappings and sets those clusters to the same random color. Because there is no one-to-one translation between clusters across algorithms, the changing colors let users incrementally construct a color scheme particular to the blocks they click, as sketched below.
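The project performs this selection in the browser; purely to make the mapping concrete, here is a minimal Python sketch (not the project's code) of the logic: clicking block i selects, for every algorithm, the whole cluster containing i, and all of those blocks share one random color.

import random

def select_clusters(block_index, labels_by_algorithm):
    """labels_by_algorithm maps an algorithm name to a list of cluster labels,
    one label per block, all in the same block order."""
    color = "#%06x" % random.randint(0, 0xFFFFFF)  # one random color for the whole selection
    selection = {}
    for name, labels in labels_by_algorithm.items():
        cluster = labels[block_index]
        selection[name] = [i for i, label in enumerate(labels) if label == cluster]
    return color, selection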

Presenting multiple representations side by side challenges the authority and finality of any single algorithm. Through comparison, the different interpretations of, and parameters for, similarity become evident.

Technicals

The map is drawn with d3.js using the geoMercator projection, with each algorithm rendered to its own canvas.

The shape descriptors developed in Looking for The Same (058) are used as the input data for each algorithm. Where an algorithm requires the number of clusters to be specified, it was set to ten. The scikit-learn Python library was used for all four: k-means, Gaussian mixture modeling, agglomerative hierarchical clustering, and affinity propagation.
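A minimal sketch of how the four estimators might be run with scikit-learn follows; the random matrix X is only a placeholder standing in for the shape descriptors (one row of features per block), not the project's actual data or pipeline.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation
from sklearn.mixture import GaussianMixture

# Placeholder stand-in for the shape descriptors (one row of features per block).
X = np.random.default_rng(0).random((200, 16))

labels = {
    "kmeans": KMeans(n_clusters=10).fit_predict(X),
    "gmm": GaussianMixture(n_components=10).fit(X).predict(X),
    "agglomerative": AgglomerativeClustering(n_clusters=10).fit_predict(X),
    # Affinity propagation chooses its own number of clusters.
    "affinity propagation": AffinityPropagation().fit_predict(X),
}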

K-Means uses a centroid model of clustering, in which similarity is derived from the distance of each data point to a fixed number of centroids. In each iteration, data points are assigned to their nearest centroid, and each centroid then moves to the average position of the points assigned to it. This repeats until the centroids stabilize.
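As a rough illustration of that loop (a NumPy sketch of the idea, not scikit-learn's optimized implementation):

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[assignments == j].mean(axis=0) if np.any(assignments == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids have stabilized
            break
        centroids = new_centroids
    return assignments, centroids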

Gaussian Mixture Modeling is a distribution-based, probabilistic model: clusters correspond to Gaussian distributions, and each data point is assigned according to the probability that it belongs to each distribution.
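A short sketch of that probabilistic assignment in scikit-learn, again with a placeholder X standing in for the block descriptors:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).random((200, 16))  # placeholder for the block descriptors

gmm = GaussianMixture(n_components=10).fit(X)
hard_labels = gmm.predict(X)        # most probable component for each block
soft_labels = gmm.predict_proba(X)  # shape (n_blocks, 10): probability per component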

Agglomerative Hierarchical Clustering uses the distance between data points as its measure of similarity. Each data point starts in its own cluster, and the closest clusters are then merged step by step as the allowed distance grows.
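To make the merge sequence visible, here is a sketch using SciPy's hierarchy module rather than scikit-learn (the project itself used scikit-learn); X is again a placeholder:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((200, 16))    # placeholder for the block descriptors

Z = linkage(X, method="ward")                     # full history of pairwise merges
labels = fcluster(Z, t=10, criterion="maxclust")  # cut the tree into ten clusters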

Affinity Propagation is a type of “message passing” model that finds exemplars, data points that represent clusters, within the data set. The number of clusters does not need to be specified, and every data point is a potential exemplar. Messages are exchanged between pairs of data points until a set of exemplars and corresponding clusters emerges. Aneesha Bakharia provides a nice write-up on affinity propagation.
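A sketch of how scikit-learn exposes those exemplars (X is a placeholder, and on toy data the algorithm may not converge cleanly):

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.random.default_rng(0).random((200, 16))  # placeholder for the block descriptors

ap = AffinityPropagation().fit(X)
exemplar_indices = ap.cluster_centers_indices_  # data points chosen as exemplars
labels = ap.labels_                             # cluster assignment for each block
print(len(exemplar_indices), "clusters found")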
