Overview

In this lesson, we'll learn some of the key concepts behind clustering.

Outcomes

After completing this lesson, you'll be able to ...

  • describe the difference between a centroid and a medoid
  • calculate a medoid
  • calculate a centroid

Prerequisites

Before you start, ensure you're ...

Clustering

Clustering is a semi-supervised or unsupervised technique for partitioning data into a set of two or more clusters (groups). It is often used as an exploratory technique in data science, but it can also be used as a form of classification (i.e., assigning one or more labels to each datapoint).

Semi-supervised clustering

In semi-supervised clustering, you create a similarity-based classifier using a small portion of labeled datapoints that serve as seeds for the various clusters.

Imagine, for example, you have created a set of word representations and want to group them according to whether or not the word can behave as a noun. Maybe you've identified 10 words that can (cluster A) and another 10 that can't (cluster B). The rest of your data is unlabeled. Your goal is then to assign the remaining datapoints (word representations) to one of these two clusters by determining whether an unlabeled word representation is more similar to cluster A or cluster B.

Examples of semi-supervised clustering algorithms include _k_-nearest neighbors (_k_NN) and the nearest centroid classifier (Rocchio classifier). Many of the unsupervised clustering algorithms have variations that work in semi-supervised settings.
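The seed-based assignment described above can be sketched with a nearest centroid classifier. This is a minimal illustration, not a full implementation: the 2-D "word representations" and the `nearest_centroid` helper are hypothetical stand-ins for real embeddings.

```python
import numpy as np

# Hypothetical 2-D "word representations" serving as labeled seeds:
# cluster A = words that can behave as nouns, cluster B = words that can't.
cluster_a = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.0]])
cluster_b = np.array([[8.0, 9.0], [9.0, 8.5], [8.5, 9.5]])

# Summarize each seed cluster by its centroid (mean vector).
centroids = np.stack([cluster_a.mean(axis=0), cluster_b.mean(axis=0)])

def nearest_centroid(x, centroids):
    """Assign x to the cluster whose centroid is closest (Euclidean distance)."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(distances))

# An unlabeled point near cluster A's seeds is assigned to cluster A (index 0).
unlabeled = np.array([1.2, 1.8])
print(nearest_centroid(unlabeled, centroids))  # 0
```

Each remaining unlabeled datapoint would be assigned in the same way, by comparing its distance to each seed cluster's centroid.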

Unsupervised clustering

In unsupervised clustering, you have no labeled data. Your aim is simply to partition the data according to some similarity or distance function, but you may or may not know how many clusters you want at the end of this process.

Examples of unsupervised clustering algorithms include hierarchical agglomerative clustering (HAC), DBSCAN, k-means, and k-medoids.
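To make the idea concrete, here is a minimal k-means sketch in NumPy. It alternates two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. The toy data and function name are illustrative, and this sketch omits refinements (e.g., handling empty clusters, smarter initialization) that production implementations include.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct points from X.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups:
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = k_means(X, k=2)
```

With well-separated data like this, the first three points end up in one cluster and the last three in the other.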

Finding the middle of a cluster

Many approaches to clustering rely on measuring and comparing distance to the "middle" of various clusters. There are two primary notions of center or middle that we'll discuss: the centroid and medoid.

Centroid

Given a group or matrix of vectors $\mathbf{X}$, the centroid is the average vector of this group:

$$c_{\mathbf{X}} = \frac{1}{\vert \mathbf{X} \vert}\sum_{\mathbf{x}_{i} \in \mathbf{X}} \mathbf{x}_{i}$$

In other words, the centroid is the center of this group of vectors. While it's possible for the centroid to be one of the vectors in $\mathbf{X}$, that's uncommon.

Let's work through an example...

$$\begin{aligned} \mathbf{X} &= \begin{bmatrix} 4 & 3 & 2 \\ 11 & 3 & 1 \\ 9 & 3 & 3 \end{bmatrix} \\[2em] c_{\mathbf{X}} &= \begin{bmatrix} \frac{(4 + 11 + 9)}{3} & \frac{(3 + 3 + 3)}{3} & \frac{(2 + 1 + 3)}{3} \end{bmatrix} \\[2em] c_{\mathbf{X}} &= \begin{bmatrix} 8 & 3 & 2 \end{bmatrix} \end{aligned}$$

"Centroid (visualized)."

When using vectorized operations with a library such as NumPy, it's easier to decompose this into a column-wise summation and scalar division.
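For example, the worked matrix above can be handled with exactly that decomposition (a minimal sketch; `X.mean(axis=0)` is the equivalent one-liner):

```python
import numpy as np

X = np.array([[4, 3, 2],
              [11, 3, 1],
              [9, 3, 3]])

# Column-wise summation, then scalar division by the number of rows (vectors):
centroid = X.sum(axis=0) / X.shape[0]
print(centroid)  # [8. 3. 2.]

# NumPy also provides this directly as a column-wise mean:
assert np.array_equal(centroid, X.mean(axis=0))
```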

Medoids

Given a group of vectors $\mathbf{X}$, the medoid of $\mathbf{X}$ is the vector in $\mathbf{X}$ with the smallest average distance to every other vector in $\mathbf{X}$ according to some distance function $d$ (e.g., Euclidean distance):

$$\mathbf{X}_{\text{medoid}} = \underset{\mathbf{v} \in \mathbf{X}}{\argmin}\frac{\sum_{\mathbf{x}_{i} \in \mathbf{X}} d(\mathbf{v}, \mathbf{x}_{i})}{\vert \mathbf{X} \vert}$$
  • $d$ is some distance function.
  • $\argmin$ finds the $\mathbf{v}$ with the minimum average distance.
    • $\mathbf{v}$ is an actual vector in $\mathbf{X}$.
    • $\vert \mathbf{X} \vert$ represents the number of rows (vectors) in $\mathbf{X}$.

In other words, the medoid is the vector in some group that is the most central according to some distance metric $d$.

Let's work through an example using Euclidean distance ($d_{2}$):

$$\begin{aligned} \mathbf{X} &= \begin{bmatrix} 4 & 3 & 2 \\ 11 & 3 & 1 \\ 9 & 3 & 3 \end{bmatrix} \\[2em] \mathbf{X} &= \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \\ \mathbf{c} \end{bmatrix} \\[2em] \mathbf{X}_{\text{medoid}} &= \underset{\mathbf{v} \in \mathbf{X}}{\argmin}\frac{\sum_{\mathbf{x}_{i} \in \mathbf{X}} d_{2}(\mathbf{v}, \mathbf{x}_{i})}{\vert \mathbf{X} \vert} \\[2em] \mathbf{X}_{\text{medoid}} &= \underset{\mathbf{v} \in \mathbf{X}}{\argmin}\frac{d_{2}(\mathbf{v}, \mathbf{a}) + d_{2}(\mathbf{v}, \mathbf{b}) + d_{2}(\mathbf{v}, \mathbf{c})}{3} \\[2em] \mathbf{X}_{\text{medoid}} &= \mathbf{c} \\[2em] \mathbf{X}_{\text{medoid}} &= \begin{bmatrix} 9 & 3 & 3 \end{bmatrix} \end{aligned}$$

"Medoid (visualized)."
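The medoid computation above can also be vectorized: build the full pairwise distance matrix, average each row, and take the row with the smallest average. A minimal sketch using the same matrix as the worked example:

```python
import numpy as np

X = np.array([[4, 3, 2],
              [11, 3, 1],
              [9, 3, 3]])

# Pairwise Euclidean distance matrix: entry (i, j) = d2(x_i, x_j).
diffs = X[:, None, :] - X[None, :, :]
dists = np.linalg.norm(diffs, axis=2)

# The medoid minimizes the average distance to all vectors in X.
medoid_index = dists.mean(axis=1).argmin()
print(X[medoid_index])  # [9 3 3]
```

Note that, unlike the centroid, the result is always an actual row of `X`.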

Next steps

You now know some of the key concepts behind clustering. Let's practice ...

Practice

  • When is the medoid a better measure of centrality than the centroid?
  • Is it more efficient to compute the centroid or medoid?
  • Calculate the centroid for the following matrix:
$$\begin{bmatrix} -4 & -1 & 22 \\ 4 & -1 & 8 \\ 5 & -1 & 73 \end{bmatrix}$$
  • Calculate the medoid for the following matrix:
$$\begin{bmatrix} -4 & -1 & 22 \\ 4 & -1 & 8 \\ 5 & -1 & 73 \end{bmatrix}$$
  • Given $X = \langle 0, 1, 2, 3, 4, 5 \rangle$, what is the $\argmin$ for the following function?
$$\argmin_{x \in X} f(x) = x^{3}$$
Creative Commons License