Overview

In this lesson, we'll learn some of the key concepts behind clustering.

Outcomes

After completing this lesson, you'll be able to ...

  • describe the difference between a centroid and a medoid
  • calculate a medoid
  • calculate a centroid

Prerequisites

Before you start, ensure you're ...

Clustering

Clustering is a semi-supervised or unsupervised technique for partitioning data into a set of two or more clusters (groups). It is often used as an exploratory technique in data science, but it can also be used as a form of classification (i.e., assigning one or more labels to each datapoint).

Semi-supervised clustering

In semi-supervised clustering, you create a similarity-based classifier using a small portion of labeled datapoints that serve as seeds for the various clusters.

Imagine, for example, you have created a set of word representations and want to group them according to whether or not the word can behave as a noun. Maybe you've identified 10 words that can (cluster A) and another 10 that can't (cluster B). The rest of your data is unlabeled. Your goal is then to assign the remaining datapoints (word representations) to one of these two clusters by determining whether an unlabeled word representation is more similar to cluster A or cluster B.

Examples of semi-supervised clustering algorithms include _k_-nearest neighbors (_k_NN) and the nearest centroid classifier (Rocchio classifier). Many of the unsupervised clustering algorithms have variations that work in semi-supervised settings.
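The seed-based assignment described above can be sketched with a nearest centroid classifier. This is a minimal illustration, not a full implementation: the 2-D "word representations" and the `nearest_centroid` helper are hypothetical stand-ins for real embeddings.

```python
import numpy as np

# Hypothetical 2-D "word representations" serving as labeled seeds:
# cluster A = words that can behave as nouns, cluster B = words that can't.
cluster_a = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.0]])
cluster_b = np.array([[8.0, 9.0], [9.0, 8.5], [8.5, 9.5]])

# Summarize each seed cluster by its centroid (mean vector).
centroids = np.stack([cluster_a.mean(axis=0), cluster_b.mean(axis=0)])

def nearest_centroid(x, centroids):
    """Assign x to the cluster whose centroid is closest (Euclidean distance)."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(distances))

# An unlabeled point near cluster A's seeds is assigned to cluster A (index 0).
unlabeled = np.array([1.2, 1.8])
print(nearest_centroid(unlabeled, centroids))  # 0
```

Each remaining unlabeled datapoint would be assigned in the same way, by comparing its distance to each seed cluster's centroid.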

Unsupervised clustering

In unsupervised clustering, you have no labeled data. Your aim is simply to partition the data according to some similarity or distance function, but you may or may not know how many clusters you want at the end of this process.

Examples of unsupervised clustering algorithms include hierarchical agglomerative clustering (HAC), DBSCAN, k-means, and k-medoids.
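To make the idea concrete, here is a minimal k-means sketch in NumPy. It alternates two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. The toy data and function name are illustrative, and this sketch omits refinements (e.g., handling empty clusters, smarter initialization) that production implementations include.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct points from X.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups:
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = k_means(X, k=2)
```

With well-separated data like this, the first three points end up in one cluster and the last three in the other.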

Finding the middle of a cluster

Many approaches to clustering rely on measuring and comparing distance to the "middle" of various clusters. There are two primary notions of center or middle that we'll discuss: the centroid and medoid.

Centroid

Given a group or matrix of vectors $\mathbf{X}$, the centroid is the average vector of this group:

$$c_{\mathbf{X}} = \frac{1}{\vert \mathbf{X} \vert}\sum_{\mathbf{x}_{i} \in \mathbf{X}} \mathbf{x}_{i}$$

In other words, the centroid is the center of this group of vectors. While it's possible for the centroid to be one of the vectors in $\mathbf{X}$, that's uncommon.

Let's work through an example...

$$\begin{aligned} \mathbf{X} &= \begin{bmatrix} 4 & 3 & 2 \\ 11 & 3 & 1 \\ 9 & 3 & 3 \end{bmatrix} \\[2em] c_{\mathbf{X}} &= \begin{bmatrix} \frac{(4 + 11 + 9)}{3} & \frac{(3 + 3 + 3)}{3} & \frac{(2 + 1 + 3)}{3} \end{bmatrix} \\[2em] c_{\mathbf{X}} &= \begin{bmatrix} 8 & 3 & 2 \end{bmatrix} \end{aligned}$$

"Centroid (visualized)."

When using vectorized operations with a library such as NumPy, it's easier to decompose this into a column-wise summation and scalar division.
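For example, the worked matrix above can be handled with exactly that decomposition (a minimal sketch; `X.mean(axis=0)` is the equivalent one-liner):

```python
import numpy as np

X = np.array([[4, 3, 2],
              [11, 3, 1],
              [9, 3, 3]])

# Column-wise summation, then scalar division by the number of rows (vectors):
centroid = X.sum(axis=0) / X.shape[0]
print(centroid)  # [8. 3. 2.]

# NumPy also provides this directly as a column-wise mean:
assert np.array_equal(centroid, X.mean(axis=0))
```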

Medoids

Given a group of vectors $\mathbf{X}$, the medoid of $\mathbf{X}$ is the vector in $\mathbf{X}$ with the smallest average distance to every other vector in $\mathbf{X}$ according to some distance function $d$ (e.g., Euclidean distance):

$$\mathbf{X}_{\text{medoid}} = \underset{\mathbf{v} \in \mathbf{X}}{\argmin}\frac{\sum_{\mathbf{x}_{i} \in \mathbf{X}} d(\mathbf{v}, \mathbf{x}_{i})}{\vert \mathbf{X} \vert}$$
  • $d$ is some distance function.
  • $\argmin$ finds the $\mathbf{v}$ with the minimum average distance.
    • $\mathbf{v}$ is an actual vector in $\mathbf{X}$.
    • $\vert \mathbf{X} \vert$ represents the number of rows (vectors) in $\mathbf{X}$.

In other words, the medoid is the vector in some group that is the most central according to some distance metric $d$.

Let's work through an example using Euclidean distance ($d_{2}$):

$$\begin{aligned} \mathbf{X} &= \begin{bmatrix} 4 & 3 & 2 \\ 11 & 3 & 1 \\ 9 & 3 & 3 \end{bmatrix} \\[2em] \mathbf{X} &= \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \\ \mathbf{c} \end{bmatrix} \\[2em] \mathbf{X}_{\text{medoid}} &= \underset{\mathbf{v} \in \mathbf{X}}{\argmin}\frac{\sum_{\mathbf{x}_{i} \in \mathbf{X}} d_{2}(\mathbf{v}, \mathbf{x}_{i})}{\vert \mathbf{X} \vert} \\[2em] \mathbf{X}_{\text{medoid}} &= \underset{\mathbf{v} \in \mathbf{X}}{\argmin}\frac{d_{2}(\mathbf{v}, \mathbf{a}) + d_{2}(\mathbf{v}, \mathbf{b}) + d_{2}(\mathbf{v}, \mathbf{c})}{3} \\[2em] \mathbf{X}_{\text{medoid}} &= \mathbf{c} \\[2em] \mathbf{X}_{\text{medoid}} &= \begin{bmatrix} 9 & 3 & 3 \end{bmatrix} \end{aligned}$$

"Medoid (visualized)."
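The medoid computation above can also be vectorized: build the full pairwise distance matrix, average each row, and take the row with the smallest average. A minimal sketch using the same matrix as the worked example:

```python
import numpy as np

X = np.array([[4, 3, 2],
              [11, 3, 1],
              [9, 3, 3]])

# Pairwise Euclidean distance matrix: entry (i, j) = d2(x_i, x_j).
diffs = X[:, None, :] - X[None, :, :]
dists = np.linalg.norm(diffs, axis=2)

# The medoid minimizes the average distance to all vectors in X.
medoid_index = dists.mean(axis=1).argmin()
print(X[medoid_index])  # [9 3 3]
```

Note that, unlike the centroid, the result is always an actual row of `X`.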

Next steps

You now know some of the key concepts behind clustering. Let's practice ...

Practice

  • When is the medoid a better measure of centrality than the centroid?
  • Is it more efficient to compute the centroid or medoid?
  • Calculate the centroid for the following matrix:
$$\begin{bmatrix} -4 & -1 & 22 \\ 4 & -1 & 8 \\ 5 & -1 & 73 \end{bmatrix}$$
  • Calculate the medoid for the following matrix:
$$\begin{bmatrix} -4 & -1 & 22 \\ 4 & -1 & 8 \\ 5 & -1 & 73 \end{bmatrix}$$
  • Given $X = \langle 0, 1, 2, 3, 4, 5 \rangle$, what is the $\argmin$ for the following function?
$$\argmin_{x \in X} f(x) = x^{3}$$
Creative Commons License