Clustering is a semi-supervised or unsupervised technique for partitioning data in a set of 2 or more clusters (groups). It is often use as an exploratory technique in data science, but it can also be used as a form of classification (i.e., assigning one or more labels to each datapoint).
Semi-supervised clustering
In semi-supervised clustering, you create a similarity-based classifier with a small portion of labeled datapoints to serve as seeds to various clusters.
Imagine, for example, you have created a set of word representations and want to group them according to whether or not the word can behave as a noun. Maybe you've identified 10 words that can (cluster A) and another 10 that can't (cluster B). The rest of your data is unlabeled. Your goal is then to assign the remaining datapoints (word representations) to one of these two clusters by determining whether an unlabeled word representation is more similar to cluster A or cluster B.
In unsupervised clustering, you have no labeled data. Your aim is simply to partition the data according to some similarity or distance function, but you may or may not know how many clusters you want at the end of this process.
Many approaches to clustering rely on measuring and comparing distance to the "middle" of various clusters. There are two primary notions of center or middle that we'll discuss: the centroid and medoid.
Centroid
Given a group or matrix of vectors X, the centroid is the average vector from this group:
cX=∣X∣1xi∈X∑xi
In other words, the centroid is the center of this group of vectors. While it's possible the centroid is one of the vectors in X, that's uncommon.
When using vectorized operations with a library such as NumPy, it's easier to decompose this into a column-wise summation and scalar division.
Medoids
Given a group of vectors X, the medoid of X is the vector in X with the smallest average distance with every other vector in X according to some distance function d (e.g. Euclidean distance):
Xmedoid=v∈Xargmin∣X∣∑xi∈Xd(v,xi)
d is some distance function.
argmin finds the v with the minimum average distance.
v is an actual vector in X
∣X∣ represents the number of rows (vectors) in X
In other words, the medoid is the vector in some group that is the most central according to some distance metric d.
Let's work through an example using Euclidean distance (d2):