Overview

In this lesson, we'll learn about distance and similarity metrics.

Outcomes

After completing this lesson, you'll be able to ...

calculate the dot product of two vectors
calculate the Euclidean distance of two vectors
compare magnitude and direction of vectors
calculate the $p$ -norm of a vector
calculate the cosine similarity of two vectors

Prerequisites

Before you start, ensure you're ...

comfortable with the notions of vector magnitude and direction
- comfortable computing the magnitude of a vector
- comfortable normalizing vectors
comfortable thinking of n-dimensional vectors as n-dimensional points in space

Comparing distance and direction

As we saw earlier, it can be useful to think of vectors as positions in space. If we want to compare vectors, we can measure the distance between these points, differences in their direction, or a combination of these components. The most common comparison operation involving both magnitude and direction is the dot product.

Dot product

The dot product is an element-wise product and then a summation of these products:

\mathbf{a} \cdot \mathbf{b} = \sum_{i=0}^{\vert \mathbf{a} \vert} a_{i} b_{i}

where $\vert \mathbf{a} \vert = \vert \mathbf{b} \vert$ .

Note that a zero in one vector will cancel out the contribution of the corresponding dimension in the other other vector. Let's look at an example ...

\begin{aligned} \mathbf{x} &= \begin{bmatrix} 0 & 2 & 4 \end{bmatrix} \\[2em] \mathbf{y} &= \begin{bmatrix} 3 & 4 & 3 \end{bmatrix} \\[2em] \mathbf{x} \cdot \mathbf{y} &= (0 \times 3) + (2 \times 4) + (4 \times 3) \\[2em] \mathbf{x} \cdot \mathbf{y} &= 20 \end{aligned}

Note that in the example above $x_{0} = 0$ , which means the first dimension of both $\mathbf{x}$ and $\mathbf{y}$ does not contribute to $\mathbf{x} \cdot \mathbf{y}$ .

Using NumPy

Invoke the IPython interpreter via docker:

docker run -it "parsertongue/python:latest" ipython

# run using the following command:
# docker run -it "parsertongue/python:latest" ipython

import numpy as np

x = np.array([0, 2, 4])
y = np.array([3, 4, 3])

x.dot(y)

The result:

Comparing lengths

There are an infinite number of ways we can calculate distance between pairs of vectors. We'll focus on two methods, compare them, and see how generalize to the same form.

Manhattan distance

Manhattan or city block distance measures distance in a manner similar to the route an efficient taxi takes through grid-like city streets:

d_{1}(\mathbf{a}, \mathbf{b}) = \sum_{i=0}^{\vert \mathbf{a} \vert} \text{abs}(a_{i} - b_{i})

where $\mathbf{x}$ and $\mathbf{y}$ are two vectors with the same dimensions.

"Manhattan distance (visualized)." — Manhattan distance for a pair of vectors.

The Manhattan distance between $\mathbf{a}$ and $\mathbf{b}$ is length of the dotted line in the image above. Imagine an invisible building exists in the rectangle delimited by $\mathbf{a}$ and $\mathbf{b}$ . As an earthbound taxi driver starting at point $\mathbf{a}$ , you must drive around this obstacle to reach $\mathbf{b}$ .

Let's look at an example ...

\begin{aligned} \mathbf{x} &= \begin{bmatrix} 0 & 2 & 4 \end{bmatrix} \\[2em] \mathbf{y} &= \begin{bmatrix} 3 & 4 & 3 \end{bmatrix} \\[2em] d_{1}(\mathbf{x}, \mathbf{y}) &= \sum_{i=0}^{\vert \mathbf{x} \vert} \text{abs}(x_{i} - y_{i}) \\[2em] d_{1}(\mathbf{x}, \mathbf{y}) &= \text{abs}(0-3) + \text{abs}(2-4) + \text{abs}(4-3) d_{1}(\mathbf{x}, \mathbf{y}) &= 6 \end{aligned}

Euclidean distance

Euclidean distance is the most direct path between two points. It is distance as the crow flies:

d_{2}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=0}^{\vert \mathbf{a} \vert} (a_{i} - b_{i})^{2}}

where $\mathbf{x}$ and $\mathbf{y}$ are two vectors with the same dimensions.

"Euclidean distance (visualized)." — Euclidean distance for a pair of vectors.

The Euclidean distance of a vector $\mathbf{x}$ and the origin (a vector comprised of all zeroes) is the magnitude (length) of $\mathbf{x}$ .

Let's look at an example ...

\begin{aligned} \mathbf{x} &= \begin{bmatrix} 0 & 2 & 4 \end{bmatrix} \\[2em] \mathbf{y} &= \begin{bmatrix} 3 & 4 & 3 \end{bmatrix} \\[2em] d_{2}(\mathbf{x}, \mathbf{y}) &= \sqrt{\sum_{i=0}^{\vert \mathbf{x} \vert} (x_{i} - y_{i})^{2}} \\[2em] d_{2}(\mathbf{x}, \mathbf{y}) &= \sqrt{(0-3)^{2} + (2-4)^{2} + (4-3)^{2}} \\[2em] d_{2}(\mathbf{x}, \mathbf{y}) &= \sqrt{9 + 4 + 1} \\[2em] d_{2}(\mathbf{x}, \mathbf{y}) &\approx 3.742 \end{aligned}

Generalized distance

The formulae for Manhattan and Euclidean distance are very similar:

\begin{aligned} d_{1}(\mathbf{a}, \mathbf{b}) &= \sum_{i=0}^{\vert \mathbf{a} \vert} \text{abs}(a_{i} - b_{i}) \\[2em] d_{2}(\mathbf{a}, \mathbf{b}) &= \sqrt{\sum_{i=0}^{\vert \mathbf{a} \vert} (a_{i} - b_{i})^{2}} \end{aligned}

We can generalize them:

\begin{aligned} d_{1}(\mathbf{a}, \mathbf{b}) &= \sum_{i=0}^{\vert \mathbf{a} \vert} \text{abs}(a_{i} - b_{i}) \\[2em] d_{2}(\mathbf{a}, \mathbf{b}) &= \sqrt{\sum_{i=0}^{\vert \mathbf{a} \vert} (a_{i} - b_{i})^{2}} \end{aligned}

Alternatively, ...

\begin{aligned} d_{1}(\mathbf{a}, \mathbf{b}) &= [\sum_{i=0}^{\vert \mathbf{a} \vert} \text{abs}(a_{i} - b_{i})^{1}]^{\frac{1}{1}} \\[2em] d_{2}(\mathbf{a}, \mathbf{b}) &= [\sum_{i=0}^{\vert \mathbf{a} \vert} \text{abs}(a_{i} - b_{i})^{2}]^{\frac{1}{2}} \end{aligned}

We can generalize this kind of distance as ...

d_{m}(\mathbf{a}, \mathbf{b}) = [\sum_{i=0}^{\vert \mathbf{a} \vert} (a_{i} - b_{i})^{m}]^{\frac{1}{m}}

Comparing directions

In the previous lesson, we looked at how to remove the magnitude component from a vector to leave only the direction component.

Recall that notation used for the normalization term included a subscripted 2 (i.e., $\| \mathbf{x} \|_{2}$ ). Another name for this norm is the 2-norm. The 2-norm is simply the Euclidean distance of a vector $\mathbf{x}$ and the origin:

"2-norm (visualized)." — 2-norm visualized for a single vector.

Just as we saw Euclidean distance could be generalized, so can the 2-norm. The 2-norm is a special case of the more general concept of $p$ -norm when $p = 2$ .

p-norm

The $p$ -norm takes the following form:

\| \mathbf{x} \|_{p} = [\sum_{i=0}^{\vert \mathbf{x} \vert} \text{abs}(x_{i})^{p}]^{\frac{1}{p}}

Vector normalization

As we saw previously, the normalized form of a vector $\textbf{x}$ uses the 2-norm to normalize each term in some vector $\mathbf{x}$ to convert it to a vector of unit length:

\begin{aligned} \text{norm}(\mathbf{x}) &= \frac{\mathbf{x}}{\vert \vert x \vert \vert_{2}} \\[2em] &= \frac{\mathbf{x}}{\sqrt{\sum_{i=0}^{\vert \mathbf{x} \vert} \text{abs}(x_{i})^{2}}} \\[2em] \hat{\mathbf{x}} &= \frac{\mathbf{x}}{\left[\sum_{i=0}^{\vert \mathbf{x} \vert} \text{abs}(x_{i})^{2}\right]^{\frac{1}{2}}} \end{aligned}

"Vector normalization (visualized)." — Vector normalization visualized for a single vector. The Euclidean distance between the normalized vector and the origin is 1.

Ignoring the magnitude component, we can compare two unit-length vectors in terms of the angle formed between them by way of the origin.

Cosine similarity

Cosine similarity measures the cosine of angle $\theta$ formed between two normalized vectors:

\begin{aligned} cos(\mathbf{a}, \mathbf{b}) &= \frac{\mathbf{a} \cdot \mathbf{b}}{\| \mathbf{a} \|_{2} \| \mathbf{b} \|_{2}} \end{aligned}

"Cosine similarity (visualized)." — The cosine of the angle between a pair of vectors is the cosine similarity of that pair of vectors.

Next steps

You now know a few methods for comparing vectors. Let's practice ...

Practice

When does $\text{cos}(\mathbf{a}, \mathbf{b}) = \mathbf{a} \cdot \mathbf{b}$ ?
If $\sum_{i=0}^{ \vert \mathbf{x} \vert } x_{i}^{2} = 1$ , what can you say about $\mathbf{x}$ ?
What is the magnitude of a normalized vector?
Calculate cosine similarity for the following two vectors:

\begin{aligned} \mathbf{a} &= \begin{bmatrix} 4 & -3 & 16 \\ \end{bmatrix} \\[2em] \mathbf{b} &= \begin{bmatrix} 0 & 7 & 7 \\ \end{bmatrix} \end{aligned}

cd ~/