In this lesson, we'll learn about estimating the probability of n-grams and see how we can approximate the probability of a document using n-gram probabilities.
After completing this lesson, you'll be able to ...
Before you start, ensure you're ...
In the previous lesson, we learned some of the basics of probability. Now we'll look at how that can be applied to estimate the probability of some sequence (ex. a sentence, document, etc.).
Estimating the probability of some sequence? Why would we want to do such a thing? Have you ever been frustrated by a strange autocorrect or autocomplete? That was the result of a poor language model. A language model is a probability distribution over sequences of words or characters estimated using large text corpora. Sequence probabilities play an important role in a number of natural language processing tasks including machine translation, speech recognition, named entity recognition, and part of speech tagging.
Given their broad usage, let's take a look at how to estimate sequence probabilities...
We can compute the probability of a sequence using the chain rule which tells us that the joint probability of an arbitrary sequence of words is just a product of conditional probabilities:
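One common way to write this, using $w_1, \ldots, w_n$ for the words in the sequence, is:

$$
P(w_1, w_2, \ldots, w_n) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
$$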
Using a more concrete example ...
The chain rule is really just an extension of the definition of conditional probability and its relation to the joint probability.
If we wanted to apply the chain rule to estimate the likelihood of a sentence, though, we'd have a hard time estimating many of these conditional probabilities: most of the longer histories will never appear in any corpus we could collect (there are an infinite number of possible sentences).
If we can't compute the probability of an arbitrarily long sequence directly, how can we at least approximate it?
One option is to make a powerful simplifying assumption: that we can approximate the complete history using a more limited context.
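In symbols, if we let $k$ stand for the size of that limited context, the approximation can be written as:

$$
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1})
$$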
This simplification is known as the Markov assumption.
For example, with $k = 1$, $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$.
Language models that use this kind of simplifying assumption are known as Markov chains. The size of the history window determines the order of the Markov chain.
An order zero Markov chain makes no use of history and only ever considers the current word.
An order 1 Markov chain makes use of the immediately preceding token (i.e., a bigram model).
| Order | n-gram |
|---|---|
| 0 | unigram |
| 1 | bigram |
| 2 | trigram |
The higher the order of the model used, the rarer the n-grams become.
To practice, let's write out what the probability calculation would be using different order Markov chains for the following sequence:
the man in black fled across the desert
To understand the extreme of this simplifying assumption, let's examine the Order 0 Markov model:
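Under an Order 0 (unigram) model, every word is treated as independent of the words around it, so one way to write the approximation for our example sentence is:

$$
P(\text{the man in black fled across the desert}) \approx P(\text{the}) \, P(\text{man}) \, P(\text{in}) \, P(\text{black}) \, P(\text{fled}) \, P(\text{across}) \, P(\text{the}) \, P(\text{desert})
$$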
One problem with this strong independence assumption is that ungrammatical sequences can end up with higher probabilities than grammatical sequences: a string of very frequent words like "the the the the the the the the" can easily score higher than our example sentence, because common words such as "the" have much larger unigram probabilities than rarer words like "fled" or "desert".
Another issue with the lowest order model is that it is invariant to word order: under an Order 0 model, "the man in black fled across the desert" and "desert the across fled black in man the" receive exactly the same probability, because a product of unigram probabilities doesn't change when the words are shuffled.
As a point of comparison, let's consider an Order 1 model:
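Here each word is conditioned on the single word that precedes it, so one way to write the approximation for the same sentence is:

$$
P(\text{the man in black fled across the desert}) \approx P(\text{the}) \, P(\text{man} \mid \text{the}) \, P(\text{in} \mid \text{man}) \, P(\text{black} \mid \text{in}) \, P(\text{fled} \mid \text{black}) \, P(\text{across} \mid \text{fled}) \, P(\text{the} \mid \text{across}) \, P(\text{desert} \mid \text{the})
$$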
While it's common to use a higher order Markov chain, an Order 1 model already addresses the two shortcomings we noted about the Order 0 model.
We can estimate n-gram probabilities in the "real world" by counting occurrences in some collection of documents (a corpus):
docs = [
["SOS", "i", "like", "turtles", "EOS"],
["SOS", "i", "like", "tortoises", "EOS"],
["SOS", "some", "turtles", "are", "tortoises", "EOS"],
["SOS", "all", "tortoises", "are", "turtles", "EOS"]
]

First, let's count occurrences of each token across all documents:
| Unigram | Count |
|---|---|
| SOS | 4 |
| EOS | 4 |
| all | 1 |
| are | 2 |
| i | 2 |
| like | 2 |
| some | 1 |
| tortoises | 3 |
| turtles | 3 |
| TOTAL | 22 |
... and now our bigrams:
|  | SOS | EOS | all | are | i | like | some | tortoises | turtles |
|---|---|---|---|---|---|---|---|---|---|
| SOS | 0 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 |
| EOS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| all | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| are | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| i | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| like | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| some | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| tortoises | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| turtles | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
The first column represents the first element in the bigram ($w_{i-1}$), while the remaining columns represent the word that immediately follows ($w_i$). Cells with a value of 0 indicate the bigram wasn't observed in the data. Using these two tables, we can calculate a bigram probability such as $P(\text{i} \mid \text{SOS})$:

$$
P(\text{i} \mid \text{SOS}) = \frac{C(\text{SOS}, \text{i})}{C(\text{SOS})} = \frac{2}{4} = 0.5
$$
Notice we can also calculate this using only the second table by normalizing the count in a cell using the sum of its row ...
Think for a moment why that is the case.
The row represents the counts of all of the n-grams (bigrams) in which the given term (SOS) appears as the first element.
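To make the counting concrete, here's a minimal sketch of how we might compute these tables in Python; the `bigram_prob` helper is just an illustrative name, not something defined elsewhere in this lesson:

```python
from collections import Counter

docs = [
    ["SOS", "i", "like", "turtles", "EOS"],
    ["SOS", "i", "like", "tortoises", "EOS"],
    ["SOS", "some", "turtles", "are", "tortoises", "EOS"],
    ["SOS", "all", "tortoises", "are", "turtles", "EOS"]
]

# unigram counts across all documents
unigram_counts = Counter(token for doc in docs for token in doc)

# bigram counts across all documents
bigram_counts = Counter(
    (doc[i], doc[i + 1])
    for doc in docs
    for i in range(len(doc) - 1)
)

def bigram_prob(prev_word, word):
    # P(word | prev_word) = C(prev_word, word) / C(prev_word)
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("SOS", "i"))  # 2 / 4 = 0.5
```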
Even with the simplifying assumption, we still run into a number of problems.
However large the corpus we use to estimate our sequence probabilities, we'll never be able to include all words. How do we handle cases of unseen words? Consider an Order 0 Markov chain: assigning a zero probability to any term in the sequence means the probability of the entire sequence becomes 0. How can we accommodate an open vocabulary? One option is to replace all tokens whose frequency falls below some threshold with a special symbol such as `<UNK>` to represent unseen/unknown words. If we then tally our counts, we'll come up with a non-zero probability for `<UNK>` which we can substitute in when we attempt to estimate the probability of an unknown word.
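As a rough sketch of that idea (reusing the `docs` list from above and picking an arbitrary threshold of 2 occurrences), the replacement might look like this:

```python
from collections import Counter

# count how often each token occurs across all documents
token_counts = Counter(token for doc in docs for token in doc)

# replace tokens that occur fewer than 2 times with <UNK>
# (the threshold of 2 is an arbitrary choice for illustration)
min_count = 2
docs_unk = [
    [token if token_counts[token] >= min_count else "<UNK>" for token in doc]
    for doc in docs
]

print(docs_unk[2])  # ['SOS', '<UNK>', 'turtles', 'are', 'tortoises', 'EOS']
```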
The higher the order of the language model, the richer the contextual information that can be represented.
But there is a catch: as you increase the length of the n-gram, you encounter fewer occurrences of each one. In fact, a large number of grammatically correct n-grams will never be encountered at all and will thus inadvertently be assigned a probability of 0.
There are many ways of addressing unreliable counts and zero values for higher order n-grams through smoothing: making adjustments to a probability distribution to improve the reliability of estimates and avoid zeros.
One alternative to smoothing that works better for language modeling is a form of interpolation where we represent n-gram probabilities as a linearly interpolated sum of component probabilities. See Section 3.4.3 from the 3rd edition of Jurafsky and Martin's Speech and Language Processing for an overview of this technique.
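For a trigram model, for example, the idea looks roughly like this, with weights $\lambda_1 + \lambda_2 + \lambda_3 = 1$:

$$
\hat{P}(w_i \mid w_{i-2}, w_{i-1}) \approx \lambda_1 \, P(w_i) + \lambda_2 \, P(w_i \mid w_{i-1}) + \lambda_3 \, P(w_i \mid w_{i-2}, w_{i-1})
$$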
Probabilities range from 0 to 1. This means a product of probabilities approaches zero as the number of terms in the product increases.
When representing increasingly small values, computers are subject to arithmetic underflow: at some point the computer can no longer distinguish between a very small number and zero. If we're estimating sequence probabilities, we're likely to end up with many false zeros due to underflow. The trick for avoiding this is to just convert all probabilities to log probabilities by taking the log (using any base we like) and then summing in place of multiplication.
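To see the underflow problem in action, here's a tiny illustration (the value `1e-200` is just an arbitrarily small stand-in for a probability):

```python
# a product of two tiny probabilities underflows to exactly 0.0
p = 1e-200
print(p * p)  # 0.0 -- the true value (1e-400) is too small for a float to represent
```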
If we convert our probabilities to log probabilities, we need to use sums instead of products:
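In other words, the log of a product is the sum of the logs, whatever base we choose:

$$
\log\!\left(\prod_{i} p_i\right) = \sum_{i} \log p_i
$$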
Let's demonstrate this with code:
from math import log, isclose
p_a = .1
p_b = .4
prod = p_a * p_b
# convert to logspace, sum, and
# exponentiate to leave log space
log_base = 10
log_sum = log_base ** (log(p_a, log_base) + log(p_b, log_base))
# because of precision limitations,
# these numbers won't be identical
# use isclose to test equality
isclose(prod, log_sum)

Using sums of log probabilities gives us far more numerical stability than we had when dealing with vanishingly small products of probability chains.
You now know how to estimate the probability of an n-gram and use such probabilities to estimate the likelihood of longer sequences (ex. sentences). Let's practice ...
If , what does this tell us about and ?
What types of tokens are more frequent than others?