In this lesson, we'll learn about estimating the probability of n-grams and see how we can approximate the probability of a document using n-gram probabilities.
After completing this lesson, you'll be able to ...
Before you start, ensure you're ...
In the previous lesson, we learned some of the basics of probability. Now we'll look at how that can be applied to estimate the probability of some sequence (ex. a sentence, document, etc.).
Estimating the probability of some sequence? Why would we want to do such a thing? Have you ever been frustrated by a strange autocorrect or autocomplete? That was the result of a poor language model. A language model is a probability distribution over sequences of words or characters estimated using large text corpora. Sequence probabilities play an important role in a number of natural language processing tasks including machine translation, speech recognition, named entity recognition, and part of speech tagging.
Given their broad usage, let's take a look at how to estimate sequence probabilities...
We can compute the probability of a sequence using the chain rule which tells us that the joint probability of an arbitrary sequence of words is just a product of conditional probabilities:
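One common way to write this, using $w_1, \ldots, w_n$ for the words in the sequence, is:

$$
P(w_1, w_2, \ldots, w_n) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
$$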
Using a more concrete example ...
The chain rule is really just an extension of the definition of conditional probability and its relation to the joint probability.
If we wanted to apply the chain rule to estimate the likelihood of a sentence, though, we'd have a hard time estimating many of these conditional probabilities: most of the longer histories will never appear in any corpus we could collect (there are an infinite number of possible sentences).
If we can't compute the probability of an arbitrarily long sequence directly, how can we at least approximate it?
One option is to make a powerful simplifying assumption: that we can approximate the complete history using a more limited context.
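In symbols, if we let $k$ stand for the size of that limited context, the approximation can be written as:

$$
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1})
$$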
This simplification is known as the Markov assumption.
For example, with $k = 1$, $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$.
Language models that use this kind of simplifying assumption are known as Markov chains. The size of the history window determines the order of the Markov chain.
An order zero Markov chain makes no use of history and only ever considers the current word.
An order 1 Markov chain makes use of the immediately preceding token (i.e., a bigram model).
| Order | n-gram |
|---|---|
| 0 | unigram |
| 1 | bigram |
| 2 | trigram |
The higher the order of the model used, the rarer the n-grams become.
To practice, let's write out what the probability calculation would be using different order Markov chains for the following sequence:
the man in black fled across the desert
To understand the extreme of this simplifying assumption, let's examine the Order 0 Markov model:
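Under an Order 0 (unigram) model, every word is treated as independent of the words around it, so one way to write the approximation for our example sentence is:

$$
P(\text{the man in black fled across the desert}) \approx P(\text{the}) \, P(\text{man}) \, P(\text{in}) \, P(\text{black}) \, P(\text{fled}) \, P(\text{across}) \, P(\text{the}) \, P(\text{desert})
$$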
One problem with this strong independence assumption is that ungrammatical sequences can end up with higher probabilities than grammatical sequences: a string of very frequent words like "the the the the the the the the" can easily score higher than our example sentence, because common words such as "the" have much larger unigram probabilities than rarer words like "fled" or "desert".
Another issue with the lowest order model is that it is invariant to word order: under an Order 0 model, "the man in black fled across the desert" and "desert the across fled black in man the" receive exactly the same probability, because a product of unigram probabilities doesn't change when the words are shuffled.
As a point of comparison, let's consider an Order 1 model:
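Here each word is conditioned on the single word that precedes it, so one way to write the approximation for the same sentence is:

$$
P(\text{the man in black fled across the desert}) \approx P(\text{the}) \, P(\text{man} \mid \text{the}) \, P(\text{in} \mid \text{man}) \, P(\text{black} \mid \text{in}) \, P(\text{fled} \mid \text{black}) \, P(\text{across} \mid \text{fled}) \, P(\text{the} \mid \text{across}) \, P(\text{desert} \mid \text{the})
$$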
While it's common to use a higher order Markov chain, an Order 1 model already addresses the two shortcomings we noted about the Order 0 model.
We can estimate n-gram probabilities in the "real world" by counting occurrences in some collection of documents (a corpus):
docs = [
["SOS", "i", "like", "turtles", "EOS"],
["SOS", "i", "like", "tortoises", "EOS"],
["SOS", "some", "turtles", "are", "tortoises", "EOS"],
["SOS", "all", "tortoises", "are", "turtles", "EOS"]
]

First, let's count occurrences of each token across all documents:
| Unigram | Count |
|---|---|
| SOS | 4 |
| EOS | 4 |
| all | 1 |
| are | 2 |
| i | 2 |
| like | 2 |
| some | 1 |
| tortoises | 3 |
| turtles | 3 |
| TOTAL | 22 |
... and now our bigrams:
|  | SOS | EOS | all | are | i | like | some | tortoises | turtles |
|---|---|---|---|---|---|---|---|---|---|
| SOS | 0 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 |
| EOS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| all | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| are | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| i | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| like | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| some | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| tortoises | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| turtles | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
The first column represents the first element in the bigram ($w_{i-1}$), while the remaining columns represent the word that immediately follows ($w_i$). Cells with a value of 0 indicate the bigram wasn't observed in the data. Using these two tables, we can calculate a bigram probability such as $P(\text{i} \mid \text{SOS})$:

$$
P(\text{i} \mid \text{SOS}) = \frac{C(\text{SOS}, \text{i})}{C(\text{SOS})} = \frac{2}{4} = 0.5
$$
Notice we can also calculate this using only the second table by normalizing the count in a cell using the sum of its row ...
Think for a moment why that is the case.
The row represents the counts of all of the n-grams (bigrams) in which the given term (SOS) appears as the first element.
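To make the counting concrete, here's a minimal sketch of how we might compute these tables in Python; the `bigram_prob` helper is just an illustrative name, not something defined elsewhere in this lesson:

```python
from collections import Counter

docs = [
    ["SOS", "i", "like", "turtles", "EOS"],
    ["SOS", "i", "like", "tortoises", "EOS"],
    ["SOS", "some", "turtles", "are", "tortoises", "EOS"],
    ["SOS", "all", "tortoises", "are", "turtles", "EOS"]
]

# unigram counts across all documents
unigram_counts = Counter(token for doc in docs for token in doc)

# bigram counts across all documents
bigram_counts = Counter(
    (doc[i], doc[i + 1])
    for doc in docs
    for i in range(len(doc) - 1)
)

def bigram_prob(prev_word, word):
    # P(word | prev_word) = C(prev_word, word) / C(prev_word)
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("SOS", "i"))  # 2 / 4 = 0.5
```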
Even with the simplifying assumption, we still run into a number of problems.
However large the corpus we use to estimate our sequence probabilities, we'll never be able to include all words. How do we handle cases of unseen words? Consider an Order 0 Markov chain: assigning a zero probability to any term in the sequence means the probability of the entire sequence becomes 0. How can we accommodate an open vocabulary? One option is to replace all tokens whose frequency falls below some threshold with a special symbol such as `<UNK>` to represent unseen/unknown words. If we then tally our counts, we'll come up with a non-zero probability for `<UNK>` which we can substitute in when we attempt to estimate the probability of an unknown word.
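As a rough sketch of that idea (reusing the `docs` list from above and picking an arbitrary threshold of 2 occurrences), the replacement might look like this:

```python
from collections import Counter

# count how often each token occurs across all documents
token_counts = Counter(token for doc in docs for token in doc)

# replace tokens that occur fewer than 2 times with <UNK>
# (the threshold of 2 is an arbitrary choice for illustration)
min_count = 2
docs_unk = [
    [token if token_counts[token] >= min_count else "<UNK>" for token in doc]
    for doc in docs
]

print(docs_unk[2])  # ['SOS', '<UNK>', 'turtles', 'are', 'tortoises', 'EOS']
```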
The higher the order of the language model, the richer the contextual information that can be represented.
But there is a catch: as you increase the length of the n-gram, you encounter fewer occurrences of each one. In fact, a large number of grammatically correct n-grams will never be encountered at all and will thus inadvertently be assigned a probability of 0.
There are many ways of addressing unreliable counts and zero values for higher order n-grams through smoothing: making adjustments to a probability distribution to improve the reliability of estimates and avoid zeros.
One alternative to smoothing that works better for language modeling is a form of interpolation where we represent n-gram probabilities as a linearly interpolated sum of component probabilities. See Section 3.4.3 from the 3rd edition of Jurafsky and Martin's Speech and Language Processing for an overview of this technique.
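For a trigram model, for example, the idea looks roughly like this, with weights $\lambda_1 + \lambda_2 + \lambda_3 = 1$:

$$
\hat{P}(w_i \mid w_{i-2}, w_{i-1}) \approx \lambda_1 \, P(w_i) + \lambda_2 \, P(w_i \mid w_{i-1}) + \lambda_3 \, P(w_i \mid w_{i-2}, w_{i-1})
$$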
Probabilities range from 0 to 1. This means a product of probabilities approaches zero as the number of terms in the product increases.
When representing increasingly small values, computers are subject to arithmetic underflow: at some point the computer can no longer distinguish between a very small number and zero. If we're estimating sequence probabilities, we're likely to end up with many false zeros due to underflow. The trick for avoiding this is to just convert all probabilities to log probabilities by taking the log (using any base we like) and then summing in place of multiplication.
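To see the underflow problem in action, here's a tiny illustration (the value `1e-200` is just an arbitrarily small stand-in for a probability):

```python
# a product of two tiny probabilities underflows to exactly 0.0
p = 1e-200
print(p * p)  # 0.0 -- the true value (1e-400) is too small for a float to represent
```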
If we convert our probabilities to log probabilities, we need to use sums instead of products:
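In other words, the log of a product is the sum of the logs, whatever base we choose:

$$
\log\!\left(\prod_{i} p_i\right) = \sum_{i} \log p_i
$$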
Let's demonstrate this with code:
from math import log, isclose
p_a = .1
p_b = .4
prod = p_a * p_b
# convert to logspace, sum, and
# exponentiate to leave log space
log_base = 10
log_sum = log_base ** (log(p_a, log_base) + log(p_b, log_base))
# because of precision limitations,
# these numbers won't be identical
# use isclose to test equality
isclose(prod, log_sum)

Using sums of log probabilities gives us far more numerical stability than we had when dealing with vanishingly small products of probability chains.
You now know how to estimate the probability of an n-gram and use such probabilities to estimate the likelihood of longer sequences (ex. sentences). Let's practice ...
If , what does this tell us about and ?
What types of tokens are more frequent than others?