In this lesson, we'll learn about *n*-grams and see how they can be used as general features to represent words and documents.
After completing this lesson, you'll be able to ...
Before you start, ...:
As we discussed in the previous lesson, using sequences of characters and tokens as features is a common practice in NLP. Rather than limiting such features to a shortlist of specific morphemes in the form of prefixes or suffixes, we can take a general approach and include all character or token sequences of some length *n*. These sequences of *n* tokens or characters are known as *n*-grams:
*n* | Name |
---|---|
1 | unigrams |
2 | bigrams |
3 | trigrams |
Beyond 3, these are simply referred to using the value of *n* that is selected (ex. 4-grams).
In practice, it is not unusual to use the union of features for multiple values of *n* (ex. 1-grams ∪ 2-grams ∪ 3-grams).
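For instance, the combined feature set is just the union of the individual feature sets. Here is a toy sketch (the sets below are hard-coded purely for illustration; in practice they would be collected from your data):

# toy example: combining character n-gram features for n = 1, 2, 3
# (these sets are hard-coded for illustration only)
unigrams = {"v", "i", "d"}
bigrams = {"vi", "iv", "id"}
trigrams = {"viv", "ivi", "vid"}

# the union of features across n = 1, 2, 3
all_ngram_features = unigrams | bigrams | trigrams
print(sorted(all_ngram_features))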
How do we come up with the set of all *n*-grams of a particular length? For example, if we wanted to generate all character bigrams, what approach should we take?
Your first intuition might be that we can just generate all possible sequences of length 2 based on some alphabet of symbols (i.e., the Cartesian product), and then check which of them can be found in each word we want to represent. In English, assuming we case fold, that alphabet might be the letters a-z along with digits and some punctuation (including whitespace).
What are the problems with this approach?
For one, it ends up being a large set of features:

$$|\Sigma|^{n}$$

where $\Sigma$ is our alphabet and $n$ is the length of the sequence we want to use.

In other words, any character can follow any character. For each of the $|\Sigma|$ letters in our alphabet, there are $|\Sigma|$ possible pairs. If this isn't clear, try manually generating all possible sequences for a very small alphabet:
# all possible pairs of "a" & "b"
naive_char_bigrams = [
    ("a", "a"),
    ("a", "b"),
    ("b", "a"),
    ("b", "b")
]
# alternatively, ...
from itertools import product
list(product(["a","b"], repeat=2))
# SPOILER ALERT: they're the same!
set(naive_char_bigrams) == set(product(["a","b"], repeat=2))
Let's consider the case where our alphabet is comprised of only 26 letters (a-z). How many character bigrams can we generate?
from itertools import product
from string import ascii_lowercase
# count bigrams using cartesian product
total = sum(1 for bigram in product(ascii_lowercase, repeat=2))
print(total)
Problem: As *n* increases, this number gets very big!
*n* | Total character *n*-grams (assuming a 26-letter alphabet) |
---|---|
1 | 26 |
2 | 676 |
3 | 17576 |
4 | 456976 |
┗|・o・|┛ 工エエェェ(;╹⌓╹)ェェエエ工
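A quick sanity check of the numbers in the table above:

# total possible character n-grams for a 26-letter alphabet
for n in range(1, 5):
    print(n, 26 ** n)
# 1 26
# 2 676
# 3 17576
# 4 456976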
There is another problem, though, which is that many of these sequences are not phonotactically valid.
Problem: We inadvertently generate many sequences that will never occur in text.
Sequences like `zg`, `gz`, `qm`, etc. are just not possible in English¹.
Instead of generating all (im)possible sequences, why not just generate the set of all observed sequences of some length?
In order to generate only the observed *n*-grams, we need to take one pass over all of our data. For example, if we wanted to generate character bigrams to represent 200 words, we need to do the following:
For each word, construct a list of *n*-grams by sliding an *n*-sized window across the word from left to right. Add each *n*-gram in the result to a growing set of all *n*-grams.
# NOTE: the implementation of character_ngrams
# is left as an exercise for the reader
# our set of all features.
all_features = set()
for word in words:
    for ngram in character_ngrams(word, n=2):
        all_features.add(ngram)
What do the bigrams look like for a single word?
res = character_ngrams("vivid", n=2)
print(res)
["vi", "iv", "vi", "id"]
One variation of this algorithm prepends and appends a special start and end symbol to each word to better detect word boundaries:
res = character_ngrams("vivid", n=2, with_start_end=True)
print(res)
["<S>v", "vi", "iv", "vi", "id", "d</S>"]
Why might this be useful?
When using *n*-grams as features, one can elect to use binary values, counts, or (as we'll see in the next unit) probabilities.
Let's look at an example of binary and count-based representations using the previous example:
res = character_ngrams("vivid", n=2)
print(res)
["vi", "iv", "vi", "id", "d"]
Word | ... | vi | iv | id | go | os | ... |
---|---|---|---|---|---|---|---|
vivid | ... | 1 | 1 | 1 | 0 | 0 | ... |
The dots represent omitted *n*-grams observed in other words in our data (ex. "mo" from "mom", etc.).
Word | ... | vi | iv | id | go | os | ... |
---|---|---|---|---|---|---|---|
vivid | ... | 2 | 1 | 1 | 0 | 0 | ... |
Remember, we saw the bigram `vi` twice in `vivid`.
from collections import Counter
counts = Counter(character_ngrams("vivid", n=2))
print(counts.most_common())
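To make the distinction concrete, here is a small sketch that turns those counts into a count-based vector and a binary vector. The feature ordering below is hypothetical; in practice it would come from the full set of *n*-grams collected over all of our words (e.g., `sorted(all_features)`):

# a hypothetical, fixed ordering of features (columns)
features = ["vi", "iv", "id", "go", "os"]

# Counter returns 0 for features we never observed
count_vector = [counts[f] for f in features]
binary_vector = [1 if counts[f] > 0 else 0 for f in features]

print(count_vector)   # [2, 1, 1, 0, 0]
print(binary_vector)  # [1, 1, 1, 0, 0]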
The algorithm for generating *n*-grams works the same for characters or words/tokens. Just as a document can be conceived of as a sequence of tokens, you can think of a word as a sequence of characters.
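For example, a token-level version only needs to slide the window over a list of tokens instead of a string (the function name here is just for illustration):

def token_ngrams(tokens, n=2):
    """Slide an n-sized window across a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(token_ngrams(["the", "cat", "sat"], n=2))
# [('the', 'cat'), ('cat', 'sat')]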
You now understand the basics of using *n*-grams as features to represent word and document vectors. Let's practice ...
Why use `<S>` and `</S>`? Recall that a popular variation of the *n*-gram algorithm prepends and appends special characters to recognize *n*-grams constituting the start and end of the sequence. Why might this be useful?
List all word bigrams for the following sequence of tokens without special start and end symbols:
["I", "know", "kung", "fu"]
List all word bigrams for the following sequence of tokens with special start and end symbols:
["I", "know", "kung", "fu"]
List all word trigrams for the following sequence of tokens without special start and end symbols:
["I", "know", "kung", "fu"]
List all word trigrams for the following sequence of tokens with special start and end symbols:
["I", "know", "kung", "fu"]