In this lesson, we'll learn about *n*-grams and see how they can be used as general features to represent words and documents.
After completing this lesson, you'll be able to ...
Before you start, ...:
As we discussed in the previous lesson, using sequences of characters and tokens as features is a common practice in NLP. Rather than limiting such features to a shortlist of specific morphemes in the form of prefixes or suffixes, we can take a general approach and include all character or token sequences of some length *n*. These sequences of *n* tokens or characters are known as *n*-grams:
*n* | Name |
---|---|
1 | unigrams |
2 | bigrams |
3 | trigrams |
Beyond 3, these are simply referred to using the value of *n* that is selected (ex. 4-grams).
In practice, it is not unusual to use the union of features for multiple values of *n* (ex. 1-grams ∪ 2-grams ∪ 3-grams).
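For instance, the combined feature set is just the union of the individual feature sets. Here is a toy sketch (the sets below are hard-coded purely for illustration; in practice they would be collected from your data):

# toy example: combining character n-gram features for n = 1, 2, 3
# (these sets are hard-coded for illustration only)
unigrams = {"v", "i", "d"}
bigrams = {"vi", "iv", "id"}
trigrams = {"viv", "ivi", "vid"}

# the union of features across n = 1, 2, 3
all_ngram_features = unigrams | bigrams | trigrams
print(sorted(all_ngram_features))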
How do we come up with the set of all *n*-grams of a particular length? For example, if we wanted to generate all character bigrams, what approach should we take?
Your first intuition might be that we can just generate all possible sequences of length 2 based on some alphabet of symbols (i.e., the Cartesian product), and then check which of them can be found in each word we want to represent. In English, assuming we case fold, that alphabet might be the letters a-z along with digits and some punctuation (including whitespace).
What are the problems with this approach?
For one, it ends up being a large set of features:

$$|\Sigma|^{n}$$

where $\Sigma$ is our alphabet and $n$ is the length of the sequence we want to use.

In other words, any character can follow any character. For each of the $|\Sigma|$ letters in our alphabet, there are $|\Sigma|$ possible pairs. If this isn't clear, try manually generating all possible sequences for a very small alphabet:
# all possible pairs of "a" & "b"
naive_char_bigrams = [
    ("a", "a"),
    ("a", "b"),
    ("b", "a"),
    ("b", "b")
]
# alternatively, ...
from itertools import product
list(product(["a","b"], repeat=2))
# SPOILER ALERT: they're the same!
set(naive_char_bigrams) == set(product(["a","b"], repeat=2))
Let's consider the case where our alphabet is comprised of only 26 letters (a-z). How many character bigrams can we generate?
from itertools import product
from string import ascii_lowercase
# count bigrams using cartesian product
total = sum(1 for bigram in product(ascii_lowercase, repeat=2))
print(total)
Problem: As *n* increases, this number gets very big!
*n* | Total character *n*-grams (assuming a 26-letter alphabet) |
---|---|
1 | 26 |
2 | 676 |
3 | 17576 |
4 | 456976 |
┗|・o・|┛ 工エエェェ(;╹⌓╹)ェェエエ工
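A quick sanity check of the numbers in the table above:

# total possible character n-grams for a 26-letter alphabet
for n in range(1, 5):
    print(n, 26 ** n)
# 1 26
# 2 676
# 3 17576
# 4 456976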
There is another problem, though, which is that many of these sequences are not phonotactically valid.
Problem: We inadvertently generate many sequences that will never occur in text.
Sequences like `zg`, `gz`, `qm`, etc. are just not possible in English¹.
Instead of generating all (im)possible sequences, why not just generate the set of all observed sequences of some length?
In order to generate only the observed *n*-grams, we need to take one pass over all of our data. For example, if we wanted to generate character bigrams to represent 200 words, we need to do the following:
For each word, construct a list of *n*-grams by sliding an *n*-sized window across the word from left to right. Add each *n*-gram in the result to a growing set of all *n*-grams.
# NOTE: the implementation of character_ngrams
# is left as an exercise for the reader
# our set of all features.
all_features = set()
for word in words:
    for ngram in character_ngrams(word, n=2):
        all_features.add(ngram)
What do the bigrams look like for a single word?
res = character_ngrams("vivid", n=2)
print(res)
["vi", "iv", "vi", "id"]
One variation of this algorithm prepends and appends a special start and end symbol to each word to better detect word boundaries:
res = character_ngrams("vivid", n=2, with_start_end=True)
print(res)
["<S>v", "vi", "iv", "vi", "id", "d</S>"]
Why might this be useful?
When using *n*-grams as features, one can elect to use binary values, counts, or (as we'll see in the next unit) probabilities.
Let's look at an example of binary and count-based representations using the previous example:
res = character_ngrams("vivid", n=2)
print(res)
["vi", "iv", "vi", "id", "d"]
Word | ... | vi | iv | id | go | os | ... |
---|---|---|---|---|---|---|---|
vivid | ... | 1 | 1 | 1 | 0 | 0 | ... |
The dots represent omitted *n*-grams observed in other words in our data (ex. "mo" from "mom", etc.).
Word | ... | vi | iv | id | go | os | ... |
---|---|---|---|---|---|---|---|
vivid | ... | 2 | 1 | 1 | 0 | 0 | ... |
Remember, we saw the bigram `vi` twice in `vivid`.
from collections import Counter
counts = Counter(character_ngrams("vivid", n=2))
print(counts.most_common())
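To make the distinction concrete, here is a small sketch that turns those counts into a count-based vector and a binary vector. The feature ordering below is hypothetical; in practice it would come from the full set of *n*-grams collected over all of our words (e.g., `sorted(all_features)`):

# a hypothetical, fixed ordering of features (columns)
features = ["vi", "iv", "id", "go", "os"]

# Counter returns 0 for features we never observed
count_vector = [counts[f] for f in features]
binary_vector = [1 if counts[f] > 0 else 0 for f in features]

print(count_vector)   # [2, 1, 1, 0, 0]
print(binary_vector)  # [1, 1, 1, 0, 0]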
The algorithm for generating *n*-grams works the same for characters or words/tokens. Just as a document can be conceived of as a sequence of tokens, you can think of a word as a sequence of characters.
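For example, a token-level version only needs to slide the window over a list of tokens instead of a string (the function name here is just for illustration):

def token_ngrams(tokens, n=2):
    """Slide an n-sized window across a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(token_ngrams(["the", "cat", "sat"], n=2))
# [('the', 'cat'), ('cat', 'sat')]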
You now understand the basics of using *n*-grams as features to represent word and document vectors. Let's practice ...
Why use `<S>` and `</S>`? Recall that a popular variation of the *n*-gram algorithm prepends and appends special characters to recognize *n*-grams constituting the start and end of the sequence. Why might this be useful?
List all word bigrams for the following sequence of tokens without special start and end symbols:
["I", "know", "kung", "fu"]
List all word bigrams for the following sequence of tokens with special start and end symbols:
["I", "know", "kung", "fu"]
List all word trigrams for the following sequence of tokens without special start and end symbols:
["I", "know", "kung", "fu"]
List all word trigrams for the following sequence of tokens with special start and end symbols:
["I", "know", "kung", "fu"]