Overview

In this lesson, we'll learn about nn-grams and see how they can be used as general features to represent words and documents.

Outcomes

After completing this lesson, you'll be able to ...

  • identify character nn-grams of various sizes (1, 2, 3, ...)
  • identify word nn-grams of various sizes (1, 2, 3, ...)
  • use nn-grams as features to represent words and documents

Prerequisites

Before you start, ...:

Sequences of characters and tokens

As we discussed in the previous lesson, using sequences of characters and tokens as features is a common practice in NLP. Rather than limiting such features to a shortlist of specific morphemes in the form of prefixes or suffixes, we can take a general approach and include all character or token sequences of some length nn. These sequences of length nn tokens or characters are known as nn-grams:

nnName
1unigrams
2bigrams
3trigrams

Beyond 3, these are simply referred to using the value of nn that is selected (ex. 4-grams).

In practice, it is not unusual to use the union of features for multiple values of nn (ex. (1-grams โˆช\cup 2-grams โˆช\cup 3-grams)).

How do we come up with the set of all nn-grams of a particular length? For example, if we wanted to generate all character bigrams, what approach should we take?

A naรฏve approach

Your first intuition might be that we can just generate all possible sequences of length 2 based on some alphabet ฮฃ\Sigma of symbols (i.e., the Cartesian product), and then check which can be found in each word we want to represent. In English, assuming we case fold, that alphabet might be the letters a-z along with digits and some punctuation (including a whitespace).

What are the problems with this approach?

For one, it ends up being a large set of features:

V=โˆฃฮฃโˆฃnV = |\Sigma|^{n} where nn is the length of the sequence we want to use.

In other words, any character can follow any character. For each of the nn letters in our alphabet, there are nn possible pairs. If this isn't clear, try manually generating all possible sequences for a very small alphabet:

ฮฃ={a,b}\Sigma = \{a, b\}

# all possible pairs of "b" & "b"
naive_char_bigrams = [
  ("a", "a"), 
  ("a", "b"), 
  ("b", "a"), 
  ("b", "b")
]

# alternatively, ...
from itertools import product
list(product(["a","b"], repeat=2))

# SPOILER ALERT: they're the same!
set(naive_char_bigrams) == set(product(["a","b"], repeat=2))

โˆฃฮฃโˆฃ=2โ†’22=4|\Sigma| = 2 \rightarrow 2^{2} = 4

Let's consider the case where ฮฃ\Sigma is comprised of only 26 letters (a-z). How many character bigrams can we generate?

from itertools import product
from string import ascii_lowercase

# count bigrams using cartesian product
total = sum(1 for bigram in product(ascii_letters, repeat=2))
print(total)

262=67626^{2} = 676

Problem: As nn increases, this number gets very big!

nnTotal character nn-grams (assuming ฮฃ=26\Sigma = 26)
126
2676
317576
4456976

โ”—|ใƒปoใƒป|โ”› โ†’\rightarrow ๅทฅ๏ฝด๏ฝด๏ฝช๏ฝช(;โ•นโŒ“โ•น)๏ฝช๏ฝช๏ฝด๏ฝดๅทฅ

There is another problem, though, which is that many of these sequences are not phonotactically valid.

Problem: We inadvertently generate many sequences that will never occur in text.

Sequences like zg, gz, qm, etc. are just not possible in English1.

Instead of generating all (im)possible sequences, why not just generate the set of all observed sequences of some length?

Observed nn-grams only

In order to generate only the observed nn-grams, we need to take one pass over all of our data. For example, if wanted to generate character trigrams to represent 200 words, we need to do the following:

For each word, construct a list of nn-grams by sliding an nn-sized window across our the word from left to right. Add each nn-gram in the result to a growing set of all nn-grams.

# NOTE: the implementation of character_ngrams 
# is left as an exercise for the reader


# our set of all features.
all_features = {}
for word in words: 
  for ngram in character_ngrams(word, n=2):
    all_features.add(ngram)

What do the bigrams look like for a single word?

res = character_ngrams("vivid", n=2)
print(res)
["vi", "iv", "vi", "id"]

One variation of this algorithm prepends and appends a special start and end symbol to each word to better detect word boundaries:

res = character_ngrams("vivid", n=2, with_start_end=True)
print(res)
["<S>v", "vi", "iv", "vi", "id", "d</S>"]

Why might this be useful?

Values for nn-gram features

When using nn-grams as features, one can elect to use binary values, counts, or (as we'll see in the next unit) probabilities.

Let's look at an example of binary and count-based representations using the previous example:

res = character_ngrams("vivid", n=2)
print(res)
["vi", "iv", "vi", "id", "d"]

Binary values

Word...viividgoos...
vivid...11100...

The dots represent omitted nn-grams observed in other words in our data (ex. "mo" from "mom", etc.).

Count-based values

Word...viividgoos...
vivid...21100...

Remember, we saw the bigram vi twice in vivid.

from collections import Counter

counts = Counter(character_ngrams("vivid", n=2))
print(counts.most_common())

Characters vs words

The algorithm for generating nn-grams works the same for characters or words/tokens. Just as a document can be conceived of as a sequence of tokens, you can think of a word as a sequence of characters.

Next steps

You now understand the basics of using nn-grams as features to represent word and document vectors. Let's practice ...

Practice

Right โ†’\rightarrow left?

  • The nn-gram algorithm described slides a window across text from left to right. Can you think of a language where you'd want to slide from right to left?

Problems with products

  • Describe some of the problems with using the Cartesian product to generate all possible sequences of some length nn? What is a practical alternative?

Why use <S> and </S>?

Recall that a popular variation of the nn-gram algorithm prepends and appends special characters to recognize nn-grams constituting the start and end of the sequence. Why might this be useful?

Token bigrams

  • List all word bigrams for the following sequence of tokens without special start and end symbols:

    • ["I", "know", "kung", "fu"]
  • List all word bigrams for the following sequence of tokens with special start and end symbols:

    • ["I", "know", "kung", "fu"]
  • List all word trigrams for the following sequence of tokens without special start and end symbols:

    • ["I", "know", "kung", "fu"]
  • List all word trigrams for the following sequence of tokens with special start and end symbols:

    • ["I", "know", "kung", "fu"]

Footnotes


  1. The same approach taken for another language would result in the same problem. โ†ฉ
cd ~/Creative Commons License