Overview

This lesson provides an introduction to parts of speech.

Outcomes

  • identify parts of speech
  • identify two popular tagsets
  • apply tests for identifying a part of speech

Background

What is a part of speech?

A part of speech (POS) is a category that reflects the grammatical properties of a word or token (e.g., noun, verb, adjective, or adverb). A POS tag is the label assigned to a token to mark that category.

Considered together, parts of speech are a shallow dip into syntax, or the underlying grammatical structure of sentences.

Think for a moment how certain kinds of words cannot appear together:

❌ dog the
✔️ the dog
❌ cat any
✔️ any cat

From those examples, we might generalize to sequences of part of speech tags:

❌ NOUN DET
✔️ DET NOUN

Even from just a few examples, we can see that certain tag sequences are grammatical and others are ungrammatical.
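As a minimal sketch of this idea, the Python snippet below checks a tag sequence against a small hand-written set of allowed tag pairs. The names `ALLOWED_BIGRAMS` and `is_plausible` are hypothetical and invented for illustration, and the toy set covers little more than the examples above. A real tagger or parser would learn such preferences from annotated data.

```python
# Toy illustration: a hand-written set of allowed tag pairs.
# This is an assumption for demonstration, not a real grammar of English.
ALLOWED_BIGRAMS = {
    ("DET", "NOUN"),
    ("DET", "ADJ"),
    ("ADJ", "NOUN"),
    ("NOUN", "VERB"),
}


def is_plausible(tags):
    """Return True if every adjacent pair of tags is in ALLOWED_BIGRAMS."""
    return all(pair in ALLOWED_BIGRAMS for pair in zip(tags, tags[1:]))


print(is_plausible(["DET", "NOUN"]))  # tags for "the dog"
print(is_plausible(["NOUN", "DET"]))  # tags for "dog the"
```

The first call accepts the sequence and the second rejects it, mirroring the judgments above.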

The grammaticality of a particular tag sequence can vary from language to language. Moreover, within a given language, some tag sequences are more frequent than others. For example, in Russian, adjectives can occur after a noun but more commonly precede it.

Interestingly, sentences can be grammatical without conveying any real meaning. Here is a famous example for English:

✔️ colorless green ideas sleep furiously

There are some parts of speech that appear to be universal (present in all attested natural languages), and others that are shared by groups of languages. The category boundaries between different parts of speech can also vary from language to language. For example, in English, there is a clear separation between adverbs and adjectives, but in other languages the boundary can be fuzzy or thin.

Open class vs closed class categories

One broad distinction we can make when discussing parts of speech is open vs closed classes. Open classes are those parts of speech which more easily accept new members. For example, new proper nouns appear daily in the form of company or product names. Some of those brand names have even been "verbed".

In contrast, function words such as determiners, conjunctions, auxiliary verbs, and adpositions (prepositions in English) belong to closed classes that resist being extended. When was the last time you heard a new conjunction or preposition being used?
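To make the distinction concrete, here is a small Python sketch that sorts tags into open and closed classes. It uses the UPOS tag names described later in this lesson, and the grouping follows the Universal Dependencies documentation as best we recall, so check it against the current documentation before relying on it. The names `OPEN_CLASS`, `CLOSED_CLASS`, and `tag_class` are hypothetical.

```python
# Grouping of UPOS tags into open and closed classes (an assumption based on
# our reading of the Universal Dependencies documentation).
OPEN_CLASS = {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"}
CLOSED_CLASS = {"ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ"}


def tag_class(tag):
    """Classify a UPOS tag as 'open', 'closed', or 'other'."""
    if tag in OPEN_CLASS:
        return "open"
    if tag in CLOSED_CLASS:
        return "closed"
    return "other"  # e.g. punctuation and symbols


for tag in ["PROPN", "VERB", "CCONJ", "ADP", "PUNCT"]:
    print(tag, tag_class(tag))
```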

What are part of speech tags used for?

Part of speech tags can serve as useful features when training statistical models for ...

  • syntactic parsing
  • chunking (shallow parsing)
  • named entity recognition and information extraction (events, relations, etc.)
  • sentiment analysis
  • authorship detection
  • etc.

If part of speech tags are inaccurate, the models that use them as features are unlikely to be reliable. This is an important consideration when applying tools trained on one domain (e.g., news stories) to a quite different one (e.g., chat room logs).
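As a rough illustration of what "tags as features" can look like, the sketch below builds a feature dictionary for one token from its own tag and the tags of its neighbors. The function name `token_features`, the feature names, and the `<START>`/`<END>` padding symbols are all hypothetical choices. Real systems differ in which features they use and how they encode them.

```python
def token_features(tagged, i):
    """Build a feature dict for the token at position i.

    `tagged` is a list of (word, tag) pairs for one sentence.
    """
    word, tag = tagged[i]
    # Use padding symbols when there is no neighbor on one side.
    prev_tag = tagged[i - 1][1] if i > 0 else "<START>"
    next_tag = tagged[i + 1][1] if i < len(tagged) - 1 else "<END>"
    return {
        "word.lower": word.lower(),
        "tag": tag,
        "prev_tag": prev_tag,
        "next_tag": next_tag,
    }


sentence = [("Robert", "PROPN"), ("walked", "VERB"), ("quickly", "ADV")]
print(token_features(sentence, 1))
```

If the tags in `sentence` were wrong, every feature derived from them would be wrong as well, which is why tagging errors propagate to downstream models.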

Tagsets

If you need to develop a part of speech tagger for a particular language or domain (e.g., legal documents), you will likely need to gather data and annotate it using a tagset.

Let's examine two popular tagsets...

Penn Treebank (PTB)

Devised in the 1990s under the direction of Mitch Marcus and Beatrice Santorini, the University of Pennsylvania's Treebank tagset is an English-centric set of 36 part of speech tags (plus additional tags for punctuation and symbols) that includes granular information related to tense and plurality. For details, including a full list of tags and their descriptions, see the annotation manual:

See this link for information on the tokenization strategies used in the Penn Treebank project.

Universal Parts of Speech (UPOS)

The Universal Dependencies project aims to provide

[...] a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary.

Compared to the PTB tagset, the universal part of speech (UPOS) tagset is more coarse-grained. For instance, verbs in UPOS have no tag subtypes corresponding to different tenses.
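The difference in granularity can be sketched as a many-to-one mapping from PTB tags to UPOS tags. The table below is a simplified, partial illustration and not an official conversion. For instance, it ignores the fact that auxiliary verbs receive the separate AUX tag in UPOS. The names `PTB_TO_UPOS` and `to_upos` are hypothetical.

```python
# Simplified, partial mapping from PTB tags to UPOS tags (illustrative only).
PTB_TO_UPOS = {
    # Six PTB verb tags collapse into a single UPOS tag.
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    # Singular and plural nouns are not distinguished in UPOS.
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    # Comparative and superlative forms share a tag with the base form.
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
}


def to_upos(ptb_tag):
    """Return the UPOS tag for a PTB tag, or None if it is not in the table."""
    return PTB_TO_UPOS.get(ptb_tag)


for ptb in ["VB", "VBD", "VBZ"]:
    print(ptb, "->", to_upos(ptb))
```

Note that the mapping only works in one direction: given VERB alone, we cannot recover whether the original PTB tag was VBD or VBZ.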

For more information, including language-specific examples of each tag, see the documentation:

For descriptions and examples of the tags applied to English, see the following link:

Determining a part of speech

When annotating parts of speech, there may be times when you encounter an unfamiliar token.

Substitution test

The substitution test can be used to help determine the correct part of speech to assign a token. Consider the following sentence:

Robert frambled quickly.

What is the part of speech for the pseudoword frambled? Substitute a word with a part of speech you know. For example, using the UPOS tagset, you know that "the" is always labeled as DET in English:

❌ Robert the quickly.

That doesn't make sense. Let's try a verb:

😑 Robert walk quickly.

Warmer...

😄 Robert walked quickly.

Bingo! Using the more granular PTB tagset, we'd label frambled as VBD. With the more coarse-grained UPOS, it would simply be VERB.
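The mechanical half of the substitution test can be sketched in a few lines of Python: swap probe words with known tags into the slot and list the resulting sentences for a human to judge. The code does not decide grammaticality; that judgment remains with the annotator. The names `PROBES` and `substitution_candidates` are hypothetical, and the probe list is just the three words tried above.

```python
# Probe words whose tags we already know (the three words tried above).
PROBES = [("the", "DET"), ("walk", "VERB"), ("walked", "VERB")]


def substitution_candidates(tokens, index, probes=PROBES):
    """Yield (probe_tag, sentence) with tokens[index] replaced by each probe word."""
    for word, tag in probes:
        swapped = tokens[:index] + [word] + tokens[index + 1:]
        yield tag, " ".join(swapped)


tokens = "Robert frambled quickly".split()
for tag, sentence in substitution_candidates(tokens, 1):
    print(tag, sentence)
```

The annotator reads each printed candidate and keeps the tag of whichever probe yields a natural sentence.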

👀 When discussing parsing in LING 539, we'll see the same sort of test applied to constituency.

Next steps

Practice

  • Describe some of the differences between PTB and UPOS
  • What is the PTB tag for past tense verbs?
  • Give an example of a closed class part of speech and a word that belongs to that category