This lesson provides an introduction to parts of speech.
What is a part of speech?
A part of speech (POS) tag is a category that reflects the grammatical properties of a word or token (e.g., noun, verb, adjective, or adverb).
Considered together, parts of speech are a shallow dip into syntax, or the underlying grammatical structure of sentences.
Think for a moment how certain kinds of words cannot appear together:
From those examples, we might generalize to sequences of part of speech tags:
NOUN DET
DET NOUN
Even from just a few examples, we can see that certain tag sequences are grammatical and others are ungrammatical.
The grammaticality of a particular tag sequence can vary from language to language. Moreover, certain sequences of tags are more frequent in a particular language. For example, in Russian, adjectives can occur after a noun but more commonly precede nouns.
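One way to make "more frequent" concrete is to count adjacent tag pairs in a tagged corpus. The following is a minimal sketch, assuming a tiny hand-tagged English sample that was made up for illustration. The sentences, the variable names, and the `tag_bigram_counts` helper are not part of any library or of the original lesson.

```python
from collections import Counter

# Toy, hand-tagged sentences (illustrative data only, not a real corpus).
tagged_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
    [("a", "DET"), ("small", "ADJ"), ("cat", "NOUN"), ("slept", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("slept", "VERB")],
]


def tag_bigram_counts(sentences):
    """Count adjacent tag pairs within each sentence."""
    counts = Counter()
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        # zip(tags, tags[1:]) pairs each tag with the tag that follows it.
        counts.update(zip(tags, tags[1:]))
    return counts


counts = tag_bigram_counts(tagged_sentences)
for (first, second), n in counts.most_common():
    print(first, second, n)
```

In this toy sample the pair DET NOUN is attested while NOUN DET never occurs. A zero count in a small sample only suggests that a sequence is rare. It does not prove that the sequence is ungrammatical.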
Interestingly, sentences can be grammatical without conveying any real meaning. Here is a famous example for English:
Colorless green ideas sleep furiously.
There are some parts of speech that appear to be universal (present in all attested natural languages), and others that are shared by groups of languages. The category boundaries between different parts of speech can also vary from language to language. For example, in English, there is a clear separation between adverbs and adjectives, but in other languages the boundary can be fuzzy or thin.
One broad distinction we can make when discussing parts of speech is open vs. closed classes. Open classes are those parts of speech that readily accept new members. For example, new proper nouns appear almost daily in the form of company or product names. Some of those brand names have even been "verbed".
In contrast, function words such as determiners, conjunctions, auxiliary verbs, and adpositions (prepositions in English) make up closed classes that resist being extended. When was the last time you heard a new conjunction or preposition being used?
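The open vs. closed distinction can be written down as a simple lookup. The sketch below uses the tag names of the Universal POS tagset, which is introduced later in this lesson, grouped the way the Universal Dependencies documentation groups them. The `word_class` function and the constant names are hypothetical helpers written for this example.

```python
# UPOS tags grouped into open-class, closed-class, and other.
OPEN_CLASS = {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"}
CLOSED_CLASS = {"ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ"}
OTHER = {"PUNCT", "SYM", "X"}


def word_class(upos_tag):
    """Return 'open', 'closed', or 'other' for a UPOS tag."""
    if upos_tag in OPEN_CLASS:
        return "open"
    if upos_tag in CLOSED_CLASS:
        return "closed"
    if upos_tag in OTHER:
        return "other"
    raise ValueError(f"not a UPOS tag: {upos_tag!r}")


for tag in ("PROPN", "VERB", "CCONJ", "ADP"):
    print(tag, word_class(tag))
```

A grouping like this is a convenience for the tagset, not a claim about every language. As noted above, category boundaries can differ across languages.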
Part of speech tags can serve as useful features when training statistical models for ...
If part of speech tags are inaccurate, the models that use them as features are unlikely to be reliable. This is an important consideration when applying tools trained on one domain (e.g., news stories) to something quite different (e.g., chat room logs).
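Before relying on a tagger in a new domain, it helps to measure how often its tags agree with a small hand-annotated sample from that domain. The following is a minimal sketch of token-level accuracy. The `tagging_accuracy` function is a hypothetical helper, and the gold and predicted tags are made-up values for illustration, not output from a real tagger.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Made-up example: one of four tokens is mis-tagged.
gold = ["PRON", "VERB", "ADV", "PUNCT"]
predicted = ["PROPN", "VERB", "ADV", "PUNCT"]

accuracy = tagging_accuracy(gold, predicted)
print(f"accuracy: {accuracy:.2f}")
```

Token-level accuracy is only one possible measure. A noticeable drop on in-domain samples is a sign that the tagger may need domain-specific training data.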
If you need to develop a part of speech tagger for a particular language or domain (e.g., legal documents), you will likely need to gather and annotate data.
Let's examine two popular tagsets...
Devised in the 1990s under the direction of Mitch Marcus and Beatrice Santorini, the University of Pennsylvania's Treebank tagset is an English-centric set of 36 tags that includes granular information related to tense and plurality. For details, including a full list of tags and their descriptions, see the annotation manual:
See this link for information on the tokenization strategies used in the Penn Treebank project.
The Universal Dependencies project aims to provide
[...] a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary.
Compared to the PTB tagset, the universal part of speech (UPOS) tagset is more coarse-grained. For instance, verbs in UPOS have no tag subtypes corresponding to different tenses.
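The difference in granularity can be illustrated with the PTB verb tags. The sketch below lists the six PTB verb tags and collapses them to a single coarse tag. It is a simplification made for illustration: in UPOS, auxiliaries receive AUX rather than VERB, so a real conversion also needs the word and its context. The `coarse_verb_tag` function is a hypothetical helper, not part of any conversion tool.

```python
# The six PTB verb tags and what each one encodes.
PTB_VERB_TAGS = {
    "VB": "base form",
    "VBD": "past tense",
    "VBG": "gerund or present participle",
    "VBN": "past participle",
    "VBP": "non-3rd person singular present",
    "VBZ": "3rd person singular present",
}


def coarse_verb_tag(ptb_tag):
    """Collapse a PTB verb tag to UPOS VERB; return None for other tags.

    Simplifying assumption: auxiliaries (UPOS AUX) are ignored here.
    """
    return "VERB" if ptb_tag in PTB_VERB_TAGS else None


for tag, description in PTB_VERB_TAGS.items():
    print(f"{tag:4} ({description}) -> {coarse_verb_tag(tag)}")
```

The mapping only works in one direction. Once the tense information is collapsed into VERB, it cannot be recovered from the coarse tag alone.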
For more information, including language-specific examples of each tag, see the documentation:
For descriptions and examples of the tags applied to English, see the following link:
When annotating parts of speech, there may be times when you encounter an unfamiliar token.
The substitution test can be used to help determine the correct part of speech to assign a token. Consider the following sentence:
Robert frambled quickly.
What is the part of speech for the pseudoword frambled? Substitute a word with a part of speech you know. For example, using the UPOS tagset, you know that "the" is always labeled as DET in English:
That doesn't make sense. Let's try a verb:
Warmer...
Bingo! Using the more granular PTB tagset, we'd label frambled as VBD. With the more coarse-grained UPOS, it would simply be VERB.
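The mechanical part of the substitution test can be sketched in a few lines: swap words with known tags into the slot of the unfamiliar token and read the results. The probe words below and the `substitute` helper are assumptions chosen for illustration. Deciding which candidate sounds grammatical is still left to the annotator.

```python
def substitute(tokens, index, replacement):
    """Return a copy of tokens with the token at index replaced."""
    return tokens[:index] + [replacement] + tokens[index + 1:]


sentence = ["Robert", "frambled", "quickly", "."]
target_index = sentence.index("frambled")

# Hypothetical probe words paired with their UPOS tags.
probes = [("the", "DET"), ("eat", "VERB"), ("ate", "VERB")]

candidates = [substitute(sentence, target_index, word) for word, _ in probes]
for (word, upos), candidate in zip(probes, candidates):
    print(f"{upos:5} {' '.join(candidate)}")
```

Each printed line pairs a known tag with a candidate sentence, so the annotator can compare the candidates side by side.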