Welcome to Intro to NLP: Representing words and documents as vectors. This is an introductory-level NLP tutorial for Resbaz 2022!

In this short workshop (1-2 hours), we'll look at how to represent words and documents as vectors and compare them. These representations can be used to cluster information or train statistical classifiers for various tasks.


Natural Language Processing (NLP) is an applied field of study at the intersection of linguistics, computer science, and machine learning that examines automated ways of making sense of natural language.

NLP tools are all around us.


Virtual assistants

Spam filters

Detecting and moderating hate speech

Machine Translation

Summarization and simplification

ELI5 ("explain like I'm five"), creating accessible resources for learners, etc.

Voice cloning

Read Harry Potter and the Philosopher's Stone in the voice of Arnold Schwarzenegger

Sentiment analysis

Was that tweet a compliment or complaint about your product/company?


What happens when you search for "sneakers" on Google?

NLP is a very broad field involving text, audio (speech), images (handwriting, layout analysis, etc.), and video (ex. signed languages) data for all of the world's languages (extant and extinct). There are many different ways to represent this data. In this workshop, we'll introduce some foundational concepts for 1) representing text data by engineering features to create vector-based representations of words and documents, as well as 2) methods for comparing such representations.
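One common way to engineer such features is a bag-of-words count vector: each document becomes a list of counts over a fixed vocabulary. The sketch below is my own minimal illustration of that idea (not taken from the workshop materials):

```python
from collections import Counter

def bag_of_words(doc, vocab):
    """Represent a document as term counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]

vocab = ["i", "like", "turtles", "pizza"]
vec = bag_of_words("I like turtles and turtles like pizza", vocab)
print(vec)  # -> [1, 2, 2, 1]
```

Words outside the vocabulary (like "and" here) are simply ignored; the tutorials below look at this trade-off in more detail.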

Aside from our everyday life, NLP has made its way into just about every industry, including medicine, finance, advertising, defense, and gaming.


This workshop is meant to be accessible to people without a background in programming or advanced math (everything you need to get started was covered in high school or earlier).

To get the most out of this workshop, you should be comfortable with the basics of programming (ideally in Python) and have a working Docker installation.

If you're already familiar with the basics of programming in Python, you'll be able to follow along with the provided examples (all examples use Python 3.8).


By the end of this workshop, you will be able to ...

  • represent words and documents as vectors
  • generate character and word n-grams
  • compare vectors to find similar items
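To preview the second objective: an n-gram is just a contiguous run of n characters (or words). A minimal sketch of one way to generate them (an illustration, not the workshop's own code):

```python
def ngrams(seq, n):
    """Return all contiguous n-grams from a sequence (string or list)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

# Character bigrams from a string
print(ngrams("ninja", 2))                    # -> ['ni', 'in', 'nj', 'ja']
# Word bigrams from a tokenized sentence
print(ngrams("I like turtles".split(), 2))   # -> [['I', 'like'], ['like', 'turtles']]
```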

Location and Times

This workshop is completely virtual.


Hi! My name is Gus Hahn-Powell.

I'm a computational linguist interested in ways we can use natural language processing to accelerate scientific discovery by mining millions of scholarly documents.

Name: Gus Hahn-Powell
Email: hahnpowell AT arizona DOT edu



Complete these tutorials in the order listed:

  1. Representing words & documents

  2. n-grams

  3. Vector basics

  4. Comparing vectors


Ready to speed up your comparisons? Start by familiarizing yourself with the NumPy library for fast numerical computing:
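As a quick taste (assuming `numpy` is installed), NumPy lets you operate on whole arrays at once, without writing explicit Python loops:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Elementwise addition across the whole array
print(a + b)      # -> [5. 7. 9.]
# Dot product: 1*4 + 2*5 + 3*6
print(a.dot(b))   # -> 32.0
```

These vectorized operations are the building blocks for the similarity measures covered in the "Comparing vectors" tutorial.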


Once you've completed the above tutorials, review and practice what we've covered.

Word embeddings

While this tutorial looked at ways of engineering features for word and document vectors, it's also possible to learn representations.1

Apply what you've learned in this workshop to explore a set of pre-trained word embeddings.

  1. Load the word embeddings into a dictionary that maps each word to a numpy array.

  2. Using cosine similarity as your metric, what are the 10 most-similar words to "dog"?

  3. Average the embeddings for "pizza" and "pineapple". Using cosine similarity as your metric, what are the 10 most-similar words to this averaged embedding?

  4. For each of the following sentences, sum the embeddings for each word in the sentence:

  • That bodacious green teen with the nunchaku is a ninja turtle.
  • I glimpsed a stealthy shinobi slip silently through the shadows.
  • Turtle soup is not something you want to slurp in the company of friendly reptiles.
  • Do you put pineapple on your pizza?
  • Finish your fettuccine alfredo before taking a bite of your cannoli.

Using cosine similarity as your metric, which sentence is most similar to "I like turtles"?

Repeat your experiment by averaging the embeddings, rather than summing. What do you notice?
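The sketch below outlines steps 1 and 2. It assumes GloVe-style plain-text embeddings (one word per line, followed by its vector components); the function names `load_embeddings`, `cosine_similarity`, and `most_similar` are my own, not part of any required API:

```python
import numpy as np

def load_embeddings(path):
    """Map each word to a numpy array (assumes 'word v1 v2 ...' per line)."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.array(values, dtype=float)
    return embeddings

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar(embeddings, query_vec, k=10):
    """Return the k (word, score) pairs most similar to query_vec."""
    scores = [(w, cosine_similarity(query_vec, v)) for w, v in embeddings.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Tiny hand-made vectors to check the machinery (real embeddings have many more dimensions)
toy = {"dog": np.array([1.0, 0.0]),
       "cat": np.array([0.9, 0.1]),
       "pizza": np.array([0.0, 1.0])}
print(most_similar(toy, toy["dog"], k=2))  # "dog" ranks first, then "cat"
```

For steps 3 and 4, you can sum or average the relevant arrays with `np.sum` / `np.mean` before calling `most_similar` on the result.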

Next steps

Interested in learning more? Are you a UA student? Consider taking LING 529 (HLT I), LING 539 (Intro to Statistical NLP), and/or LING 582 (Advanced Statistical NLP).

All three are offered as 7.5-week asynchronous online courses as part of our online MS in Human Language Technology.


  1. If you're curious to understand how and why you might want to learn such representations, consider taking LING 539.
Creative Commons License