Overview

Welcome to Intro to NLP: Representing words and documents as vectors. This is an introductory-level NLP tutorial for Resbaz 2022!

In this short workshop (1-2 hours), we'll look at how to represent words and documents as vectors and compare them. These representations can be used to cluster information or train statistical classifiers for various tasks.

Description

Natural Language Processing (NLP) is an applied field of study at the intersection of linguistics, computer science, and machine learning that examines automated ways of making sense of natural language.

NLP tools are all around us.

Chatbots		Chat with Gandalf (GPT-based demo) ...
Virtual assistants		Siri, Alexa, Google Assistant, Mycroft, ...
SPAM filters		Email (demo)
Detecting and moderating hate speech		Demo
Machine Translation		English to Japanese (with honorifics)
Summarization and simplification		ELI5, create accessible resources for learners, etc.
Voice cloning		Read Harry Potter and the Philosopher's Stone in the voice of Arnold Schwarzenegger (Tacotron2, HiFi-GAN)
Sentiment analysis		Was that tweet a compliment or complaint about your product/company?
Search		What happens when you search for "sneakers" on Google?

NLP is a very broad field involving text, audio (speech), images (handwriting, layout analysis, etc.), and video (e.g., signed languages) data for all of the world's languages (extant and extinct). There are many different ways to represent this data. In this workshop, we'll introduce some foundational concepts for 1) representing text data by engineering features to create vector-based representations of words and documents, and 2) comparing such representations.

Beyond everyday life, NLP has made its way into just about every industry, including medicine, finance, advertising, defense, and gaming.

Prerequisites

This workshop is meant to be accessible to people without a background in programming or advanced math (everything you need to get started you already covered in high school or earlier).

To get the most out of this workshop, you should be comfortable with the basics of programming (ideally in Python) and have a working Docker installation.

If you're already familiar with the basics of programming in Python, you'll be able to follow along with the provided examples (all examples use Python 3.8).

Objectives

By the end of this workshop, you will be able to ...

represent words and documents as vectors
generate character and word $n$-grams
compare vectors to find similar items
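To make these objectives concrete, here is a minimal sketch in plain Python. It is only one possible approach, not necessarily the one used in the tutorials: the helper names (`char_ngrams`, `word_ngrams`, `vectorize`, `cosine_similarity`) are hypothetical, and representing a word by counts of its character bigrams is an assumption chosen for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n):
    """Return every overlapping character n-gram in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    """Return every overlapping word n-gram (as a tuple) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(word, n=2):
    """Represent a word as a sparse vector of character n-gram counts."""
    return Counter(char_ngrams(word, n))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty vectors
    return dot / (norm_a * norm_b)


print(char_ngrams("turtle", 2))  # ['tu', 'ur', 'rt', 'tl', 'le']
print(word_ngrams("i like turtles".split(), 2))  # [('i', 'like'), ('like', 'turtles')]

# Words that share many bigrams score higher than words that share none.
sim_close = cosine_similarity(vectorize("turtle"), vectorize("turtles"))
sim_far = cosine_similarity(vectorize("turtle"), vectorize("pizza"))
print(sim_close, sim_far)
```

The same `cosine_similarity` function works for document vectors if you count word $n$-grams instead of character $n$-grams.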

Location and Times

This workshop is completely virtual.

Author

Hi! My name is Gus Hahn-Powell.

I'm a computational linguist interested in ways we can use natural language processing to accelerate scientific discovery by mining millions of scholarly documents.

Name	Gus Hahn-Powell
Email	`hahnpowell AT arizona DOT edu`
Appointments	https://parsertongue.org/availability

Tutorials

Note: complete these tutorials in the order listed:

Supplemental

Ready to speed up your comparisons? Start by familiarizing yourself with the NumPy library for fast numerical computing:
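As a taste of what NumPy offers, the sketch below compares one query vector against every row of a matrix in a single vectorized step instead of a Python loop. The vocabulary and vector values are made up for illustration, and `cosine_to_all` is a hypothetical helper name.

```python
import numpy as np

# Made-up toy vectors, one row per item; real rows would come from your own features.
vocab = ["dog", "puppy", "pizza"]
matrix = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
])


def cosine_to_all(query, matrix):
    """Cosine similarity between a query vector and every row of matrix."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Guard against division by zero for all-zero vectors.
    norms = np.where(norms == 0, 1.0, norms)
    return (matrix @ query) / norms


scores = cosine_to_all(matrix[0], matrix)
# Sort indices from most to least similar.
ranking = [vocab[i] for i in np.argsort(-scores)]
print(scores)
print(ranking)
```

Note that "puppy" points in the same direction as "dog" (it is the same vector scaled by 2), so cosine similarity treats the two as identical.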

Practice

Once you've completed the above tutorials, review and practice what we've covered.

Word embeddings

While this tutorial looked at ways of engineering features for word and document vectors, it's also possible to learn representations.¹

Apply what you've learned in this workshop to explore a set of pre-trained word embeddings.

Load the word embeddings into a dictionary that maps each word to a numpy array.
Using cosine similarity as your metric, what are the 10 most-similar words to "dog"?
Average the embeddings for "pizza" and "pineapple". Using cosine similarity as your metric, what are the 10 most-similar words to this averaged embedding?
For each of the following sentences, sum the embeddings for each word in the sentence:

That bodacious green teen with the nunchaku is a ninja turtle.
I glimpsed a stealthy shinobi slip silently through the shadows.
Turtle soup is not something you want to slurp in the company of friendly reptiles.
Do you put pineapple on your pizza?
Finish your fettuccine alfredo before taking a bite of your cannoli.

Using cosine similarity as your metric, which sentence is most similar to "I like turtles"?

Repeat your experiment by averaging the embeddings, rather than summing. What do you notice?
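One possible starting point for these exercises is sketched below. It makes several assumptions: the embeddings are stored in a plain-text format with one `word v1 v2 ...` entry per line (the actual file format may differ), the vectors shown are made-up 3-dimensional toys rather than real pre-trained embeddings, sentences are tokenized by lowercasing and splitting on whitespace, and all function names are hypothetical.

```python
import io
import numpy as np

# Made-up embeddings in an assumed "word v1 v2 v3" text format.
TOY_FILE = """\
turtle 0.9 0.1 0.0
turtles 0.8 0.2 0.0
ninja 0.5 0.5 0.1
pizza 0.0 0.1 0.9
pineapple 0.1 0.0 0.8
like 0.3 0.3 0.3
i 0.2 0.2 0.2
"""


def load_embeddings(lines):
    """Map each word to a NumPy array, reading one entry per line."""
    embeddings = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        embeddings[parts[0]] = np.array([float(x) for x in parts[1:]])
    return embeddings


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def sentence_vector(sentence, embeddings, combine=np.sum):
    """Combine (sum by default) the embeddings of the known words in a sentence."""
    tokens = sentence.lower().split()  # naive tokenization; punctuation is not stripped
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known words in sentence")
    return combine(vectors, axis=0)


def most_similar(query, embeddings, k=10):
    """Return the k (word, score) pairs with the highest cosine similarity."""
    scored = [(word, cosine(query, vec)) for word, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


embeddings = load_embeddings(io.StringIO(TOY_FILE))
neighbors = most_similar(embeddings["turtle"], embeddings, k=3)
averaged = np.mean([embeddings["pizza"], embeddings["pineapple"]], axis=0)
query = sentence_vector("I like turtles", embeddings, combine=np.sum)
print(neighbors)
print(most_similar(averaged, embeddings, k=3))
```

With a real embeddings file, you would pass an open file handle to `load_embeddings` in place of the `io.StringIO` object. To switch from summing to averaging, pass `combine=np.mean` to `sentence_vector`. Keep in mind that a word is always its own nearest neighbor, so you may want to drop the query word from the results.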

Next steps

Interested in learning more? Are you a UA student? Consider taking LING 529 (HLT I), LING 539 (Intro to Statistical NLP), and/or LING 582 (Advanced Statistical NLP).

All three are offered as 7.5-week asynchronous online courses as part of our online MS in Human Language Technology.

Footnotes

¹ If you're curious to understand how and why you might want to learn such representations, consider taking LING 539.
