Welcome to Intro to NLP: Representing words and documents as vectors. This is an introductory-level NLP tutorial for Resbaz 2022!
In this short workshop (1-2 hours), we'll look at how to represent words and documents as vectors and compare them. These representations can be used to cluster information or train statistical classifiers for various tasks.
Natural Language Processing (NLP) is an applied field of study at the intersection of linguistics, computer science, and machine learning that examines automated ways of making sense of natural language.
NLP tools are all around us.
| Application | Example |
|---|---|
| Chatbots | |
| Virtual assistants | |
| SPAM filters | |
| Detecting and moderating hate speech | |
| Machine translation | |
| Summarization and simplification | ELI5, create accessible resources for learners, etc. |
| Voice cloning | Read *Harry Potter and the Philosopher's Stone* in the voice of Arnold Schwarzenegger |
| Sentiment analysis | Was that tweet a compliment or a complaint about your product/company? |
| Search | |
NLP is a very broad field involving text, audio (speech), images (handwriting, layout analysis, etc.), and video (ex. signed languages) data for all of the world's languages (extant and extinct). There are many different ways to represent this data. In this workshop, we'll introduce some foundational concepts for 1) representing text data by engineering features to create vector-based representations of words and documents, as well as 2) methods for comparing such representations.
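To make "vector-based representations of words and documents" concrete, here is a minimal sketch of one classic engineered feature: a bag-of-words vector that counts how often each vocabulary word appears in a document. The function name and the toy vocabulary are illustrative, not part of the workshop materials.

```python
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Represent a document as a vector of word counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    # Counter returns 0 for words absent from the document
    return [counts[word] for word in vocabulary]

vocab = ["i", "like", "turtles", "pizza"]
print(bag_of_words("I like turtles", vocab))  # → [1, 1, 1, 0]
```

Every document mapped this way becomes a vector of the same length, which is what makes the comparison methods covered later (such as cosine similarity) possible.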
Aside from our everyday life, NLP has made its way into just about every industry, including medicine, finance, advertising, defense, and gaming.
This workshop is meant to be accessible to people without a background in advanced math (everything mathematical you need, you already covered in high school or earlier).
To get the most out of this workshop, though, you should be comfortable with the basics of programming (ideally in Python) and have a working Docker installation.
If you're already familiar with the basics of programming in Python, you'll be able to follow along with the provided examples (all examples use Python 3.8).
By the end of this workshop, you will be able to ...
This workshop is completely virtual.
Hi! My name is Gus Hahn-Powell.
I'm a computational linguist interested in ways we can use natural language processing to accelerate scientific discovery by mining millions of scholarly documents.
| Name | Gus Hahn-Powell |
|---|---|
| Email | hahnpowell AT arizona DOT edu |
| Appointments | https://calendar.parsertongue.com |
Complete these tutorials in the order listed:
Ready to speed up your comparisons? Start by familiarizing yourself with the NumPy library for fast numerical computing:
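As a taste of why NumPy speeds up vector comparisons, the sketch below computes a dot product and vector norms without any explicit Python loop. The toy vectors are made-up values for illustration.

```python
import numpy as np

# Two toy document vectors (e.g., word counts over a shared vocabulary)
a = np.array([1, 1, 1, 0], dtype=float)
b = np.array([0, 1, 1, 2], dtype=float)

# Operations are vectorized: NumPy loops in optimized C, not in Python
dot_product = np.dot(a, b)                      # 2.0
norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
print(dot_product, norm_a, norm_b)
```

These two primitives (dot products and norms) are exactly what cosine similarity is built from, so getting comfortable with them pays off directly in the exercises.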
Once you've completed the above tutorials, review and practice what we've covered.
While this tutorial looked at ways of engineering features for word and document vectors, it's also possible to learn representations.1
Apply what you've learned in this workshop to explore a set of pre-trained word embeddings.
Load the word embeddings into a dictionary that maps each word to a numpy array.
Using cosine similarity as your metric, what are the 10 most-similar words to "dog"?
Average the embeddings for "pizza" and "pineapple". Using cosine similarity as your metric, what are the 10 most-similar words to this averaged embedding?
For each of the following sentences, sum the embeddings for each word in the sentence:
Using cosine similarity as your metric, what sentence is most similar to I like turtles?
Repeat your experiment by averaging the embeddings, rather than summing. What do you notice?
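The exercises above all reduce to the same few operations; here is a hedged starting point. The loader assumes a GloVe-style plain-text file (`word dim1 dim2 ... dimN`, one word per line); if your pre-trained embeddings use a different format, adjust accordingly. `load_embeddings`, `most_similar`, and the file path are illustrative names, not part of the workshop materials.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def load_embeddings(path):
    """Assumes a whitespace-delimited text file where each line is
    `word dim1 dim2 ... dimN` (GloVe-style)."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.array(values, dtype=float)
    return embeddings

def most_similar(query_vec, embeddings, n=10):
    """Rank the vocabulary by cosine similarity to a query vector."""
    scores = [(w, cosine_similarity(query_vec, v)) for w, v in embeddings.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

# Example usage with a tiny hand-made "embedding" table:
emb = {"a": np.array([1.0, 0.0]), "b": np.array([0.9, 0.1]), "c": np.array([0.0, 1.0])}
print(most_similar(np.array([1.0, 0.0]), emb, n=2))
```

For the sentence exercises, sum (or average) the arrays for each word with `np.sum` / `np.mean` over a list of embeddings, then compare the resulting vectors with `cosine_similarity`.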
Interested in learning more? Are you a UA student? Consider taking LING 529 (HLT I), LING 539 (Intro to Statistical NLP), and/or LING 582 (Advanced Statistical NLP).
All three are offered as 7.5-week asynchronous online courses as part of our online MS in Human Language Technology.