Class Preparation

To prepare for each unit, complete the readings and activities noted in the table below.

Unit	Topic	Readings & Resources
Getting started	using VirtualBox to install a Linux VM; version control with Git and GitHub; Linux command line basics; containers with Docker; Jupyter	Unit tutorials
Tokens and their attributes	tokens and selected attributes; part-of-speech tags; canonical word forms (lemmas); named entity labels; text normalization; cross-linguistic differences in defining tokens and their attributes
Tokenization and regular expressions	regular expressions (regexes); representing a state machine with a regular expression; representing a regular expression as a state machine; tokenization using patterns (regexes)
Vector representations of words and documents	feature engineering; feature vectors; feature vocabularies; n-grams (character-level and word-level)
Probability basics	the basics of probability; joint probabilities; conditional probabilities; marginal (simple) probabilities; likelihood of sequences; language models; probabilities in log space
Comparing vectors	direction and magnitude; vector operations; vector normalization (L2 norm); distance and similarity metrics; dot products; Euclidean distance; cosine similarity; centroids; medoids