Unit | Topic | Readings & Resources |
Getting started | - using VirtualBox to install a Linux VM
- version control with Git and GitHub
- Linux command line basics
- containers with Docker
- Jupyter
| Unit tutorials |
Tokens and their attributes | - tokens and selected attributes
- part of speech tags
- canonical word forms (lemmas)
- named entity labels
- text normalization
- cross-linguistic differences in defining tokens and their attributes
| |
Tokenization and regular expressions | - regular expressions (regexes)
- representing a state machine with a regular expression
- representing a regular expression as a state machine
- tokenization using patterns (regexes)
| |
Vector representations of words and documents | - feature engineering
- feature vectors
- feature vocabularies
- n-grams (character-level and word-level)
| |
Probability basics | - the basics of probability
- joint probabilities
- conditional probabilities
- marginal (simple) probabilities
- likelihood of sequences
- probabilities in log space
| |
Comparing vectors | - direction and magnitude
- vector operations
- vector normalization (L2 norm)
- distance and similarity metrics
- dot products
- Euclidean distance
- cosine similarity
- centroids
- medoids
| |