Class Preparation

To prepare for each unit, complete the readings and activities noted in the table below.

Unit | Topic | Readings & Resources

Getting started

  • using VirtualBox to install a Linux VM
  • version control with Git and GitHub
  • Linux command line basics
  • containers with Docker
  • Jupyter

Unit tutorials

Tokens and their attributes

  • tokens and selected attributes
    • part of speech tags
    • canonical word forms (lemmas)
    • named entity labels
  • text normalization
  • cross-linguistic differences in defining tokens and their attributes

Tokenization and regular expressions

  • regular expressions (regexes)
  • representing a state machine with a regular expression
  • representing a regular expression as a state machine
  • tokenization using patterns (regexes)
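As a warm-up for this unit, here is a minimal sketch of pattern-based tokenization using Python's standard `re` module. The pattern `TOKEN_PATTERN` and the function name `tokenize` are illustrative assumptions, not the course's reference solution. Real tokenizers handle many more cases (URLs, abbreviations, emoji, and so on).

```python
import re

# Hypothetical pattern (an assumption for illustration), tried left to right:
#   1. numbers with an optional decimal part, e.g. 3.50
#   2. words, optionally joined by internal apostrophes or hyphens, e.g. Don't
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text):
    """Return the list of non-overlapping pattern matches, in order."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Don't panic: it costs 3.50!")
print(tokens)
```

Because alternatives are tried in order, placing the number pattern first keeps `3.50` together as one token rather than splitting it at the period.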

Vector representations of words and documents

  • feature engineering
  • feature vectors
  • feature vocabularies
  • n-grams (character-level and word-level)
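To make these terms concrete, the sketch below extracts word-level and character-level n-grams, builds a feature vocabulary (a mapping from each feature to a column index), and turns one document into a count vector. All function names here are hypothetical and chosen for illustration. Features that do not appear in the vocabulary are simply ignored, which is one common convention among several.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Word-level n-grams as tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Character-level n-grams as substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(docs_features):
    """Assign each distinct feature an index in order of first appearance."""
    vocab = {}
    for features in docs_features:
        for feature in features:
            if feature not in vocab:
                vocab[feature] = len(vocab)
    return vocab


def to_count_vector(features, vocab):
    """Count how often each vocabulary feature occurs; skip unknown features."""
    vector = [0] * len(vocab)
    for feature, count in Counter(features).items():
        if feature in vocab:
            vector[vocab[feature]] = count
    return vector


docs = [["a", "b"], ["b", "c"]]
vocab = build_vocabulary(docs)
print(vocab)
print(to_count_vector(["b", "b", "c", "z"], vocab))
```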

Probability basics

  • the basics of probability
    • joint probabilities
    • conditional probabilities
    • marginal (simple) probabilities
  • likelihood of sequences
    • language models
  • probabilities in log space
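One reason to work in log space: the likelihood of a long sequence is a product of many small probabilities, which can underflow to zero in floating point. Summing log probabilities avoids this. The sketch below uses made-up per-token probabilities purely for illustration, and the function name `sequence_log_prob` is an assumption.

```python
import math


def sequence_log_prob(probs):
    """Sum of natural-log probabilities, i.e. the log of their product."""
    return sum(math.log(p) for p in probs)


# Made-up example: 200 tokens, each with probability 0.01.
probs = [0.01] * 200

# The direct product is too small for a float and underflows.
direct = math.prod(probs)

# The log-space sum stays well within floating-point range.
log_prob = sequence_log_prob(probs)

print(direct)
print(log_prob)
```

Because the logarithm is monotonic, comparing two sequences by log probability gives the same ranking as comparing them by probability.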

Comparing vectors

  • direction and magnitude
  • vector operations
  • vector normalization (L2 norm)
  • distance and similarity metrics
    • dot products
    • Euclidean distance
    • cosine similarity
  • centroids
  • medoids
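The operations listed for this unit can be sketched in plain Python as follows. The function names are illustrative assumptions. Vectors are plain lists of numbers of equal length, and the medoid is taken here to be the member of the set with the smallest total Euclidean distance to the others. Zero vectors are not handled by `cosine_similarity`, since they have no direction.

```python
import math


def dot(u, v):
    """Dot product: sum of component-wise products."""
    return sum(a * b for a, b in zip(u, v))


def l2_norm(v):
    """Magnitude (L2 norm) of a vector."""
    return math.sqrt(dot(v, v))


def normalize(v):
    """Scale a vector to unit length; undefined for the zero vector."""
    norm = l2_norm(v)
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [a / norm for a in v]


def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return dot(u, v) / (l2_norm(u) * l2_norm(v))


def centroid(vectors):
    """Component-wise mean; need not be one of the input vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def medoid(vectors):
    """Input vector with the smallest total distance to all the others."""
    return min(vectors, key=lambda v: sum(euclidean_distance(v, w) for w in vectors))


points = [[0, 0], [2, 0], [4, 6]]
print(centroid(points))
print(medoid(points))
```

Note the contrast in the last two calls: the centroid is an average and may fall between data points, while the medoid is always one of the original vectors.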
