Welcome to Advanced Statistical Natural Language Processing!
This fully online offering of the course has a compressed format with a full semester's worth of content delivered asynchronously over 7.5 weeks.
From the course catalog:
This course focuses on statistical approaches to pattern classification and applications of natural language processing to real-world problems. Class Notes: Course Requisites: LING 539.
The main programming language used in the course will be Python (3.11).
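To confirm which interpreter you are running, a minimal sketch along these lines may help. It assumes only the 3.11 version stated above; the names `EXPECTED` and `version_matches` are hypothetical and used purely for illustration.

```python
import sys

# Assumption: the course targets Python 3.11, as stated above.
EXPECTED = (3, 11)  # hypothetical constant name used only in this sketch


def version_matches(version_info=sys.version_info, expected=EXPECTED):
    """Return True when the major.minor pair equals the expected pair."""
    return tuple(version_info[:2]) == expected


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if version_matches() else "differs from 3.11"
    print(f"Python {found}: {status}")
```

The check compares only the major and minor numbers, so any 3.11.x patch release passes.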
In this course, we will . . .
This is an online, asynchronous course. Content is released in a staggered fashion via the course home page.
Hi! My name is Gus Hahn-Powell. I'll be your instructor for this course.
I'm a computational linguist interested in ways we can use natural language processing to accelerate scientific discovery by mining millions of scholarly documents.
Name | Gus Hahn-Powell |
---|---|
Email | hahnpowell AT arizona DOT edu |
Office Hours | See our course page on D2L |
Appointments | https://parsertongue.org/availability/ |
This course is meant to follow LING 539. If you have not taken LING 539, you are probably not eligible to take this course. Please contact me to discuss any special circumstances.
While helpful, a background in linguistics or advanced math beyond what was covered in LING 539 is not required for this course. We'll cover the necessary pieces in class.
In order to take this class, you must be comfortable programming (2 semesters' worth of programming coursework or equivalent experience). Familiarity with Python, including using and defining classes, is ideal. If you've never used Python, you'll need to learn the basics to complete the programming assignments (some resources and basic exercises/tutorials will be provided).
Python is an open-source programming language that is widely used in both academia and industry. It has some very useful and popular libraries for linear algebra and machine learning (NumPy, TensorFlow, PyTorch, MXNet, etc.). We'll learn how to use some of these in this course.
A great deal of time can be wasted trying to install and configure software. Most of these issues relate to differences in the starting configuration of users (operating systems, existing software installations, etc.).
While the focus of this class is on statistical natural language processing, we elect to take a bit of time at the beginning of the course to walk everyone through setting up a uniform development environment known to be compatible with all assignments and tutorials used in this course. Adopting a uniform development environment helps us provide better technical support and a single set of up-to-date instructions.
We feel it is important that the development environment be something freely available to everyone. There are many free and open source distributions of Linux that are lightweight and run on a variety of hardware configurations. We'll be using one of these distributions.
As an added bonus, familiarizing yourself with the technologies we'll use in this course for things like version control and reproducibility may make you more productive in the future.
Normally, we should have assignments graded and posted within a week.
Make sure that you don't wait until the last minute to start your work. Late work will not be accepted.
You are actually required to use an (open source and open weight) LLM for at least one assignment in this course. In any case where an LLM is used, you must prominently cite your usage (model, model version, prompts, etc.).
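One way to keep track of the details mentioned above (model, model version, prompts) is a small record type. This is only a sketch: the `LLMUsage` class, its field names, and the output layout are hypothetical and are not a required course format, and the values shown are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class LLMUsage:
    """One record of LLM use (hypothetical structure, not a required format)."""

    model: str
    model_version: str
    prompts: list[str] = field(default_factory=list)

    def to_citation(self) -> str:
        """Render the record as plain text for a README or report."""
        lines = [
            f"LLM used: {self.model} (version: {self.model_version})",
            "Prompts:",
        ]
        # Number the prompts starting from 1 so they are easy to reference.
        lines += [f"  {i}. {p}" for i, p in enumerate(self.prompts, start=1)]
        return "\n".join(lines)


# Placeholder values only; substitute the model and prompts actually used.
usage = LLMUsage(
    model="example-open-model",
    model_version="1.0",
    prompts=["Explain what a bigram language model is."],
)
print(usage.to_citation())
```

Recording prompts as you work tends to be easier than reconstructing them at submission time.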
If you have questions about the course, I'd prefer you share them through the course forum. If it's something you don't want others to see, you can send me a direct message.
If the forum is ever down and you need to reach me about the course, you can send me an email with [LING 582] in the subject line (but think of email as a last resort).
For planning purposes, please note that your instructor responds to posted questions Monday & Friday between 9AM–5PM (MST). Typically, you can expect a response within a day.
From the Content link on the course D2L page, you will find the units covering the material that you will complete each week of the course. You'll start with Unit 0 where you'll set up your development environment for the course. Due dates will be listed in each unit and in the course calendar. To keep things predictable, all units have the same general structure.
Each unit has links to lessons, videos, readings, and assignments/activities. Be sure to check out the Unit Overview link for each new unit.
The course calendar (accessible from the nav bar) provides a good overview of all due dates. To help avoid missing important deadlines, I recommend that you enter all due dates in the calendar system (ex. Google Calendar, Microsoft Outlook, etc.) you are most comfortable using and set reminders.
You will be able to access your grades via the D2L Grades tab.
We don't take attendance in this course.
It is a common misconception that online classes are inherently easier than face-to-face classes. In actuality, they are quite different in structure, pacing, and the way that you have to manage your time in order to be successful. While you will be meeting the same outcomes and learning the same material as a face-to-face class, the way that you participate changes. You lose the synchronous element (i.e., sitting in a chair in a room on campus with me and your classmates), but you gain a good deal more asynchronous reading, writing, listening, thinking, and responding.
Note too that this is a 3-unit 7.5-week class. That means that the expectations for content and workload are the same as a 3-unit 15-week class, but in half the time.
The assignments and tutorials are designed to be run using Docker in a Linux or Darwin (macOS) environment. The first unit of the course guides you through configuring your development environment. While it may be possible to run the assignments on Windows or another OS with some modifications, technical support will only be provided for the Linux environment introduced in the first unit. If you decide to use a virtual machine (as demonstrated in the first unit), we recommend you run it on a system with a minimum of 8GB of RAM and 50GB of storage. If you opt to install Linux natively (not required), you can probably get by with half the amount of RAM.
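If you are unsure whether a machine has room for the virtual machine, a rough check of free disk space might look like the sketch below. It covers only the storage recommendation (a portable RAM check is not available in the standard library), it treats the 50GB figure as GiB, and the names `RECOMMENDED_STORAGE_GIB` and `meets_minimum` are hypothetical.

```python
import shutil

GIB = 1024 ** 3
# Assumption: the 50GB storage figure from the text above, treated as GiB.
RECOMMENDED_STORAGE_GIB = 50


def meets_minimum(available_bytes, required_gib):
    """Return True when available_bytes is at least required_gib GiB."""
    return available_bytes >= required_gib * GIB


if __name__ == "__main__":
    # Free space on the drive that holds the current working directory.
    free = shutil.disk_usage(".").free
    if meets_minimum(free, RECOMMENDED_STORAGE_GIB):
        verdict = "meets"
    else:
        verdict = "is below"
    print(
        f"{free / GIB:.1f} GiB free; this {verdict} "
        f"the recommended {RECOMMENDED_STORAGE_GIB} GiB"
    )
```

Run it from the drive where you intend to store the virtual machine image, since free space is reported per drive.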
If you plan to run an LLM locally, you'll likely want to use a machine with 24GB+ of RAM.
To train (deep) neural networks, you'll generally want dedicated access to one or more powerful GPUs. I recommend using a remote environment set up for this purpose. There are many options (AWS, Paperspace, UA HPC, etc.), but the only such environment freely available to all students at the University of Arizona is the UA HPC. While most organizations/employers outside of academia are likely using a major cloud computing provider (ex. AWS, GCP, Azure, etc.) or hosting their own OpenStack environment, this course will focus on computing resources available at no cost. In the interest of equity, I'll provide tutorials and a project template for GPU-accelerated training of a neural network on the UA HPC.