How do you go from raw linguistic data to a scientific conclusion? This course bridges that gap. Students will work hands-on with the data formats linguists actually encounter—corpora, treebanks, speech recordings, annotation schemes—and learn to process, explore, and analyze them. Topics include text processing with Python and NLTK, version control with Git, and statistical analysis with R. The course culminates in a term project where students investigate a linguistic research question using real data.
Basic experience with Python programming is recommended. Students without prior Python experience should be prepared to do additional work early in the semester to get up to speed.
| Days | Time | Venue |
|---|---|---|
| Monday and Wednesday | 9:00-10:15am | In-person |
| Role | Name | Office Hours |
|---|---|---|
| Instructor | C.M. Downey | By appointment |
Required readings are listed in the course schedule below. They are drawn exclusively from resources that are either freely available online (i.e. as a PDF or website) or accessible online through the UR library. The abbreviations used to refer to these resources in the schedule are given in brackets.
Students will complete several homework assignments, each comprising both written questions and programming problems. All homework will be submitted via Blackboard. Homework is worth 20% of the final grade.
Students will complete a midterm project involving a structured data processing task. This project will have precise specifications and is designed to give students hands-on experience working with linguistic data formats covered in the first half of the course. The midterm project is worth 20% of the final grade.
Students will complete a substantial term project applying data science techniques from this course to a linguistic research question of their choosing. Projects may be completed individually or in pairs.
Your project should investigate a research question about language using real data. This could involve exploring patterns in a corpus, analyzing phonetic measurements, comparing annotation schemes, or testing a hypothesis about language use. The key is that you're using data to answer a question—not just demonstrating a technique.
Deliverables include a final presentation to the class and a writeup describing your research question, data, methods, and findings. Progress will be tracked through milestones throughout the semester (see schedule for dates; detailed requirements for each milestone will be linked there).
These illustrate the form of a valid project, not prescribed topics:
The term project is worth 40% of the final grade.
All deadlines and meeting times for this class are in Eastern Time. Please note: on Sunday, March 8, this will change from Eastern Standard Time (EST/UTC-5) to Eastern Daylight Time (EDT/UTC-4). All work should be submitted by 11:00pm on the day it is due. Work that is received late will incur the following penalties:
Extensions (without penalty) may be offered if they are requested within a reasonable amount of time (relative to the reason for the extension) before the work is due. Please don't hesitate to ask for an extension if you need one.
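For students who work across time zones, the deadline convention above (11:00pm Eastern, with the offset changing on March 8) can be sketched in Python. This is only an illustration under stated assumptions: the UTC offsets are the ones given in this syllabus, the year 2026 is an assumption (the syllabus does not state a year; 2026 is one in which March 8 falls on a Sunday), and the names `deadline` and `deadline_utc` are hypothetical helpers, not part of any course tooling.

```python
from datetime import date, datetime, time, timedelta, timezone

# UTC offsets as stated in the syllabus.
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")

# Assumption: the year is 2026 (not stated in the syllabus).
SWITCH_DATE = date(2026, 3, 8)


def deadline(due: date) -> datetime:
    """Return the 11:00pm Eastern deadline on `due` as a timezone-aware datetime."""
    # The clock change happens early on the switch date, so 11:00pm
    # on that date is already in daylight time.
    tz = EDT if due >= SWITCH_DATE else EST
    return datetime.combine(due, time(23, 0), tzinfo=tz)


def deadline_utc(due: date) -> datetime:
    """Return the same deadline expressed in UTC."""
    return deadline(due).astimezone(timezone.utc)


if __name__ == "__main__":
    print(deadline_utc(date(2026, 2, 9)))   # 2026-02-10 04:00:00+00:00
    print(deadline_utc(date(2026, 3, 30)))  # 2026-03-31 03:00:00+00:00
```

Fixed offsets are used here so the sketch runs without a time zone database. For general use, the standard-library `zoneinfo` module with the `America/New_York` zone would handle the switch automatically.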
Because of the hands-on nature of this class, attendance is important. I will keep track of class attendance. Students are allowed to be absent from up to two sessions for any reason, without needing to contact me, and without penalty. These excused absences may be used for travel, illness, catching up with other courses, etc. However, unexcused absences beyond these two sessions will count against the student's final attendance grade, with the exception of important obligations listed below. Attendance will be worth 10% of the final grade.
Much of class time will be devoted to demonstrations and hands-on exercises, as well as workshop days during which students will work on their projects and receive feedback from the instructor. Because of this, students are expected to bring a laptop or other device to each class, on which they can develop and test data-intensive software. Please contact me as soon as possible if you do not have an appropriate device.
Participation will be worth 10% of the final grade, based on:
Students will not be penalized because of important civic, ethnic, family, or religious obligations, or university service. Whenever feasible, you will have a chance to make up, within a reasonable time, any assignment missed for these reasons. Absences for these reasons will count as excused for the sake of the participation grade. However, it is your responsibility to inform me of any expected missed work in advance, as soon as possible.
All assignments and activities associated with this course must be performed in accordance with the University of Rochester's Academic Honesty Policy. More information is available here. Please note: The use of Generative AI to produce written assignments or project reports/writeups is not allowed. Generative AI is allowed for programming only.
This policy operates on the honor system. The goal of this course is for you to develop real expertise that you'll need for research, job interviews, and your career. If you outsource your thinking to AI, you mostly shortchange yourself. That said, there's a difference between having AI produce your work and using AI as a tool to help you learn. Using AI to clarify a concept, explore an idea, or deepen your understanding is fine. Using it to generate text or analysis that you then submit as your own is not.
| Date | Topics + Slides | Readings | Events |
|---|---|---|---|
| Jan 21 | Introduction / Overview | Finish setup guide before Friday (Windows, Mac) | |
| Jan 23 (Friday) | Computer setup, IDE, Shell commands | | |
| Jan 26 | Snow day: no class | | |
| Jan 28 | Python review; Anaconda & packages | | |
| Feb 2 | Regular Expressions | SLP Chapter 2 Sections 2.0-2.3 | hw1 released (due 2/9) |
| Feb 4 | NLTK (Corpus methods) | NLTK Chapter 2 | |
| Feb 9 | More NLTK | NLTK Chapter 3: 3.3, 3.5-3.7; NLTK Chapter 5: 1-2, 4-5 | hw1 due; hw2 released |
| Feb 11 | Git & GitHub | | Project M1 due (Interest survey) |
| Feb 16 | Common data formats | | hw2 due |
| Feb 18 | Annotated text I (treebanks, tagged corpora) | | |
| Feb 23 | Annotated text II (UD, annotation schemes) | | |
| Feb 25 | Speech data I (waveform, sampling, digitization) | | |
| Mar 2 | Speech data II (spectrograms, formants, Praat/ELAN) | | Project M2 due (Two proposals) |
| Mar 4 | Term project round-robin | | Midterm project due March 6 |
| Mar 9-11 | Spring Break: no class | | |
| Mar 16 | Computational corpus analysis | | |
| Mar 18 | Basics of R | | Project M3 due (Abstract + plan); hw3 released (due 3/30) |
| Mar 23 | Descriptive statistics | Learning Statistics with R: 3.2-3.9 (skim), 3.10, 4.5.3-4.5.4, 2.2, 4.6-4.9 | |
| Mar 25 | Correlation | LSR Ch. 5, 9.0-9.3 | |
| Mar 30 | Probability distributions | LSR Ch. 9.4-9.7, 10.0-10.3 | hw3 due; hw4 released |
| Apr 1 | Hypothesis testing | LSR Ch. 11.0-11.8 | Project M4 due (Progress check) |
| Apr 6 | t-tests | LSR Ch. 13.0-13.4 | hw4 due |
| Apr 8 | ANOVA, Linear regression | LSR Ch. 14.0-14.3, 15.0-15.8 | |
| Apr 13 | Data visualization & the Grammar of Graphics | | |
| Apr 15 | Interactions, polynomial regression, and mixed-effects | | |
| Apr 20 | Peer-review workshop | | |
| Apr 22 | Data publishing and open access | | |
| Apr 27-29 | Term project presentations; Course wrap-up | | Writeup due (Finals week) |