Course Description

This course addresses linguistic research questions through data science techniques. The course will focus on developing skills to (i) acquire and process a variety of language data, from established corpora to data "in the wild" from the internet, and (ii) investigate language use through exploratory data analysis techniques and inferential statistics. This course will introduce a range of tools for working with linguistic data, such as Unix commands, Python, NLTK, NumPy, Pandas, R, and ggplot. A significant portion of the course will be devoted to applying these tools and skills in hands-on projects.

Days | Time | Venue
Monday and Wednesday | 9:00-10:15am | In-person

Teaching Staff

Role | Name | Office Hours
Instructor | C.M. Downey | Thursdays 10-11am

"Required" Textbooks

Required readings are included in the course schedule below. These readings will be drawn exclusively from resources that are either freely available online (e.g. as a PDF or website) or accessible online through the UR library. In brackets are the abbreviations we will use to refer to these resources in the schedule.

Policies

Homework

Students will complete 6-8 homework assignments, each consisting of both written questions and small programming tasks. All homework will be submitted via Blackboard. Homework will be worth 40% of the final grade.

Projects

Students will complete one midterm project and one final project. The midterm project will be worth 20% and the final project 30% of the final grade. These projects will be larger and more open-ended than the homework assignments, prompting students to apply data analysis techniques to real-world language data and to write up the results.

Deadlines and late work

All deadlines and meeting times for this class are in "Eastern Time". Please note: on Sunday March 9, this will change from Eastern Standard Time (EST/UTC-5) to Eastern Daylight Time (EDT/UTC-4). All work should be submitted by 11:00pm the day it is due. Work that is received late will incur the following penalties:

  • Up to 1 hour late: 5%
  • Up to 24 hours late: 10%
  • Up to 48 hours late: 20%
  • Later than 48 hours: not graded (0 for the assignment)
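
As an illustration only, the penalty tiers above can be sketched in Python. This sketch assumes that each "up to" boundary is inclusive and that the penalty is expressed as a percentage of the assignment's value; the function name `late_penalty_percent` and the example dates are hypothetical and are not part of the course policy.

```python
from datetime import datetime


def late_penalty_percent(deadline: datetime, submitted: datetime) -> int:
    """Return the late penalty as a percentage (100 means the work is not graded).

    Assumption: each "up to" boundary in the policy is treated as inclusive.
    """
    hours_late = (submitted - deadline).total_seconds() / 3600
    if hours_late <= 0:
        return 0      # on time or early
    if hours_late <= 1:
        return 5      # up to 1 hour late
    if hours_late <= 24:
        return 10     # up to 24 hours late
    if hours_late <= 48:
        return 20     # up to 48 hours late
    return 100        # later than 48 hours: not graded


# Hypothetical example: work due at 11:00pm, submitted 30 minutes late.
due = datetime(2025, 2, 12, 23, 0)
print(late_penalty_percent(due, datetime(2025, 2, 12, 23, 30)))
```

Both timestamps are assumed to be in the same time zone (Eastern Time for this course).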

Extensions (without penalty) may be offered if they are requested within a reasonable amount of time (relative to the reason for the extension) before the work is due. Please don't hesitate to ask for an extension if you need one.

Attendance and participation

Because of the hands-on nature of this class, both attendance and participation are important. I will keep track of class attendance. Students are allowed to be absent from up to four sessions for any reason, without needing to contact me, and without penalty. These excused absences may be used for travel, illness, catching up with other courses, etc. However, unexcused absences beyond these four sessions will count against the student's final attendance grade, with the exception of important obligations listed below. Attendance will be worth 5% of the final grade.

Much of class time will be devoted to live demonstrations of software tools, which students are encouraged to follow along with, as well as "workshop" days during which students will be given time to work on their projects and receive live feedback from the instructor. Because of this, students are expected to bring a laptop or other device to each class, on which they can develop and test data-intensive software. Please contact me as soon as possible if you do not have an appropriate device. Participation will be worth 5% of the final grade.

Final grading

  • 40%: homework assignments
  • 30%: final project
  • 20%: midterm project
  • 5%: attendance
  • 5%: participation
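
For illustration, the weighting above amounts to a weighted sum of component scores. The sketch below assumes each component is scored on a 0-100 scale; the names `WEIGHTS` and `final_grade` and the example scores are hypothetical.

```python
# Component weights taken from the grading breakdown above.
WEIGHTS = {
    "homework": 0.40,
    "final_project": 0.30,
    "midterm_project": 0.20,
    "attendance": 0.05,
    "participation": 0.05,
}


def final_grade(scores: dict[str, float]) -> float:
    """Combine component scores (assumed to be on a 0-100 scale) into one grade."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


# Hypothetical scores, for illustration only.
example = {
    "homework": 90,
    "final_project": 80,
    "midterm_project": 85,
    "attendance": 100,
    "participation": 100,
}
print(round(final_grade(example), 2))
```

With these example scores the components contribute 36 + 24 + 17 + 5 + 5, for a final grade of 87.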

Exceptions

Students will not be penalized because of important civic, ethnic, family or religious obligations, or university service. You will have a chance, whenever feasible, to make up within a reasonable time any assignment that is missed for these reasons. Absences for these reasons will count as excused for the sake of the participation grade. But it is your job to inform me of any expected missed work in advance, as soon as possible.

Academic honesty

All assignments and activities associated with this course must be performed in accordance with the University of Rochester's Academic Honesty Policy. More information is available here. Please note: The use of Generative AI to produce any part of written/essay (prose) assignments is not allowed. Students may use Generative AI to assist in programming only. However, students are fully responsible for the success or failure of their code.

Schedule


Date | Topics + Slides | Readings | Events
Jan 22 | Introduction / Overview | Finish setup guide before Friday (Windows, Mac) |
Jan 24 (Friday) | Computer setup; IDE; Shell commands | |
Jan 27 | Introducing Python | NLTK Chapter 1: Sections 1.1-1.2, 1.4, 2, 4 (ignore book's instructions for starting up Python) |
Jan 29 | More Python; Anaconda & Python packages | |
Feb 3 | Regular Expressions | SLP Chapter 2: Sections 2.0-2.3 (i.e. read through 2.3, stopping before 2.4) |
Feb 5 | Advanced RegEx (Substitution, Capture Groups) | | hw1 released (Zip file; due 2/12)
Feb 10 | class cancelled | |
Feb 12 | NLTK (Corpus methods) | NLTK Chapter 2 | hw1 due
Feb 17 | More NLTK | NLTK Chapter 3: 3.3, 3.5-3.7; NLTK Chapter 5: 1-2, 4-5 | hw2 released (Markdown file)
Feb 19 | Git & GitHub | |
Feb 24 | Review Git; Common data files | | hw2 due
Feb 26 | Linguistic data annotation | |
Mar 3-5 | Midterm project workshop | | Midterm project due March 7
Mar 10-12 | Spring Break: no class | |
Mar 17 | Basics of R | | hw3 released on Blackboard (due 3/23)
Mar 19 | Descriptive statistics | Learning Statistics with R: 3.2-3.9 (skim), 3.10, 4.5.3-4.5.4, 2.2, 4.6-4.9 |
Mar 24 | Correlation | LSR Ch. 5, 9.0-9.3 |
Mar 26 | Probability distributions | LSR Ch. 9.4-9.7, 10.0-10.3 | hw4 released (Markdown file)
Mar 31 | Hypothesis testing | LSR Ch. 11.0-11.8 |
Apr 2 | t-tests | LSR Ch. 13.0-13.4 | hw4 due
Apr 7 | ANOVA, Linear regression | LSR Ch. 14.0-14.3, 15.0-15.8 |
Apr 9 | Data visualization with ggplot | |
Apr 14 | Interactions, polynomial regression, and mixed-effects | |
Apr 16 | Bonus topic: word vectors | |
Apr 21-23 | Final project workshop | | TBD
Apr 28-30 | Final project presentations; Course wrap-up | | TBD