Many Machine Learning courses assume you have access to large, labeled datasets. But what happens when you don't? Real-world applications routinely face severe data constraints: rare diseases have few documented cases, new products lack historical data, endangered species have limited observations, and most of the world's languages have minimal digital resources. How can we build effective ML systems when data is scarce?
This course will introduce state-of-the-art techniques for data-efficient Machine Learning, focusing on three core areas: (1) transfer learning and domain adaptation, (2) learning with limited supervision (semi-supervised, self-supervised, and active learning), and (3) few-shot learning and data augmentation strategies. Students will engage with recent research literature and gain hands-on experience applying these methods across diverse domains.
Assessment is based on a substantial term project where students implement and evaluate data-efficient techniques on a problem of their choosing, along with smaller assignments throughout the semester.
| Days | Time | Location |
|---|---|---|
| Tuesday and Thursday | 9:40 - 10:55 AM | Meliora 218 |
| Role | Name | Office | Office Hours |
|---|---|---|---|
| Instructor | C.M. Downey | Lattimore 507 | By appointment |
There is no required textbook for this course. All readings will be research papers or segments of online textbooks, which will be shared via the course discussion platform.
Class attendance and participation are expected and count towards your grade; I will keep track of attendance. You may miss up to two sessions for any reason, without contacting me and without penalty. These excused absences may be used for travel, illness, catching up in other courses, and so on. Absences beyond these two will count against your final attendance grade, except for the important obligations listed below.
This course has a seminar component. Each Tuesday starting in Week 3, students will lead discussions on papers related to the previous Thursday's lecture topic. This format gives you the weekend to read and prepare after hearing the lecture.
Each discussion session features two presenters, each leading approximately 30-35 minutes of class time. The exact number of presentations per student will depend on final enrollment and will be announced after the add/drop period. With typical enrollment, expect to present 2-3 times during the semester.
You will choose a research paper related to the week's topic. Papers must be approved at least one week before your presentation, and shared with the class by the Friday before so classmates can read it. You are encouraged to select papers from your own area of interest—this is an opportunity to explore how data-efficient ML applies to problems you care about.
Your presentation should:
Presentations will be graded on:
After week 1, you will submit a form ranking your topic preferences. I will assign slots algorithmically to maximize preference satisfaction, with priority in later rounds going to students who received lower-ranked preferences earlier. Once assignments are posted, you may trade slots with another student if both parties email me to confirm.
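To make the round-based idea concrete, here is a rough sketch in Python. The function name, data shapes, and the "unluckiness" tie-breaker are my own illustration, not the actual script used for assignments:

```python
def assign_slots(prefs, slots_per_student):
    """Hypothetical sketch of preference-based slot assignment.

    prefs: {student: [slot ids, ranked best-first]}
    Assigns one slot per student per round; students who received
    lower-ranked preferences in earlier rounds pick first later.
    """
    assignments = {s: [] for s in prefs}
    unluckiness = {s: 0 for s in prefs}  # sum of preference ranks received
    taken = set()
    for _ in range(slots_per_student):
        # Unluckiest students choose first this round.
        order = sorted(prefs, key=lambda s: -unluckiness[s])
        for student in order:
            for rank, slot in enumerate(prefs[student]):
                if slot not in taken:
                    assignments[student].append(slot)
                    taken.add(slot)
                    unluckiness[student] += rank
                    break
    return assignments
```

For example, if two students submit identical rankings, one of them receives a lower-ranked slot in the first round, and that student then picks first in the second round.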
You will complete a substantial term project applying data-efficient ML techniques from this course to an application area of your choice. Projects may be completed individually or in small groups (2-3 students). After the interest survey (M1), I will suggest potential groupings based on shared interests, but group formation is ultimately up to you. You are encouraged to choose an application domain you care about—whether that's NLP, medical imaging, climate science, robotics, or something else entirely.
Your project must pose and test a scientific hypothesis: a specific, falsifiable prediction that your experiments can support or refute. For example: "Active learning will reduce annotation requirements by at least 40% compared to random sampling for task X," or "A model pre-trained on domain Y will outperform general pre-training when fine-tuned with limited labeled data from domain Y."
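As an illustration of what testing such a hypothesis involves, here is a deliberately toy sketch of the first example: pool-based active learning with uncertainty-style sampling versus random sampling, under a fixed annotation budget. The 1-D data, the single-threshold "model", and the simulated annotator are all stand-ins of my own; a real project would substitute its own task, model, and labeling process:

```python
import random

random.seed(0)

# Toy task: inputs in [0, 1]; the true label is 1 when x > 0.6.
pool = [random.random() for _ in range(500)]

def oracle(x):
    """Simulated annotator (each call represents one paid label)."""
    return int(x > 0.6)

def fit_threshold(xs, ys):
    """One-parameter 'model': the training-optimal decision threshold."""
    best_t, best_acc = 0.5, -1.0
    for t in sorted(xs):
        acc = sum(int(x > t) == y for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def run(strategy, budget=30, seed_size=4):
    """Spend the same label budget with different query strategies."""
    labeled = pool[:seed_size]
    unlabeled = pool[seed_size:]
    for _ in range(budget - seed_size):
        ys = [oracle(x) for x in labeled]
        if strategy == "uncertainty":
            # Query nearest the region the model is least sure about:
            # between the largest known 0 and the smallest known 1.
            lo = max([x for x, y in zip(labeled, ys) if y == 0], default=0.0)
            hi = min([x for x, y in zip(labeled, ys) if y == 1], default=1.0)
            query = min(unlabeled, key=lambda x: abs(x - (lo + hi) / 2))
        else:
            query = random.choice(unlabeled)
        unlabeled.remove(query)
        labeled.append(query)
    t = fit_threshold(labeled, [oracle(x) for x in labeled])
    test = [i / 1000 for i in range(1000)]
    return sum(int(x > t) == oracle(x) for x in test) / len(test)

print("random sampling accuracy:     ", run("random"))
print("uncertainty sampling accuracy:", run("uncertainty"))
```

The point of the harness is the comparison itself: both strategies get an identical budget, so the resulting accuracies (or, symmetrically, the budgets needed to reach a target accuracy) directly support or refute the stated hypothesis.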
Deliverables include a final presentation to the class, a GitHub repository with well-documented and reproducible code, and a writeup in the style of a research paper. Progress will be tracked through milestones spaced throughout the semester (see schedule below for dates and details).
These are meant to illustrate the form of a valid project, not to prescribe specific topics:
All deadlines are in Eastern Time. Work should be submitted by 11:00pm the day it is due. Late work incurs the following penalties:
Extensions (without penalty) may be offered if they are requested within a reasonable amount of time (relative to the reason for the extension) before the work is due. Please don't hesitate to ask for an extension if you need one.
Students will not be penalized for civic, ethnic, family, or religious obligations, or university service. You will have a chance to make up missed work for these reasons. Please inform me in advance when possible.
All assignments and activities associated with this course must be performed in accordance with the University of Rochester's Academic Honesty Policy. More information is available here. Please note: The use of Generative AI to produce discussion presentations, project milestone reports, or the final writeup is not allowed. Generative AI is allowed for programming work on the Term Project only.
This policy operates on the honor system. The goal of this course is for you to develop real expertise that you'll need for research, job interviews, and your career. If you outsource your thinking to AI, you mostly shortchange yourself. That said, there's a difference between having AI produce your work and using AI as a tool to help you learn. Using AI to clarify a concept, explore an idea, or deepen your understanding is fine. Using it to generate text or analysis that you then submit as your own is not.
| Date | Topics | Readings | Events |
|---|---|---|---|
| Jan 20 | Introduction: LeCun's Cake, Syllabus | | |
| Jan 22 | Supervised Learning Review; Model Generalization | | |
| Jan 27 | Inductive Bias | | |
| Jan 29 | Computing cluster setup; Discussion format and expectations | | |
| Feb 3 | Unsupervised Learning Part 1: Clustering and MDL | | Project M1 due (Interest survey) |
| Feb 5 | Unsupervised Learning Part 2: Representation Learning | | |
| Feb 10 | Student-led discussion: Unsupervised learning (first discussion day) | | |
| Feb 12 | Self-supervised learning: predicting the missing piece | | Project M2 due (Two proposals) |
| Feb 17 | Student-led discussion: Self-supervised learning | | |
| Feb 19 | Semi-supervised learning | | |
| Feb 24 | Student-led discussion: Semi-supervised learning | | |
| Feb 26 | Active learning | | Project M3 due (Abstract + plan) |
| Mar 3 | Student-led discussion: Active learning | | |
| Mar 5 | Weak supervision | | |
| Mar 7-15 | Spring Break: no class | | |
| Mar 17 | Student-led discussion: Weak supervision | | |
| Mar 19 | Transfer learning | | |
| Mar 24 | Student-led discussion: Transfer learning | | Project M4 due (Progress + repo) |
| Mar 26 | Domain adaptation | | |
| Mar 31 | Student-led discussion: Domain adaptation | | |
| Apr 2 | Few-shot learning | | |
| Apr 7 | Student-led discussion: Few-shot learning | | Code walkthroughs (Schedule TBD) |
| Apr 9 | Meta-learning | | |
| Apr 14 | Student-led discussion: Meta-learning | | Project M5 due (Final progress) |
| Apr 16 | Data augmentation and synthetic data | | |
| Apr 21 | Student-led discussion: Augmentation / Human-in-the-loop | | |
| Apr 23 | Human-in-the-loop ML | | |
| Apr 28 | Project presentations | | |
| Apr 30 | Project presentations | | Writeup due (Finals week) |