Many Machine Learning courses assume you have access to large, labeled datasets. But what happens when you don't? Real-world applications routinely face severe data constraints: rare diseases have few documented cases, new products lack historical data, endangered species have limited observations, and most of the world's languages have minimal digital resources. How can we build effective ML systems when data is scarce?
This course will introduce state-of-the-art techniques for data-efficient Machine Learning, focusing on three core areas: (1) transfer learning and domain adaptation, (2) learning with limited supervision (semi-supervised, self-supervised, and active learning), and (3) few-shot learning and data augmentation strategies. Students will engage with recent research literature and gain hands-on experience applying these methods across diverse domains.
Assessment is based on a substantial term project where students implement and evaluate data-efficient techniques on a problem of their choosing, along with smaller assignments throughout the semester.
| Days | Time | Location |
|---|---|---|
| Tuesday and Thursday | 9:40 - 10:55 AM | Meliora 218 |
| Role | Name | Office | Office Hours |
|---|---|---|---|
| Instructor | C.M. Downey | Lattimore 507 | By appointment |
There is no required textbook for this course. All readings will be research papers or segments of online textbooks, which will be shared via the course discussion platform.
Class attendance and participation are expected and count towards your grade; I will keep track of attendance. You may miss up to two sessions for any reason, without contacting me and without penalty. These excused absences may be used for travel, illness, catching up in other courses, and so on. Absences beyond these two will count against your final attendance grade, except for the important obligations listed below.
This course has a seminar component. Each Tuesday starting in Week 3, students will lead discussions on papers related to the previous Thursday's lecture topic. This format gives you the weekend to read and prepare after hearing the lecture.
Each discussion session features two presenters, each leading approximately 30-35 minutes of class time. The exact number of presentations per student will depend on final enrollment and will be announced after the add/drop period. With typical enrollment, expect to present 2-3 times during the semester.
You will choose a research paper related to the week's topic. Papers must be approved at least one week before your presentation, and shared with the class by the Friday before so classmates can read it. You are encouraged to select papers from your own area of interest—this is an opportunity to explore how data-efficient ML applies to problems you care about.
Your presentation should:
Presentations will be graded on:
After week 1, you will submit a form ranking your topic preferences. I will assign slots algorithmically to maximize preference satisfaction, with priority in later rounds going to students who received lower-ranked preferences earlier. Once assignments are posted, you may trade slots with another student if both parties email me to confirm.
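To make the round-based idea concrete, here is a rough sketch in Python. The function name, data shapes, and the "unluckiness" tie-breaker are my own illustration, not the actual script used for assignments:

```python
def assign_slots(prefs, slots_per_student):
    """Hypothetical sketch of preference-based slot assignment.

    prefs: {student: [slot ids, ranked best-first]}
    Assigns one slot per student per round; students who received
    lower-ranked preferences in earlier rounds pick first later.
    """
    assignments = {s: [] for s in prefs}
    unluckiness = {s: 0 for s in prefs}  # sum of preference ranks received
    taken = set()
    for _ in range(slots_per_student):
        # Unluckiest students choose first this round.
        order = sorted(prefs, key=lambda s: -unluckiness[s])
        for student in order:
            for rank, slot in enumerate(prefs[student]):
                if slot not in taken:
                    assignments[student].append(slot)
                    taken.add(slot)
                    unluckiness[student] += rank
                    break
    return assignments
```

For example, if two students submit identical rankings, one of them receives a lower-ranked slot in the first round, and that student then picks first in the second round.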
You will complete a substantial term project applying data-efficient ML techniques from this course to an application area of your choice. Projects may be completed individually or in small groups (2-3 students). After the interest survey (M1), I will suggest potential groupings based on shared interests, but group formation is ultimately up to you. You are encouraged to choose an application domain you care about—whether that's NLP, medical imaging, climate science, robotics, or something else entirely.
Your project must pose and test a scientific hypothesis: a specific, falsifiable prediction that your experiments can support or refute. For example: "Active learning will reduce annotation requirements by at least 40% compared to random sampling for task X," or "A model pre-trained on domain Y will outperform general pre-training when fine-tuned with limited labeled data from domain Y."
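As an illustration of what testing such a hypothesis involves, here is a deliberately toy sketch of the first example: pool-based active learning with uncertainty-style sampling versus random sampling, under a fixed annotation budget. The 1-D data, the single-threshold "model", and the simulated annotator are all stand-ins of my own; a real project would substitute its own task, model, and labeling process:

```python
import random

random.seed(0)

# Toy task: inputs in [0, 1]; the true label is 1 when x > 0.6.
pool = [random.random() for _ in range(500)]

def oracle(x):
    """Simulated annotator (each call represents one paid label)."""
    return int(x > 0.6)

def fit_threshold(xs, ys):
    """One-parameter 'model': the training-optimal decision threshold."""
    best_t, best_acc = 0.5, -1.0
    for t in sorted(xs):
        acc = sum(int(x > t) == y for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def run(strategy, budget=30, seed_size=4):
    """Spend the same label budget with different query strategies."""
    labeled = pool[:seed_size]
    unlabeled = pool[seed_size:]
    for _ in range(budget - seed_size):
        ys = [oracle(x) for x in labeled]
        if strategy == "uncertainty":
            # Query nearest the region the model is least sure about:
            # between the largest known 0 and the smallest known 1.
            lo = max([x for x, y in zip(labeled, ys) if y == 0], default=0.0)
            hi = min([x for x, y in zip(labeled, ys) if y == 1], default=1.0)
            query = min(unlabeled, key=lambda x: abs(x - (lo + hi) / 2))
        else:
            query = random.choice(unlabeled)
        unlabeled.remove(query)
        labeled.append(query)
    t = fit_threshold(labeled, [oracle(x) for x in labeled])
    test = [i / 1000 for i in range(1000)]
    return sum(int(x > t) == oracle(x) for x in test) / len(test)

print("random sampling accuracy:     ", run("random"))
print("uncertainty sampling accuracy:", run("uncertainty"))
```

The point of the harness is the comparison itself: both strategies get an identical budget, so the resulting accuracies (or, symmetrically, the budgets needed to reach a target accuracy) directly support or refute the stated hypothesis.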
Deliverables include a final presentation to the class, a GitHub repository with well-documented and reproducible code, and a writeup in the style of a research paper. Progress will be tracked through milestones spaced throughout the semester (see schedule below for dates and details).
These are meant to illustrate the form of a valid project, not to prescribe specific topics:
All deadlines are in Eastern Time. Work should be submitted by 11:00pm the day it is due. Late work incurs the following penalties:
Extensions (without penalty) may be offered if they are requested within a reasonable amount of time (relative to the reason for the extension) before the work is due. Please don't hesitate to ask for an extension if you need one.
Students will not be penalized for civic, ethnic, family, or religious obligations, or university service. You will have a chance to make up missed work for these reasons. Please inform me in advance when possible.
All assignments and activities associated with this course must be performed in accordance with the University of Rochester's Academic Honesty Policy. More information is available here. Please note: The use of Generative AI to produce discussion presentations, project milestone reports, or the final writeup is not allowed. Generative AI is allowed for programming work on the Term Project only.
This policy operates on the honor system. The goal of this course is for you to develop real expertise that you'll need for research, job interviews, and your career. If you outsource your thinking to AI, you mostly shortchange yourself. That said, there's a difference between having AI produce your work and using AI as a tool to help you learn. Using AI to clarify a concept, explore an idea, or deepen your understanding is fine. Using it to generate text or analysis that you then submit as your own is not.
| Date | Topics | Readings | Events |
|---|---|---|---|
| Jan 20 | Introduction: LeCun's Cake, Syllabus | | |
| Jan 22 | Supervised Learning Review; Model Generalization | | |
| Jan 27 | Inductive Bias | | |
| Jan 29 | Computing cluster setup; Discussion format and expectations | | |
| Feb 3 | Unsupervised Learning Part 1: Clustering and MDL | | Project M1 due (Interest survey) |
| Feb 5 | Unsupervised Learning Part 2: Representation Learning | | |
| Feb 10 | Student-led discussion: Unsupervised learning (first discussion day) | | |
| Feb 12 | Self-supervised learning: predicting the missing piece | | Project M2 due (Two proposals) |
| Feb 17 | Student-led discussion: Self-supervised learning | | |
| Feb 19 | Semi-supervised learning | | |
| Feb 24 | Student-led discussion: Semi-supervised learning | | |
| Feb 26 | Active learning | | Project M3 due (Abstract + plan) |
| Mar 3 | Student-led discussion: Active learning | | |
| Mar 5 | Weak supervision | | |
| Mar 7-15 | Spring Break: no class | | |
| Mar 17 | Student-led discussion: Weak supervision | | |
| Mar 19 | Transfer learning | | |
| Mar 24 | Student-led discussion: Transfer learning | | Project M4 due (Progress + repo) |
| Mar 26 | Domain adaptation | | |
| Mar 31 | Student-led discussion: Domain adaptation | | |
| Apr 2 | Few-shot learning | | |
| Apr 7 | Student-led discussion: Few-shot learning | | Code walkthroughs (Schedule TBD) |
| Apr 9 | Meta-learning | | |
| Apr 14 | Student-led discussion: Meta-learning | | Project M5 due (Final progress) |
| Apr 16 | Data augmentation and synthetic data | | |
| Apr 21 | Student-led discussion: Augmentation / Human-in-the-loop | | |
| Apr 23 | Human-in-the-loop ML | | |
| Apr 28 | Project presentations | | |
| Apr 30 | Project presentations | | Writeup due (Finals week) |