C.M. Downey
Assistant Professor: Linguistics and Data Science, University of Rochester
About Me
My research develops methods to improve the efficacy of Natural Language Processing (NLP) tools for under-resourced languages (those lacking the abundant data needed to train modern machine learning models). The most common approach to building machine learning systems is to train huge neural networks on high-resource languages like English and Chinese, for which vast amounts of textual data (i.e., hundreds of gigabytes) are available. Such techniques are inapplicable to the majority of the world's languages, which lack the requisite large text datasets. This methodological gap undermines the potentially vital role these systems can play in creating tools such as assisted completion and keyboard auto-correct features, automatic speech recognition, and machine translation services. Developing such tools helps ensure that minority and endangered languages can thrive in the digital era. To address this gap, I specialize in machine learning techniques that are applicable to under-resourced languages, with a strong emphasis on:
- unsupervised/self-supervised learning, enabling training with raw text or much smaller amounts of specialized data than supervised paradigms
- multilingual modeling, allowing language data to be pooled by training on more than one language at once
- transfer learning, leveraging existing models trained in higher-resource languages for use with new, low-resource ones
My contributions to this agenda include projects that focus on unsupervised morpheme segmentation, linguistically informed multilingual modeling, cross-lingual model transfer, and unsupervised machine translation. For a more detailed breakdown of my work, please see the CV information and links on this page.
Education
-
Ph.D., Computational Linguistics
September 2018 - June 2024
University of Washington; Advised by Gina-Anne Levow and Shane Steinert-Threlkeld
M.S. Computational Linguistics earned upon candidacy, June 2022
-
B.S. Linguistics, Russian
August 2014 - May 2018
Tulane University; Advised by Charles Mignot and Judith Maxwell
Employment
-
University of Washington Department of Linguistics
September 2018 - June 2024
Teaching/Research Assistant
-
MSR Cryptography and Privacy
June - September 2022
Microsoft; Research Intern; Supervised by Kim Laine
-
Siri Web Answers
June - September 2020
Apple; AI|ML Intern; Supervised by Chris DuBois
Research Positions
-
Presidential Dissertation Fellowship
January - March 2024
UW Graduate School
-
Northwest Sahaptin Language Text Dissemination
September - December 2023
UW Royalty Research Fund. Research Assistant; Supervised by Sharon Hargus (PI)
-
Excellence in Linguistic Research Fellowship
September - December 2022
UW Department of Linguistics
-
Learning to Translate by Learning to Communicate
June - September 2021
UW Royalty Research Fund. Research Assistant; Supervised by Shane Steinert-Threlkeld (PI)
-
Foreign Language and Area Studies Fellowship - Inuktitut Language
September 2019 - June 2020
U.S. Department of Education, UW Jackson School of International Studies
-
Low Resource Languages for Emergent Incidents
August - October 2019
DARPA. Research Assistant; Supervised by Gina-Anne Levow
-
Newberry Research Library Summer Institute
July - August 2019
Fellow; Supervised by Jenny Davis
Publications
2024
2023
-
Embedding Structure Matters: Comparing Methods to Adapt Multilingual Vocabularies to New Languages
October 2023
Proceedings of the 3rd Workshop on Multilingual Representation Learning
ACL Anthology Arxiv Code
-
Learning to Translate by Learning to Communicate
October 2023
Proceedings of the 3rd Workshop on Multilingual Representation Learning
ACL Anthology Arxiv Code
2022
-
Planting and Mitigating Memorized Content in Predictive-Text Language Models
December 2022
Preprint
Arxiv
-
A Masked Segmental Language Model for Natural Language Segmentation
June 2022
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
ACL Anthology Arxiv Code
-
Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
May 2022
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics
ACL Anthology Arxiv Code
-
Emergent Communication Fine-tuning (EC-FT) for Pretrained Language Models
April 2022
Proceedings of the 5th Annual Workshop on Emergent Communication
OpenReview (official) Code
OpenReview (official) Code
Talks
Refereed
-
Embedding Structure Matters: Comparing Methods to Adapt Multilingual Vocabularies to New Languages
December 2023
3rd Workshop on Multilingual Representation Learning
-
Learning to Translate by Learning to Communicate
December 2023
3rd Workshop on Multilingual Representation Learning (poster)
-
A Masked Segmental Language Model for Natural Language Segmentation
July 2022
19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
-
Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
May 2022
60th Annual Meeting of the Association for Computational Linguistics (poster)
Invited
-
Improving Computational Tools for Under-resourced and Endangered Languages
University of British Columbia Department of Linguistics, February 15 2024
University of Pittsburgh Department of Linguistics, February 9 2024
Macalester College Department of Linguistics, January 30 2024
University of Utah Department of Linguistics, January 23 2024
University of Rochester Goergen Institute for Data Science and Department of Linguistics, November 29 2023
-
Comparing Methods to Adapt Multilingual Vocabularies to New Languages
September 13 2023
UW NLP Retreat
-
Collaborative Coding Best Practices
UW Computational Linguistics Treehouse, February 10 2023
UW Computational Linguistics Treehouse, October 22 2021
-
Introduction to Differential Privacy
October 14 2022
UW Computational Linguistics Treehouse
-
Learning to Translate by Learning to Communicate
September 26 2022
UW NLP Retreat
-
Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
May 5 2022
UW Electrical Engineering TIAL Lab (PI: Mari Ostendorf)
-
A Masked Segmental Language Model for Unsupervised Sequence Segmentation
September 20 2021
UW NLP Retreat
-
Segmental Language Modeling
April 20 2021
UW CLMBR Lab Group (PI: Shane Steinert-Threlkeld)
-
Archival Work and Language Revitalization at the NCAIS Summer Institute
December 13 2019
UW Linguistics Field Reports Meeting
-
Dependency vs Phrase-Structure Trees
October 4 2019
University of Washington Linguistics Syntax Roundtable
-
Subword Segmentation for Morphologically Complex Languages
September 28 2019
UW NLP Retreat
Teaching
-
UW - LING 473: Basics for Computational Linguistics
Summer 2023
Co-Instructor with Emily Proch Ahn
-
UW - LING 574: Deep Learning for Natural Language Processing
Instructor (course webpage), Spring 2023
Teaching Assistant, Spring 2022
Teaching Assistant, Spring 2021
-
UW - LING 572: Advanced Statistical Methods for Natural Language Processing
Winter 2023
Teaching Assistant
-
UW - LING 200: Introduction to Linguistic Thought
Teaching Assistant, Winter 2022
Teaching Assistant, Spring 2019
Teaching Assistant, Fall 2018
-
UW - LING 269: Swearing and Taboo Language
Fall 2021
Teaching Assistant
-
UW - LING 566: Introduction to Syntax for Computational Linguistics
Fall 2020
Teaching Assistant
-
UW - ASL 305: Introduction to American Deaf Culture
Spring 2019
Grader
-
UW - LING 406: Introduction to Syntax
Fall 2018
Grader
Guest Lectures
-
Multilingual Language Modeling
as part of LING 574: Deep Learning for NLP; University of Washington; Instructed by Shane Steinert-Threlkeld
May 22 2024
May 18 2022
May 26 2021
-
Computational Linguistics and Language Revitalization
May 4 2020
as part of LING 234: Language and Diversity; University of Washington; Instructed by Lorna Rozelle
-
Computational Linguistics and Language Revitalization
January 27 2020
as part of ENGL 4717: NAIS Capstone Seminar; University of Colorado; Instructed by Penelope Kelsey
Service
-
ACL Reviewer-Paper Assignment System
December 2020 - February 2021
Research Assistant/Developer: Supervised by Fei Xia