C.M. Downey

Assistant Professor: Linguistics and Data Science, University of Rochester

About Me

My research develops methods to improve the efficacy of Natural Language Processing (NLP) tools for under-resourced languages (those lacking the abundant data needed to train modern machine learning models). The most common approach to building machine learning systems is to train huge neural networks on high-resource languages like English and Chinese, for which vast amounts of textual data (i.e. hundreds of gigabytes) are available. Such techniques are inapplicable to the majority of the world's languages, which lack the requisite large text datasets. This methodological gap undermines the potentially vital role these systems can play in creating tools such as assisted completion and keyboard auto-correct features, automatic speech recognition, and machine translation services. Developing such tools helps ensure that minority and endangered languages can thrive in the digital era. To address this gap, I specialize in machine learning techniques that are applicable to under-resourced languages, with a strong emphasis on:

  • unsupervised/self-supervised learning, enabling training with raw text or much smaller amounts of specialized data than supervised paradigms
  • multilingual modeling, allowing language data to be pooled by training on more than one language at once
  • transfer learning, leveraging existing models trained in higher-resource languages for use with new, low-resource ones

My contributions to this agenda include projects that focus on unsupervised morpheme segmentation, linguistically informed multilingual modeling, cross-lingual model transfer, and unsupervised machine translation. For a more detailed breakdown of my work, please see the CV information and links on this page.

Education

Employment

  • University of Washington Department of Linguistics
    September 2018 - June 2024

    Teaching/Research Assistant

  • MSR Cryptography and Privacy
    June - September 2022

    Microsoft; Research Intern; Supervised by Kim Laine

  • Siri Web Answers
    June - September 2020

    Apple; AI|ML Intern; Supervised by Chris DuBois

Research Positions

  • Presidential Dissertation Fellowship
    January - March 2024

    UW Graduate School

  • Northwest Sahaptin Language Text Dissemination
    September - December 2023

    UW Royalty Research Fund. Research Assistant; Supervised by Sharon Hargus (PI)

  • Excellence in Linguistic Research Fellowship
    September - December 2022

    UW Department of Linguistics

  • Learning to Translate by Learning to Communicate
    June - September 2021

    UW Royalty Research Fund. Research Assistant; Supervised by Shane Steinert-Threlkeld (PI)

  • Foreign Language and Area Studies Fellowship - Inuktitut Language
    September 2019 - June 2020

    U.S. Department of Education, UW Jackson School of International Studies

  • Low Resource Languages for Emergent Incidents
    August - October 2019

    DARPA. Research Assistant; Supervised by Gina-Anne Levow

  • Newberry Research Library Summer Institute
    July - August 2019

    Fellow; Supervised by Jenny Davis

Publications

2024

  • Targeted Multilingual Adaptation for Low-resource Language Families
    May 2024

    Preprint

    C.M. Downey, Terra Blevins, Dhwani Serai, Dwija Parikh, Shane Steinert-Threlkeld

    Arxiv Code

2023

  • Embedding Structure Matters: Comparing Methods to Adapt Multilingual Vocabularies to New Languages
    October 2023

    Proceedings of the 3rd Workshop on Multilingual Representation Learning

    C.M. Downey, Terra Blevins, Nora Goldfine, Shane Steinert-Threlkeld

    ACL Anthology Arxiv Code
  • Learning to Translate by Learning to Communicate
    October 2023

    Proceedings of the 3rd Workshop on Multilingual Representation Learning

    C.M. Downey, Leo Z. Liu, Xuhui Zhou, Shane Steinert-Threlkeld

    ACL Anthology Arxiv Code

2022

  • Planting and Mitigating Memorized Content in Predictive-Text Language Models
    December 2022

    Preprint

    C.M. Downey, Wei Dai, Huseyin A. Inan, Kim Laine, Saurabh Naik, Tomasz Religa

    Arxiv
  • A Masked Segmental Language Model for Natural Language Segmentation
    June 2022

    Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

    C.M. Downey, Fei Xia, Gina-Anne Levow, Shane Steinert-Threlkeld

    ACL Anthology Arxiv Code
  • Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
    May 2022

    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics

    C.M. Downey, Shannon Drizin, Levon Haroutunian, Shivin Thukral

    ACL Anthology Arxiv Code
  • Emergent Communication Fine-tuning (EC-FT) for Pretrained Language Models
    April 2022

    Proceedings of the 5th Annual Workshop on Emergent Communication

    Shane Steinert-Threlkeld, Leo Z. Liu, Xuhui Zhou, C.M. Downey

    OpenReview (official) Code

Talks

Refereed

  • Embedding Structure Matters: Comparing Methods to Adapt Multilingual Vocabularies to New Languages
    December 2023

    3rd Workshop on Multilingual Representation Learning

  • Learning to Translate by Learning to Communicate
    December 2023

    3rd Workshop on Multilingual Representation Learning (poster)

  • A Masked Segmental Language Model for Natural Language Segmentation
    July 2022

    19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

  • Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
    May 2022

    60th Annual Meeting of the Association for Computational Linguistics (poster)

Invited

  • Improving Computational Tools for Under-resourced and Endangered Languages
    • University of British Columbia Department of Linguistics

      February 15 2024
    • University of Pittsburgh Department of Linguistics

      February 9 2024
    • Macalester College Department of Linguistics

      January 30 2024
    • University of Utah Department of Linguistics

      January 23 2024
    • University of Rochester Goergen Institute for Data Science and Department of Linguistics

      November 29 2023
  • Comparing Methods to Adapt Multilingual Vocabularies to New Languages
    September 13 2023

    UW NLP Retreat

  • Collaborative Coding Best Practices
    • UW Computational Linguistics Treehouse

      February 10 2023
    • UW Computational Linguistics Treehouse

      October 22 2021
  • Introduction to Differential Privacy
    October 14 2022

    UW Computational Linguistics Treehouse

  • Learning to Translate by Learning to Communicate
    September 26 2022

    UW NLP Retreat

  • Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
    May 5 2022

    UW Electrical Engineering TIAL Lab (PI: Mari Ostendorf)

  • A Masked Segmental Language Model for Unsupervised Sequence Segmentation
    September 20 2021

    UW NLP Retreat

  • Segmental Language Modeling
    April 20 2021

    UW CLMBR Lab Group (PI: Shane Steinert-Threlkeld)

  • Archival Work and Language Revitalization at the NCAIS Summer Institute
    December 13 2019

    UW Linguistics Field Reports Meeting

  • Dependency vs Phrase-Structure Trees
    October 4 2019

    University of Washington Linguistics Syntax Roundtable

  • Subword Segmentation for Morphologically Complex Languages
    September 28 2019

    UW NLP Retreat

Teaching

  • UW - LING 473: Basics for Computational Linguistics
    Summer 2023

    Co-Instructor with Emily Proch Ahn

  • UW - LING 574: Deep Learning for Natural Language Processing
    • Instructor (course webpage)

      Spring 2023
    • Teaching Assistant

      Spring 2022
    • Teaching Assistant

      Spring 2021
  • UW - LING 572: Advanced Statistical Methods for Natural Language Processing
    Winter 2023

    Teaching Assistant

  • UW - LING 200: Introduction to Linguistic Thought
    • Teaching Assistant

      Winter 2022
    • Teaching Assistant

      Spring 2019
    • Teaching Assistant

      Fall 2018
  • UW - LING 269: Swearing and Taboo Language
    Fall 2021

    Teaching Assistant

  • UW - LING 566: Introduction to Syntax for Computational Linguistics
    Fall 2020

    Teaching Assistant

  • UW - ASL 305: Introduction to American Deaf Culture
    Spring 2019

    Grader

  • UW - LING 406: Introduction to Syntax
    Fall 2018

    Grader

Guest Lectures

  • Multilingual Language Modeling

    as part of LING 574: Deep Learning for NLP; University of Washington; Instructed by Shane Steinert-Threlkeld

    • May 22 2024

    • May 18 2022

    • May 26 2021

  • Computational Linguistics and Language Revitalization
    May 4 2020

    as part of LING 234: Language and Diversity; University of Washington; Instructed by Lorna Rozelle

  • Computational Linguistics and Language Revitalization
    January 27 2020

    as part of ENGL 4717: NAIS Capstone Seminar; University of Colorado; Instructed by Penelope Kelsey

Service

  • ACL Reviewer-Paper Assignment System
    December 2020 - February 2021

    Research Assistant/Developer: Supervised by Fei Xia