
  • Risks and NLP Design: A Case Study on Procedural Document QA

    Nikita Haduong, Alice Gao, Noah A. Smith
    ACL • Findings, 2023
    As NLP systems are increasingly deployed at scale, concerns about their potential negative impacts have attracted the attention of the research community, yet discussions of risk have mostly been at an abstract level and focused on generic AI or NLP…
  • Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi
    ACL, 2023
    Large “instruction-tuned” language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in…
  • Stubborn Lexical Bias in Data and Models

    Sofia Serrano, Jesse Dodge, Noah A. Smith
    ACL, 2023
    In NLP, recent work has seen increased focus on spurious correlations between various features and labels in training data, and how these influence model behavior. However, the presence and effect of such correlations are typically examined feature by feature…
  • Task-aware Retrieval with Instructions

    Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, Wen-tau Yih
    ACL • Findings, 2023
    We study the problem of retrieval with instructions, where users of a retrieval system explicitly describe their intent along with their queries. We aim to develop a general-purpose task-aware retrieval system using multi-task instruction tuning, which can…
  • When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, Hannaneh Hajishirzi
    ACL, 2023
    Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the difficulty of encoding a wealth of world knowledge in their parameters. This paper aims to understand LMs…
  • Words as Gatekeepers: Measuring Discipline-specific Terms and Meanings in Scholarly Publications

    Li Lucy, Jesse Dodge, David Bamman, Katherine A. Keith
    Findings of ACL, 2023
    Scholarly text is often laden with jargon, or specialized language that can facilitate efficient in-group communication within fields but hinder understanding for out-groups. In this work, we develop and validate an interpretable approach for measuring…
  • Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

    Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot
    ICML 2023, the Challenges in Deployable Generative AI workshop, 2023
    As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite on the multi-step reasoning capabilities of large…
  • ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews

    Mike D'Arcy, Alexis Ross, Erin Bransom, Bailey Kuehl, Jonathan Bragg, Tom Hope, Doug Downey
    arXiv.org, 2023
    Revising scientific papers based on peer feedback is a challenging task that requires not only deep scientific knowledge and reasoning, but also the ability to recognize the implicit requests in high-level feedback and to choose the best of many possible ways…
  • Evaluating the Social Impact of Generative AI Systems in Systems and Society

    Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Hal Daumé, Jesse Dodge, Ellie Evans, Sara Hooker, Yacine Jernite, A. Luccioni, Alberto Lusoli, Margaret Mitchell, J. Newman, Marie-Therese Png, A. Strait, Apostol T. Vassilev
    arXiv.org, 2023
    Generative AI systems across modalities, ranging from text, image, audio, and video, have broad social impacts, but there exists no official standard for means of evaluating those impacts and which impacts should be evaluated. We move toward a standard…
  • Morphosyntactic probing of multilingual BERT models

    Judit Ács, Endre Hamerlik, Roy Schwartz, Noah A. Smith, András Kornai
    Journal of Natural Language Engineering, 2023
    We introduce an extensive dataset for multilingual probing of morphological information in language models (247 tasks across 42 languages from 10 families), each consisting of a sentence with a target word and a morphological tag as the desired label, derived…