Papers

Viewing 21-30 of 991 papers
  • Calibrating Large Language Models with Sample Consistency

    Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch. arXiv, 2024. Accurately gauging the confidence level of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their proprietary nature and…
  • Improving Stratocumulus Cloud Amounts in a 200‐m Resolution Multi‐Scale Modeling Framework Through Tuning of Its Interior Physics

    Liran Peng, P. Blossey, W. Hannah, C. Bretherton, C. Terai, A. Jenney, M. Pritchard. Journal of Advances in Modeling Earth Systems, 2024. High‐Resolution Multi‐scale Modeling Frameworks (HR), global climate models that embed separate, convection‐resolving models with high enough resolution to resolve boundary layer eddies, have exciting potential for investigating low cloud feedback dynamics due…
  • Global Precipitation Correction Across a Range of Climates Using CycleGAN

    Jeremy McGibbon, S. K. Clark, Brian Henn, Anna Kwa, Oliver Watt‐Meyer, W. Perkins, Christopher S. Bretherton. Geophysical Research Letters, 2024. Accurate precipitation simulations for various climate scenarios are critical for understanding and predicting the impacts of climate change. This study employs a Cycle‐generative adversarial network (CycleGAN) to improve global 3‐hr‐average precipitation…
  • TimeArena: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation

    Yikai Zhang, Siyu Yuan, Caiyu Hu, Kyle Richardson, Yanghua Xiao, Jiangjie Chen. arXiv, 2024. Despite remarkable advancements in emulating human-like behavior through Large Language Models (LLMs), current textual simulations do not adequately address the notion of time. To this end, we introduce TimeArena, a novel textual simulated environment that…
  • OLMo: Accelerating the Science of Language Models

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, A. Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Daniel Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hanna Hajishirzi. arXiv, 2024. Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of…
  • Neural Network Parameterization of Subgrid‐Scale Physics From a Realistic Geography Global Storm‐Resolving Simulation

    Oliver Watt‐Meyer, Noah D. Brenowitz, S. K. Clark, Brian Henn, Anna Kwa, Jeremy McGibbon, W. Perkins, Lucas Harris, Christopher S. Bretherton. Journal of Advances in Modeling Earth Systems, 2024. Parameterization of subgrid‐scale processes is a major source of uncertainty in global atmospheric model simulations. Global storm‐resolving simulations use a finer grid (less than 5 km) to reduce this uncertainty by explicitly resolving deep convection and…
  • The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

    Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe. arXiv.org, 2024. How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as language models have…
  • Tropical Cirrus Are Highly Sensitive to Ice Microphysics Within a Nudged Global Storm‐Resolving Model

    R. Atlas, C. Bretherton, A. Sokol, P. Blossey, M. F. Khairoutdinov. Geophysical Research Letters, 2024. Cirrus dominate the longwave radiative budget of the tropics. For the first time, the variability in cirrus properties and longwave cloud radiative effects (CREs) that arises from using different microphysical schemes within nudged global storm‐resolving…
  • Paloma: A Benchmark for Evaluating Language Model Fit

    Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, A. Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hanna Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge. arXiv, 2023. Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains–varying distributions of language. Rather than assuming perplexity on one distribution…
  • Catwalk: A Unified Language Model Evaluation Framework for Many Datasets

    Dirk Groeneveld, Anas Awadalla, Iz Beltagy, Akshita Bhagia, Ian Magnusson, Hao Peng, Oyvind Tafjord, Pete Walsh, Kyle Richardson, Jesse Dodge. arXiv.org, 2023. The success of large language models has shifted the evaluation paradigms in natural language processing (NLP). The community's interest has drifted towards comparing NLP models across many tasks, domains, and datasets, often at an extreme scale. This imposes…