Datasets

Find us on Hugging Face
AI2's latest open-source models and datasets can be found on our Hugging Face page.


Multihop Questions via Single-hop Question Composition
Multihop reading comprehension dataset with 2-4 hop questions.
Aristo • 2022
MuSiQue is a multihop reading comprehension dataset with 2-4 hop questions, built by composing seed questions from 5 existing single-hop datasets. The dataset is constructed with a bottom-up approach that systematically selects composable pairs of single-hop…
Drug Combinations Dataset
A Dataset for N-ary Relation Extraction of Drug Combinations
AI2 Israel • 2022
Combination therapies have become the standard of care for diseases such as cancer, tuberculosis, malaria and HIV. However, the combinatorial set of available multi-drug treatments creates a challenge in identifying effective combination therapies available…
S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications
A dataset to study mentorship relationships in academia and corporate research labs
Semantic Scholar • 2022
Mentorship is a critical component of academia, but is not as visible as publications, citations, grants, and awards. Despite the importance of studying the quality and impact of mentorship, there are few large representative mentorship datasets available. We…
NumGLUE
NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks
2022
Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple calculations is an important skill of AI systems. While many datasets and models have been developed to this end, state-of-the-art AI systems are brittle; failing to…
Web10K Dataset
38,176 queries and corresponding 1M+ images returned by Bing Image Search
PRIOR • 2022
Web10K is a dataset sourced from web image search data with over 10K concepts. It consists of 38,176 queries and the corresponding 1M+ images returned by Bing Image Search. Web10K provides dense coverage of feasible adjective-noun and verb-noun combinations…
The Fermi Challenge
A challenge dataset of Fermi (estimation) problems, currently beyond the capabilities of modern methods.
Aristo • 2021
Qasper
Question Answering on Research Papers
AllenNLP, Semantic Scholar • 2021
A dataset containing 1585 papers with 5049 information-seeking questions asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners.
BeliefBank
4998 facts and 12147 constraints to test a model's consistency
Aristo • 2021
Dataset of 4998 simple facts and 12147 constraints to test, and improve, a model's accuracy and consistency.
EntailmentBank
2k multi-step entailment trees, explaining the answers to ARC science questions
Aristo • 2021
S2AND: Semantic Scholar's Author Disambiguation Algorithm & Evaluation Suite
A dataset to train and evaluate models that do author disambiguation, aka figuring out who wrote which paper
Semantic Scholar • 2021
A unified benchmark dataset for AND on scholarly papers, as well as an open-source reference model implementation. Our dataset harmonizes eight disparate AND datasets into a uniform format, with a single rich feature set drawn from the Semantic Scholar (S2…

