Datasets
Viewing 31-40 of 86 datasets
Real Toxicity Prompts
A dataset of 100k sentence snippets from the web for researchers to further address the risk of neural toxic degeneration in models.Mosaic • 2020A dataset of 100k sentence snippets from the web for researchers to further address the risk of neural toxic degeneration in models.eQASC: Multihop Explanations for QASC
98k annotated explanations for the QASC datasetAristo • 2020This dataset contains 98k 2-hop explanations for questions in the QASC dataset, with annotations indicating if they are valid (~25k) or invalid (~73k) explanations.hasPart KB
A high-quality KB of hasPart relationsAristo • 2020A high-quality knowledge base of ~50k hasPart relationships, extracted from a large corpus of generic statements.SciDocs
Academic paper representation dataset accompanying the SPECTER paper/modelSemantic Scholar • 2020Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives…GenericsKB
A large knowledge base of generic sentencesAristo • 2020The GenericsKB contains 3.4M+ generic sentences about the world, i.e., sentences expressing general truths such as "Dogs bark," and "Trees remove carbon dioxide from the atmosphere." Generics are potentially useful as a knowledge source for AI systems…SciFact
1.4K expert-written scientific claims paired with evidence-containing abstracts.Semantic Scholar • 2020Due to the rapid growth in the scientific literature, there is a need for automated systems to assist researchers and the public in assessing the veracity of scientific claims. To facilitate the development of systems for this task, we introduce SciFact, a…TORQUE
A new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships.AllenNLP • 2020A critical part of reading is being able to understand the temporal relationships between events described in a passage of text, even when those relationships are not explicitly stated. However, current machine reading comprehension benchmarks have…Contrast Sets
Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities.AllenNLP • 2020Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on…CORD-19: COVID-19 Open Research Dataset
Tens of thousands of scholarly articles about COVID-19 and related coronavirusesSemantic Scholar • 2020CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.Break
83,978 examples sampled from 10 question answering datasets over text, images and databases.AI2 Israel, Question Understanding • 2020Break is a human annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). Break consists of 83,978 examples sampled from 10 question answering datasets over text, images and databases.