Data Sets

HappyDB is a large, crowd-sourced corpus of 100,000 “happy moments,” designed to advance NLP techniques for understanding expressions of happiness in text. The project aims to uncover insights into happiness-inducing events and to support systems that suggest actions to enhance well-being. Positioned at the intersection of NLP and positive psychology, HappyDB offers a unique resource for research on understanding and fostering happiness.
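As a rough illustration of how the corpus can be explored, the sketch below loads the cleaned release with pandas. The file name cleaned_hm.csv and the columns cleaned_hm and predicted_category follow the public HappyDB release, but should be verified against the actual download.

```python
# Minimal sketch: exploring HappyDB with pandas. Assumes the cleaned corpus
# file `cleaned_hm.csv` from the public release, with a `cleaned_hm` column
# (the happy-moment text) and a `predicted_category` column (the happiness
# category); verify the exact names against the actual download.
import pandas as pd

df = pd.read_csv("cleaned_hm.csv")
print(len(df))                                  # ~100,000 happy moments
print(df["predicted_category"].value_counts())  # e.g. affection, achievement, ...
print(df["cleaned_hm"].iloc[0])                 # one crowd-sourced happy moment
```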
Previous research has primarily examined how entities influence retrieval models and knowledge recall in LMs, leaving other aspects relatively unexplored. In this work, our goal is to offer a more detailed, fact-centric analysis by exploring the combined effects of entities and relations. To facilitate this, we construct a new question answering (QA) dataset called WITQA (Wikipedia Triple Question Answers). This dataset includes questions about entities and relations spanning various popularity levels, each accompanied by a supporting passage. Our extensive experiments with diverse LMs and retrievers reveal when retrieval does not consistently enhance LMs from the viewpoint of fact-centric popularity.
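To make the structure concrete, a WITQA instance can be thought of as a (subject, relation, object) triple paired with a question, a supporting passage, and popularity signals for the entity and the relation. The schema below is an illustrative sketch; all field names are hypothetical, not the dataset's actual format.

```python
# Illustrative sketch of a WITQA-style instance: a question derived from a
# Wikipedia triple, popularity signals for the subject entity and the
# relation, and a supporting passage. All field names are hypothetical.
from dataclasses import dataclass

@dataclass
class WITQAExample:
    subject: str             # subject entity of the triple
    relation: str            # relation between subject and object
    answer: str              # object entity, i.e. the gold answer
    question: str            # natural-language question about the triple
    passage: str             # supporting passage that entails the answer
    subject_popularity: int  # e.g. page views for the subject entity
    relation_frequency: int  # how often the relation occurs in the KB

ex = WITQAExample(
    subject="Douglas Adams",
    relation="notable work",
    answer="The Hitchhiker's Guide to the Galaxy",
    question="Which novel is Douglas Adams best known for?",
    passage="Douglas Adams wrote The Hitchhiker's Guide to the Galaxy ...",
    subject_popularity=123456,
    relation_frequency=78901,
)
```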
This dataset includes the ambiguity categories and their corresponding additional instructions to mitigate each ambiguity. It was constructed through an LLM-in-the-loop annotation process on the Super-NaturalInstructions benchmark and comprises 2,500 instances annotated with the ambiguity taxonomy and corresponding additional instructions.
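As an illustration of the record format (field names hypothetical), each instance pairs a task instruction with its ambiguity category and the additional instruction that resolves the ambiguity:

```python
# Hypothetical sketch of one annotated instance: a task instruction, the
# ambiguity category assigned from the taxonomy, and the additional
# instruction that mitigates the ambiguity. Field names are illustrative.
instance = {
    "instruction": "Summarize the following article.",
    "ambiguity_category": "unspecified output length",
    "additional_instruction": "Write the summary in at most two sentences.",
}
# The disambiguated prompt is the original instruction plus the addition.
print(instance["instruction"] + " " + instance["additional_instruction"])
```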
Holistic Reasoning Benchmark (HoloBench) is a new framework specifically designed to evaluate LCLMs’ ability to perform holistic reasoning over long contexts. HoloBench leverages database reasoning operations to systematically evaluate how well models can aggregate, compare, and draw conclusions from distributed information. By adapting existing text-to-SQL benchmarks, HoloBench enables an automated and scalable evaluation process, eliminating the need for labor-intensive manual annotations. A key innovation of HoloBench is its ability to control three critical factors that influence LCLM performance: (1) the length of the context and the amount of information contained within, (2) the position of relevant information within the context, and (3) the type and difficulty of queries. These factors allow for a more comprehensive assessment of LCLMs’ holistic reasoning capabilities.
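The sketch below illustrates the context-construction idea under these controls; it is a minimal approximation, not the actual HoloBench implementation.

```python
# Minimal sketch of HoloBench-style context construction (not the actual
# implementation): verbalized database rows that answer a query are placed
# at a controlled relative position inside a long context of distractor
# rows, so context length, information position, and query difficulty can
# be varied independently.
import random

def build_context(relevant_rows, distractor_rows, position=0.5, seed=0):
    """Insert relevant_rows at a relative position (0.0=start, 1.0=end)
    within the shuffled distractors, then join into one long context."""
    rng = random.Random(seed)
    rows = list(distractor_rows)
    rng.shuffle(rows)
    insert_at = int(position * len(rows))
    rows[insert_at:insert_at] = relevant_rows
    return "\n".join(rows)

relevant = [
    "Order 17 was placed by Alice for $120.",
    "Order 42 was placed by Alice for $80.",
]
distractors = [f"Order {i} was placed by Bob for ${i}." for i in range(1000)]
context = build_context(relevant, distractors, position=0.9)
# An aggregation query such as "How much did Alice spend in total?" requires
# combining both relevant rows -- holistic reasoning, not single-span lookup.
```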
XATU is a novel fine-grained instruction-based benchmark designed for explainable text updates. The benchmark leverages high-quality existing data sources from different tasks to enable automatic evaluation of LLM editing capabilities, incorporating an LLM-in-the-loop annotation process. In comparison to other datasets and benchmarks, XATU stands out for its wider range of diverse topics, fine-grained edit instructions, and corresponding explanation rationales.
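To show what such an instance might look like, the sketch below uses hypothetical field names; it is not the benchmark's actual schema.

```python
# Illustrative XATU-style instance (field names hypothetical): a source
# text, a fine-grained edit instruction, the gold edited text, and an
# explanation rationale for why the edit satisfies the instruction.
example = {
    "source": "The experiment was ran twice.",
    "instruction": "Fix the grammatical error in the verb form.",
    "target": "The experiment was run twice.",
    "explanation": "The passive voice requires the past participle 'run', "
                   "not the simple past 'ran'.",
}
```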
SubjQA is a question answering dataset that focuses on subjective (as opposed to factual) questions and answers. The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants. Each question is paired with a review, and a span within the review is highlighted as the answer to the question (some questions have no answer). Moreover, both questions and answer spans are assigned a subjectivity label by annotators. A question such as “How much does this product weigh?” is factual (i.e., low subjectivity), while “Is this easy to use?” is subjective (i.e., high subjectivity).
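A minimal loading sketch follows, assuming SubjQA is available on the Hugging Face Hub under the id "subjqa" with one configuration per domain; the field names may differ from the actual release and should be checked against the dataset card.

```python
# Minimal sketch, assuming SubjQA is published on the Hugging Face Hub under
# the id "subjqa" with one configuration per domain. The field names below
# may differ from the actual release; check the dataset card to confirm.
from datasets import load_dataset

ds = load_dataset("subjqa", "electronics")
ex = ds["train"][0]
print(ex["question"])  # e.g. "Is this easy to use?"
print(ex["context"])   # the product review the question is asked about
# Subjectivity annotations (for both questions and answer spans) distinguish
# factual, low-subjectivity questions from subjective, high-subjectivity ones.
```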