Data Sets

RECAP is a benchmark for intent rewriting in conversational AI. It evaluates how LLMs turn ambiguous, underspecified, or shifting dialogue into clear, planning-ready intent—helping developers build more reliable and effective agentic systems.

FactLens is a benchmark for fine-grained fact verification with LLMs. It breaks complex claims into sub-claims, enabling more precise error detection, better transparency, and high-quality evaluation aligned with human judgments.
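The value of sub-claim decomposition can be illustrated with a minimal sketch. The class names, verdict labels, and example sub-claims below are illustrative only, not FactLens's actual schema or method; the point is that verifying sub-claims individually pinpoints which part of a compound claim fails.

```python
from dataclasses import dataclass

@dataclass
class SubClaim:
    text: str
    verdict: str  # "supported" or "refuted" -- labels are illustrative

def aggregate(subclaims):
    """A claim holds only if every sub-claim is supported; the first
    failing sub-claim pinpoints where the error is."""
    failing = [s for s in subclaims if s.verdict != "supported"]
    if not failing:
        return "supported", None
    return "refuted", failing[0]

# Example: a compound claim with one true and one false sub-claim.
subclaims = [
    SubClaim("Marie Curie won a Nobel Prize in Physics.", "supported"),
    SubClaim("Her Physics prize was awarded in 1911.", "refuted"),
]
verdict, culprit = aggregate(subclaims)
```

Verifying the whole claim at once would only say "refuted"; the per-sub-claim view also surfaces which statement is wrong.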

HappyDB is a large, crowd-sourced database containing 100,000 “happy moments,” designed to advance NLP technology in understanding expressions of happiness in text. This project aims to uncover insights into happiness-inducing events and develop systems that suggest actions to enhance well-being. Positioned at the intersection of NLP and positive psychology, HappyDB offers a unique resource for research on understanding and fostering happiness.
Previous research has primarily focused on examining how entities influence retrieval models and knowledge recall in LMs, leaving other aspects relatively unexplored. In this work, our goal is to offer a more detailed, fact-centric analysis by exploring the effects of combinations of entities and relations. To facilitate this, we construct a new question answering (QA) dataset called WITQA (Wikipedia Triple Question Answers). This dataset includes questions about entities and relations of various popularity levels, each accompanied by a supporting passage. Our extensive experiments with diverse LMs and retrievers reveal when retrieval does not consistently enhance LMs, viewed through the lens of fact-centric popularity.
This dataset includes ambiguity categories and, for each, additional instructions that mitigate the ambiguity. It was constructed through an LLM-in-the-loop annotation process on the Super-NaturalInstructions benchmark, and comprises 2,500 instances annotated with the ambiguity taxonomy and corresponding additional instructions.
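A record in such a dataset might be used roughly as follows. The field names and example values here are hypothetical, chosen only to illustrate pairing an ambiguity category with a mitigating instruction; they are not the dataset's actual schema.

```python
# Hypothetical record layout -- field names and values are illustrative,
# not the dataset's actual schema.
record = {
    "instruction": "Summarize the review.",
    "ambiguity_category": "underspecified output length",
    "additional_instruction": "Limit the summary to two sentences.",
}

def disambiguate(rec):
    """Append the mitigating instruction to the original instruction."""
    return f'{rec["instruction"]} {rec["additional_instruction"]}'
```

Calling `disambiguate(record)` yields a single clarified instruction that resolves the annotated ambiguity before it reaches the model.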
Holistic Reasoning Benchmark (HoloBench) is a new framework specifically designed to evaluate LCLMs’ ability to perform holistic reasoning over long contexts. HoloBench leverages database reasoning operations to systematically evaluate how well models can aggregate, compare, and draw conclusions from distributed information. By adapting existing text-to-SQL benchmarks, HoloBench enables an automated and scalable evaluation process, eliminating the need for labor-intensive manual annotations. A key innovation of HoloBench is its ability to control three critical factors that influence LCLM performance: (1) the length of the context and the amount of information contained within, (2) the position of relevant information within the context, and (3) the type and difficulty of queries. These factors allow for a more comprehensive assessment of LCLMs’ holistic reasoning capabilities.
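Controlling these three factors can be sketched as follows. The class, field names, and the context-building helper are assumptions made for illustration, not HoloBench's actual implementation; the sketch only shows how a configuration could place relevant information at a chosen position within a context of controlled size.

```python
from dataclasses import dataclass

@dataclass
class EvalConfig:
    # The three controlled factors; names and values are illustrative.
    context_tokens: int       # (1) context length / amount of information
    relevant_position: float  # (2) where relevant rows sit: 0.0=start, 1.0=end
    query_type: str           # (3) query type/difficulty, e.g. "aggregation"

def build_context(relevant_rows, filler_rows, cfg):
    """Place the relevant rows at a controlled position among filler rows."""
    idx = int(cfg.relevant_position * len(filler_rows))
    return filler_rows[:idx] + relevant_rows + filler_rows[idx:]
```

Sweeping `relevant_position` from 0.0 to 1.0 while holding the other factors fixed isolates the effect of information placement on model performance.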
November 7, 2025
“Mixed Signals” exposes hidden biases in VLMs, with major implications for healthcare, RAG systems, and AI safety.
July 15, 2025
Stream processing is a key ingredient in making “agentic workflows” enterprise-ready. Streams support a wide range of workflows and handle complexity, while at the same time providing the right abstractions and scope for accuracy, scalability, and ease of use.
February 5, 2025
With the MCRank benchmark and our EXSIR method, we’ve shown that LLMs can significantly improve their performance on these challenging tasks when guided by structured reasoning.