As language models scale in size and capability, so do our expectations for how they process and understand information. Today’s real-world problems rarely involve isolated paragraphs—they require reading, comparing, and reasoning over multiple documents, often of varying length, format, and quality.
We present three new papers that tackle a pressing and underexplored topic in NLP: multi-document reasoning. These works offer rigorous benchmarks, novel methodologies, and empirical insights into how large language models (LLMs) handle complexity across multiple sources of information.
Let’s take a closer look at what these contributions reveal—and how they collectively push the boundaries of multi-document understanding.
Holistic Reasoning with Long-Context LMs: Introducing HoloBench
Modern tasks demand more than retrieving a handful of relevant snippets; they require models to synthesize and reason across entire corpora. Retrieval-augmented generation (RAG) systems have taken us far in accessing relevant content, but they often stumble when asked to aggregate and make sense of information scattered across documents.
Long-context language models (LCLMs) offer the promise of “reading” large swaths of information all at once. But how well do they reason over this data?
In “Holistic Reasoning with Long-Context LMs,” we introduce HoloBench, a benchmark designed to evaluate whether LCLMs can perform database-style operations—like filtering, aggregation, and comparison—on unstructured text. It’s a systematic framework that varies key factors (a toy construction sketch follows the list):
- Context length
- Information density
- Distribution of relevant information
- Query complexity
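To make that setup concrete, here is a minimal Python sketch, not the official HoloBench code, of how such an instance could be assembled: query-relevant facts are mixed with distractors at a chosen density and position, and the model is then asked a database-style aggregation question. All names, prompts, and numbers are illustrative assumptions.

```python
# Toy sketch (illustrative only): build a long context with controlled relevance
# density and placement, then pose an aggregation-style query over it.
import random

def build_context(relevant_facts, distractor_facts, density, position="uniform", seed=0):
    """Mix relevant and distractor facts so roughly `density` of the context is
    query-relevant, placed uniformly, at the start, or at the end."""
    rng = random.Random(seed)
    n_relevant = len(relevant_facts)
    n_distractors = max(0, int(n_relevant / density) - n_relevant)
    distractors = [distractor_facts[i % len(distractor_facts)] for i in range(n_distractors)]
    if position == "uniform":
        facts = relevant_facts + distractors
        rng.shuffle(facts)
    elif position == "start":
        facts = relevant_facts + distractors
    else:  # "end"
        facts = distractors + relevant_facts
    return "\n".join(facts)

relevant = [f"Product P{i} costs {100 + 10 * i} dollars." for i in range(5)]
distractors = [f"Employee E{i} joined the company in {2000 + i}." for i in range(50)]

context = build_context(relevant, distractors, density=0.1, position="uniform")
query = "What is the maximum price among all products mentioned above?"  # MAX aggregation
prompt = f"{context}\n\nQuestion: {query}\nAnswer:"
# `prompt` would then be sent to a long-context LM; the gold answer here is 140.
```

Varying `density`, `position`, the number of facts, and the query operator (MAX, COUNT, comparisons across facts) gives knobs that correspond to the factors listed above.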
The findings are eye-opening: LCLM performance is more sensitive to the amount of query-relevant information packed into the context than to the raw length of that context. Moreover, tasks requiring aggregation of multiple facts across the input lead to noticeable performance drops, especially as query complexity increases.
These insights reveal a critical bottleneck in current LCLMs and point to where future work must focus: not just expanding context windows but developing frameworks that facilitate more effective reasoning over large contexts.
This paper was presented at ICLR 2025.
Multi-Conditional Ranking with LLMs
Many tasks require multi-document reasoning in which items must be selected, ordered, and ranked against numerous, sometimes conflicting conditions. This is a common scenario in recommendation systems, policy generation, and any domain where tradeoffs must be made.
In “Multi-Conditional Ranking with Large Language Models,” we introduce MCRank, a benchmark that challenges LLMs to rank a set of items according to multiple conditions.
Baseline evaluations reveal a consistent trend: performance degrades rapidly as the number and complexity of conditions increase.
To address this, the paper proposes EXSIR, a decomposed reasoning approach (a minimal code sketch follows the list):
- Extract the ranking conditions
- Sort them into a logical order based on their priority
- Iteratively Rank items while reasoning through each condition
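As a rough illustration, here is a minimal Python sketch of that decomposition, assuming only a generic `llm(prompt) -> str` helper; the prompts and parsing are simplifications for readability, not the paper’s implementation.

```python
def exsir_rank(items, query, llm):
    """Rank `items` under a multi-condition `query` via an Extract-Sort-
    Iteratively-Rank style decomposition. `llm` is any callable mapping a
    prompt string to a completion string (an assumed interface)."""
    # 1) Extract: pull the individual ranking conditions out of the query.
    raw = llm("List each distinct ranking condition in this request, one per line:\n" + query)
    conditions = [c.strip("- ").strip() for c in raw.splitlines() if c.strip()]

    # 2) Sort: order the extracted conditions by priority.
    raw = llm(
        "Order these conditions from highest to lowest priority, one per line:\n"
        + "\n".join(conditions)
    )
    ordered = [c.strip("- ").strip() for c in raw.splitlines() if c.strip()]

    # 3) Iteratively Rank: refine the item ordering one condition at a time.
    ranking = list(items)
    for condition in ordered:
        raw = llm(
            "Re-rank the items below to best satisfy the condition, keeping earlier "
            "decisions where possible. Return one item per line.\n"
            f"Condition: {condition}\nItems (current order):\n" + "\n".join(ranking)
        )
        ranking = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
    return ranking
```

The point of the structure is that each call handles one condition at a time rather than asking the model to juggle every condition in a single prompt.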
EXSIR achieves up to a 14.4% performance gain over baseline models, significantly outperforming chain-of-thought prompting and standard ranking approaches.
The takeaway? Decomposition matters—and how we structure the reasoning process can make all the difference for complex multi-document tasks.
This paper was presented at the NAACL 2025 Main Conference.
From Single to Multi: Understanding Hallucinations in Multi-Document Summarization
As summarization moves beyond single documents, new challenges emerge—among them, the escalation of hallucinations. Unlike single-document settings, where grounding is relatively straightforward, multi-document summarization introduces ambiguity, conflicting facts, and a wider space for errors.
In “From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization,” we investigate how LLMs hallucinate when summarizing information across documents. Since no existing benchmark tackled this directly, we created two new benchmarks on top of an existing dataset: one based on news articles and another on multi-speaker conversations, each annotated with topic-specific insights.
The results are striking:
- Up to 75% of generated summaries contain hallucinated content.
- Hallucinations tend to cluster toward the end of summaries.
- Even when no relevant information exists, LLMs like GPT-4o generate convincing but fabricated summaries nearly half the time.
Manual evaluation of over 700 generated insights reveals the sources of hallucination: models often generalize excessively, ignore instructions, or fill in gaps with plausible-sounding but ungrounded content. While simple post-hoc filtering can reduce some errors, the deeper issues persist.
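For intuition, a post-hoc filter can be as simple as dropping summary sentences whose content words are barely supported by the source documents. The sketch below is an illustrative assumption using plain lexical overlap, not the specific method evaluated in the paper.

```python
# Naive grounding filter (illustrative): keep only summary sentences whose
# content words mostly appear somewhere in the source documents.
import re

def content_words(text):
    stopwords = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "that"}
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords}

def filter_ungrounded(summary_sentences, source_docs, min_overlap=0.5):
    source_vocab = set()
    for doc in source_docs:
        source_vocab |= content_words(doc)
    kept = []
    for sentence in summary_sentences:
        words = content_words(sentence)
        overlap = len(words & source_vocab) / max(len(words), 1)
        if overlap >= min_overlap:  # sentence is mostly supported by source vocabulary
            kept.append(sentence)
    return kept
```

A shallow check like this catches some blatantly unsupported sentences, which is consistent with the observation above that simple filtering helps only partially while the deeper issues persist.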
Our paper exposes an important limitation in today’s LLMs: as the input grows more complex, the line between synthesis and speculation becomes increasingly blurred.
This research was presented at NAACL 2025, Findings track.
Towards a More Holistic Understanding
Together, these three papers highlight the urgent need for more explicit, structured reasoning in LLMs handling multi-document tasks. Whether it’s aggregating insights, balancing competing priorities, or avoiding hallucinations, the underlying challenge is the same: enabling models to reason with clarity and control across large, unstructured contexts.
At Megagon Labs, solving these challenges is foundational to building trustworthy and capable AI systems. Our benchmarks and methods are now publicly available to encourage further research in this critical area.
Written by Pouya Pezeshkpour and Megagon Labs