Beyond the Single Document: Advancing Multi-Document Reasoning with LLMs

As language models scale in size and capability, so do our expectations for how they process and understand information. Today’s real-world problems rarely involve isolated paragraphs—they require reading, comparing, and reasoning over multiple documents, often of varying length, format, and quality.

We present three new papers that tackle a pressing and underexplored topic in NLP: multi-document reasoning. These works offer rigorous benchmarks, novel methodologies, and empirical insights into how large language models (LLMs) handle complexity across multiple sources of information.

Let’s take a closer look at what these contributions reveal—and how they collectively push the boundaries of multi-document understanding.

Holistic Reasoning with Long-Context LMs: Introducing HoloBench

Modern tasks demand more than retrieving a handful of relevant snippets; they require models to synthesize and reason across entire corpora. Retrieval-augmented generation (RAG) systems have taken us far in accessing relevant content, but they often stumble when asked to aggregate and make sense of information scattered across documents.
Long-context language models (LCLMs) offer the promise of “reading” large swaths of information all at once. But how well do they reason over this data?

In “Holistic Reasoning with Long-Context LMs,” we introduce HoloBench, a benchmark designed to evaluate whether LCLMs can perform database-style operations—like filtering, aggregation, and comparison—on unstructured text. It’s a systematic framework that varies key factors such as:

  • Context length

  • Information density

  • Distribution of relevant information

  • Query complexity
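
To make these factors concrete, below is a minimal sketch of how such an evaluation item might be assembled. The table rows, the verbalization template, and the query are hypothetical placeholders rather than HoloBench’s actual data or format; the point is only to show how information density, context length, and the placement of relevant facts can be varied independently.

```python
import random

def build_context(relevant_rows, distractor_rows, target_length):
    """Verbalize table rows into text, padding with distractors until the
    context reaches roughly target_length rows. Information density is the
    ratio of relevant rows to total rows; shuffling controls where the
    relevant information lands in the context."""
    rows = list(relevant_rows)
    distractors = list(distractor_rows)
    while len(rows) < target_length and distractors:
        rows.append(distractors.pop())
    random.shuffle(rows)
    return "\n".join(
        f"{r['name']} works in {r['dept']} and earns {r['salary']}." for r in rows
    )

# Hypothetical data: two query-relevant rows and two distractors.
relevant = [
    {"name": "Alice", "dept": "Sales", "salary": 70000},
    {"name": "Bob", "dept": "Sales", "salary": 65000},
]
distractors = [
    {"name": "Carol", "dept": "HR", "salary": 80000},
    {"name": "Dan", "dept": "IT", "salary": 90000},
]

context = build_context(relevant, distractors, target_length=4)
query = "What is the average salary of employees in the Sales department?"
prompt = f"{context}\n\nQuestion: {query}"
# The prompt is then fed to a long-context LM, and its answer is compared
# against the result of the equivalent database query (here: 67500).
```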

The findings are eye-opening: LCLM performance is more sensitive to the amount of query-relevant information packed into the context than to the sheer length of that context. Moreover, tasks requiring aggregation of multiple facts across the input lead to noticeable performance drops, especially as query complexity increases.
These insights reveal a critical bottleneck in current LCLMs and point to where future work must focus: not just expanding context windows but developing frameworks that facilitate more effective reasoning over large contexts.

This paper was presented at ICLR 2025.

Multi-Conditional Ranking with LLMs

Many real-world tasks require multi-document reasoning in the form of selecting, ordering, and ranking items based on numerous, sometimes conflicting conditions. This is a common scenario in recommendation systems, policy generation, and any domain where tradeoffs must be made.
In “Multi-Conditional Ranking with Large Language Models,” we introduce MCRank, a benchmark that challenges LLMs to rank a set of items according to multiple conditions.

Baseline evaluations reveal a consistent trend: performance degrades rapidly as the number and complexity of conditions increase.

To address this, the paper proposes EXSIR, a decomposed reasoning approach:

  1. Extract the ranking conditions
  2. Sort them into a logical order based on their priority
  3. Iteratively Rank items while reasoning through each condition
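
Below is a minimal sketch of how this decomposition might be wired together. The `call_llm` helper and the prompt wording are hypothetical stand-ins, not the paper’s actual prompts or interface; the sketch only illustrates how the three stages chain into a single pipeline.

```python
def exsir_rank(items, conditions, call_llm):
    """Decomposed multi-conditional ranking in the spirit of EXSIR:
    extract the conditions, sort them by priority, then rank the items
    one condition at a time. call_llm(prompt) -> str is a hypothetical
    model interface; the prompt wording is illustrative only."""
    # 1. Extract: restate each ranking condition explicitly.
    extracted = call_llm(
        "Restate each ranking condition on its own line:\n" + "\n".join(conditions)
    ).splitlines()

    # 2. Sort: order the extracted conditions by priority.
    ordered = call_llm(
        "Order these conditions from highest to lowest priority:\n"
        + "\n".join(extracted)
    ).splitlines()

    # 3. Iteratively rank: refine the ordering one condition at a time.
    ranking = list(items)
    for condition in ordered:
        response = call_llm(
            f"Re-rank the following items to satisfy: {condition}\n"
            + "\n".join(ranking)
        )
        ranking = [line.strip() for line in response.splitlines() if line.strip()]
    return ranking
```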

EXSIR achieves up to a 14.4% performance gain over baseline models, significantly outperforming chain-of-thought prompting and standard ranking approaches.
The takeaway? Decomposition matters—and how we structure the reasoning process can make all the difference for complex multi-document tasks.

This paper was presented at the NAACL 2025 Main Conference.

From Single to Multi: Understanding Hallucinations in Multi-Document Summarization

As summarization moves beyond single documents, new challenges emerge—among them, the escalation of hallucinations. Unlike single-document settings, where grounding is relatively straightforward, multi-document summarization introduces ambiguity, conflicting facts, and a wider space for errors.

In “From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization,” we investigate how LLMs hallucinate when summarizing information across documents. Since no existing benchmark tackled this directly, we created two new benchmarks on top of an existing dataset: one based on news articles and another on multi-speaker conversations, each annotated with topic-specific insights.

The results are striking:

  • Up to 75% of generated summaries contain hallucinated content.
  • Hallucinations tend to cluster toward the end of summaries.
  • Even when no relevant information exists, LLMs like GPT-4o generate convincing but fabricated summaries nearly half the time.

Manual evaluation of over 700 generated insights reveals the sources of hallucination: models often generalize excessively, ignore instructions, or fill in gaps with plausible-sounding but ungrounded content. While simple post-hoc filtering can reduce some errors, the deeper issues persist.
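
The paper does not prescribe a particular filter, but as a rough illustration of what post-hoc filtering can look like, the sketch below keeps a generated insight only if most of its content words appear somewhere in the source documents. The threshold and word-level matching are illustrative assumptions, not the authors’ method.

```python
def grounded(insight, source_docs, threshold=0.6):
    """Crude lexical grounding check: keep an insight only if most of its
    content words appear somewhere in the source documents."""
    content_words = {w.lower().strip(".,") for w in insight.split() if len(w) > 3}
    if not content_words:
        return True
    source_text = " ".join(source_docs).lower()
    hits = sum(1 for w in content_words if w in source_text)
    return hits / len(content_words) >= threshold

docs = ["The company reported record revenue in the third quarter."]
insights = [
    "Revenue hit a record in the third quarter.",       # grounded
    "The CEO announced layoffs across all divisions.",  # fabricated
]
kept = [s for s in insights if grounded(s, docs)]  # drops the fabricated insight
```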

Our paper exposes an important limitation in today’s LLMs: as the input grows more complex, the line between synthesis and speculation becomes increasingly blurred. 

This research was presented at NAACL 2025, Findings track.

Towards a More Holistic Understanding

Together, these three papers highlight the urgent need for more explicit, structured reasoning in LLMs handling multi-document tasks. Whether it’s aggregating insights, balancing competing priorities, or avoiding hallucinations, the underlying challenge is the same: enabling models to reason with clarity and control across large, unstructured contexts.

At Megagon Labs, solving these challenges is foundational to building trustworthy and capable AI systems. Our benchmarks and methods are now publicly available to encourage further research in this critical area.

Written By Pouya Pezeshkpour and Megagon Labs
