Less Is More for Long Document Summary Evaluation by LLMs

In the realm of text generation and summarization, evaluating generated summaries, especially for long documents, has always been a challenging task. Traditional methods often struggle with high computational costs and the “Lost-in-the-Middle” problem, where crucial information in the middle of long documents is frequently overlooked by the model. To address these challenges, we conducted a study on evaluating long-document summaries with an approach that not only significantly reduces evaluation costs but also aligns more closely with human judgments.

The Extract-then-Evaluate Method

The core of our approach, called “Extract-then-Evaluate,” lies in its simplicity and effectiveness. Instead of evaluating the summary against the entire long document, this method first extracts key sentences from the source document and then evaluates the summary based only on those extracted sentences. By doing so, it sidesteps the “Lost-in-the-Middle” problem and significantly reduces the computational resources required for evaluation.
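The two-step idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s implementation: it uses naive period-based sentence splitting, a simple word-count budget, and a hypothetical `build_eval_prompt` helper standing in for the actual LLM judging step.

```python
def lead_extract(document: str, budget: int = 50) -> str:
    """Step 1 (sketch): LEAD extraction — keep sentences in document order
    until a word budget is reached, so the judge only sees a short excerpt."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    picked, used = [], 0
    for sent in sentences:
        n = len(sent.split())
        # Stop once the budget would be exceeded (but always keep >= 1 sentence).
        if used + n > budget and picked:
            break
        picked.append(sent)
        used += n
    return ". ".join(picked) + "."

def build_eval_prompt(extracted_source: str, summary: str) -> str:
    """Step 2 (sketch): prompt the judge LLM with only the extracted
    sentences rather than the full long document."""
    return (
        "Source (extracted key sentences):\n" + extracted_source
        + "\n\nSummary:\n" + summary
        + "\n\nRate the summary's faithfulness from 1 to 5."
    )
```

Because the prompt contains only the short extract, the token cost of each LLM evaluation call stays roughly constant regardless of how long the original document is.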

Figure: The summary evaluation framework.

Key Contributions

  1. Cost-Effective Evaluation: By concentrating on key sentences, the method drastically cuts down the computational costs associated with evaluating long documents.
  2. Higher Correlation with Human Evaluations: This approach exhibits a higher correlation with human evaluations, making it a more reliable method for summary evaluation.

Experimentation and Results

We performed extensive experiments across various datasets, including arXiv, GovReport, PubMed, and SQuALITY. The experiments explored different sentence extraction methods, such as LEAD, ROUGE, BERTScore, and NLI, to determine the most effective approach for the Extract-then-Evaluate method. The results were promising, showing that the proposed method reduced evaluation costs and improved the alignment with human evaluations compared to existing automatic metrics.
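As a concrete illustration of one extraction strategy, the sketch below ranks source sentences by their similarity to the generated summary and keeps the top-k, restoring document order. It uses plain unigram overlap as a stand-in for a real ROUGE score, and period-based sentence splitting; both simplifications are ours, not the paper’s setup.

```python
def rouge_extract(document: str, summary: str, k: int = 2) -> str:
    """Rank source sentences by unigram overlap with the summary
    (a crude ROUGE-1-recall stand-in) and keep the top-k in document order."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    summary_tokens = set(summary.lower().split())

    def score(sent: str) -> float:
        tokens = sent.lower().split()
        return sum(t in summary_tokens for t in tokens) / max(len(tokens), 1)

    # Indices of the k highest-scoring sentences, re-sorted into document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                    reverse=True)[:k]
    return ". ".join(sentences[i] for i in sorted(ranked)) + "."
```

Swapping the `score` function for BERTScore or an NLI entailment score yields the other extraction variants compared in the experiments.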

Implications and Future Directions

Our study represents a notable progression in text generation evaluation. Its impact goes beyond academia, providing practical solutions for industries with a need for summarizing lengthy documents, including legal document analysis, medical report summarization, and news aggregation.

Looking ahead, the study opens up new avenues for further research, particularly in exploring more sophisticated sentence extraction methods and extending the approach to other forms of text generation tasks. Additionally, the study highlights the potential of leveraging large language models (LLMs) in a more cost-effective and accurate manner, suggesting a promising direction for future developments in AI and NLP technologies.


In conclusion, the Extract-then-Evaluate approach is a significant leap forward in evaluating long document summaries. By addressing key challenges such as high computational costs and the Lost-in-the-Middle problem, this method not only enhances the efficiency and accuracy of summary evaluations but also aligns closely with human judgments. As we move forward, it will be exciting to see how this approach can be further refined and applied across various domains, contributing to the advancement of text generation technologies.

For those interested in diving deeper into this approach, we have made the code available on GitHub, and we invite the community to engage with our work and contribute to developing more effective text-generation evaluation methods.


Stay tuned for more insights and breakthroughs in the world of natural language processing by following our blog.

Written by: Hayate Iso and Megagon Labs



