ACL 2025 Highlights: Direction of NLP & AI

We share Megagon Labs’ key takeaways from ACL 2025 — highlighting the trends, debates, and breakthroughs shaping the future of NLP, agentic AI, and trustworthy evaluation.

The 63rd annual meeting of the Association for Computational Linguistics (ACL 2025) took place in Vienna, Austria, from July 27 to August 1, 2025, bringing together thousands of NLP researchers, practitioners, and industry leaders from around the world. As is tradition, Megagon Labs was proud to sponsor this year’s conference, which featured the generalization of NLP models as one of its central themes. Our researchers also participated actively, presenting their latest work: CypherBench and FactLens.

With over 5,000 participants and more than 3,100 papers, ACL 2025 was a hub of ideas and innovation. In this post, researchers at Megagon Labs highlight noteworthy trends shaping the future of NLP and AI, including generalization, agentic planning, and LLM-based evaluation. A comprehensive report from the program chairs is available for readers seeking additional insights.

Generalization of NLP Models

Luke Zettlemoyer’s keynote speech emphasized three crucial points: the significance of pre-training LLMs, the promise of tokenizer-free LLMs, and the potential of modular language models to improve flexibility and adaptability across domains.

Additionally, a panel discussion emphasized the importance of generalization for ensuring that models remain robust, reliable, and fair when making predictions on data that differs from their training set. Strong generalization is especially critical for real-world applications, where models are expected to exhibit human-like adaptability. Just as humans naturally generalize from prior experience, AI systems should strive to achieve a similar level of flexibility and consistency.

Agents & Planning

A significant portion of the work presented at the conference emphasized the rise and utility of agentic NLP systems. Examples included LLMs that do more than generate text: systems capable of planning, reasoning, and collaborating across multiple steps.

MegaAgent proposed a large-scale autonomous system designed to operate without predefined standard operating procedures (SOPs). Unlike earlier frameworks that rely heavily on rigid task specifications, MegaAgent showcased how agents can dynamically coordinate tasks, adapt to unexpected inputs, and distribute responsibilities among themselves. This flexibility makes it especially promising for open-ended, real-world applications.

Qiao et al. (2025) explored how agents can benefit from introspection and knowledgeable self-awareness. The proposed system includes mechanisms for fast thinking, slow (deliberative) reasoning, and external knowledge gathering, allowing agents to assess the adequacy of their own knowledge before acting. This kind of metacognition is key to building trustworthy agents that can recognize their limits.
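To make the idea concrete, here is a hypothetical sketch of such a dispatch policy. It is not the paper's implementation: the `self_assess` scoring function, the thresholds, and the route names are all illustrative stand-ins for a learned self-assessment mechanism.

```python
# Hypothetical sketch (not the paper's implementation): route a query to
# fast answering, deliberate reasoning, or external retrieval based on a
# toy estimate of the agent's own knowledge coverage.

def self_assess(query: str, known_topics: set[str]) -> float:
    """Toy self-knowledge score: fraction of query words the agent 'knows'."""
    words = query.lower().split()
    return sum(w in known_topics for w in words) / max(len(words), 1)

def dispatch(query: str, known_topics: set[str]) -> str:
    score = self_assess(query, known_topics)
    if score > 0.8:    # confident enough to answer directly (fast thinking)
        return "fast"
    if score > 0.3:    # partial knowledge: reason step by step (slow thinking)
        return "slow"
    return "retrieve"  # insufficient knowledge: gather external information
```

The key design point is that the routing decision happens before any answer is produced, so the agent can decline to answer from memory when its self-assessed knowledge is weak.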

Other research highlighted collaborative and orchestration capabilities in multi-agent systems, pointing to a future where LLM-powered agents not only act autonomously but also coordinate intelligently, both with each other and with humans in the loop, to solve complex tasks.

LLMs in the Age of Generation & Evaluation

Evaluation and data generation were also widely discussed topics at ACL 2025. A tutorial on synthetic data generation with LLMs highlighted the cost benefits, scalability, and efficacy of LLM-generated data. It surveyed several existing methods for generating synthetic data, including sampling-based generation, back-translation, transformation of existing data, human-AI collaboration, and symbolic generation. A significant focus was the curation and creation of diverse data, which improves the robustness and performance of models trained on it. The tutorial also elaborated on the limitations of LLM-generated data, specifically the gap in richness and quality compared to real-world data, and raised concerns about the legality and ethics of synthetic data.

Calderon et al. (2025) tackled the question of whether LLM-based evaluations can be statistically justified as replacements for human annotators. The paper proposed a rigorous framework that compares annotator consistency against LLM judgments, offering evidence that, under certain conditions, LLMs can serve as reliable stand-ins. This not only reduces costs but also accelerates evaluation cycles.
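A minimal building block for this kind of comparison is a chance-corrected agreement statistic between the LLM judge and human labels. The sketch below computes Cohen's kappa; note this is a standard statistic, not the specific statistical test the paper proposes.

```python
# Chance-corrected agreement between two annotators (e.g., a human and an
# LLM judge). Kappa = 1 means perfect agreement; 0 means chance level.
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    assert len(a) == len(b) and a, "need equal-length, non-empty label lists"
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_exp = sum((ca[l] / n) * (cb[l] / n) for l in labels)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```

In practice one would compute this per task and label set, and compare the LLM's agreement with each human annotator against the humans' agreement with one another.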

Another creative approach, proposed by Sandan et al. (2025), reframed evaluation as a tournament of pairwise comparisons. Instead of scoring outputs independently, models compete in head-to-head comparisons judged by an LLM. Over several rounds, the stronger outputs emerge, giving a more nuanced view of quality differences. This iterative framework proved effective across summarization, dialogue, and reasoning tasks, where subtle distinctions in quality often matter most.
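The core loop of such a scheme can be sketched as a round-robin over candidate outputs with a pluggable judge. This is a simplified illustration, not the paper's algorithm; the `longer_wins` judge below is a deterministic stand-in for an LLM that would be prompted with both candidates and asked for a preference.

```python
# Round-robin tournament over candidate outputs. `judge(x, y)` returns
# 0 if x wins the pairwise comparison and 1 if y wins.
from itertools import combinations
from typing import Callable

def tournament(outputs: dict[str, str],
               judge: Callable[[str, str], int]) -> list[tuple[str, int]]:
    wins = {name: 0 for name in outputs}
    for a, b in combinations(outputs, 2):      # every pair compared once
        winner = a if judge(outputs[a], outputs[b]) == 0 else b
        wins[winner] += 1
    return sorted(wins.items(), key=lambda kv: -kv[1])  # best first

# Stand-in judge for illustration; in practice this would call an LLM.
longer_wins = lambda x, y: 0 if len(x) >= len(y) else 1
```

One appeal of the pairwise framing is that judges tend to be more reliable at comparing two outputs than at assigning absolute scores, and the win counts induce a ranking without any calibrated scale.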

Text to SQL

Several works at ACL 2025 examined how direct preference optimization (DPO) can be made effective for Text2SQL by incorporating richer reasoning, especially via chain-of-thought (CoT) signals. Liu et al. showed that vanilla DPO applied to Text2SQL often fails or even degrades performance when models see only final SQL answers (i.e., no intermediate reasoning). By augmenting Text2SQL datasets with synthetic CoT solutions, they obtained consistent, significant improvements: CoT helps reduce reward hacking, sharpens the discrimination between correct and incorrect outputs, and improves scalability. Complementing this, Zhai et al. proposed ExCoT, a two-stage DPO framework with an off-policy DPO stage followed by an on-policy DPO stage. Experiments showed consistent gains: on BIRD and Spider, using LLaMA-3.1-70B, both stages improved over vanilla SFT.
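For readers unfamiliar with DPO, the per-pair loss these works build on can be sketched as follows. This is the standard DPO objective (Rafailov et al.), not the exact training code of either paper; in the CoT setting, the log-probabilities are computed over sequences that include the reasoning steps, not just the final SQL.

```python
# Standard DPO loss for a single preference pair. Inputs are sequence
# log-probabilities under the policy being trained and a frozen reference
# model; `beta` controls how far the policy may drift from the reference.
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))
```

The loss falls as the policy assigns relatively more probability to the chosen (correct) solution than the rejected one; augmenting pairs with CoT gives the margin more signal to separate them.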

Another axis of progress is multilingual text-to-query beyond English. The MultiTEND paper introduced a large multilingual benchmark (six languages: English, German, French, Russian, Japanese, Mandarin Chinese) for translating natural language into NoSQL queries. The questions, database schema, and database content were provided in multiple languages. They identified both lexical and structural challenges across languages — for example, differences in syntactic structure (e.g., word order) and schema linking difficulties (e.g., how fields, table names, and operators map across languages).

Our Takeaways & Further Direction

ACL 2025 highlighted how rapidly our field is evolving and pushing beyond language modeling toward generalization, agentic planning, and trustworthy evaluation. These are not only academic milestones but also critical foundations for real-world systems that must learn, reason, adapt, and collaborate with humans. 

For Megagon Labs, the discussions reinforced our strategic focus. Our work on CypherBench and FactLens demonstrates the importance of benchmarks that ensure robustness, transparency, and adaptability, which are essential for enterprise adoption. Likewise, our investment in Blue, our open-source framework for agentic workflows, aligns with the growing momentum around multi-agent systems and orchestration. By enabling agents to reason, plan, and collaborate in enterprise settings, we help operationalize ideas that took center stage at ACL 2025.

Equally important, debates on evaluation and synthetic data resonated with our belief that progress in AI depends not only on stronger models but also on better ways to measure and validate them. The community’s exploration of LLM-based evaluation, tournament-style comparisons, and synthetic data generation reflects the same challenges we face in scaling trustworthy AI for diverse, global applications. These are areas where we will continue to deepen our contributions, drawing on our expertise in data curation, evaluation frameworks, reasoning, and human–AI collaboration. As academic research, open-source efforts, and industry deployment grow increasingly intertwined, Megagon Labs will remain at this intersection, pushing NLP research forward while translating innovation into solutions that impact millions worldwide.
