The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) is the 30th and largest EMNLP conference to date. Gathering more than 6,000 participants, this edition presented work revolving around agentic systems, retrieval systems, improved interpretability of models, and alternative training methods. Our team contributed by presenting three papers during the conference: AIPOM, Mixed Signals, and Efficient Context Selection for Long-Context QA.
Keynotes
One of the keynotes presented during the conference, entitled “No more processing, it’s time to discover” by Dr. Heng Ji, examined how and why LLMs should be leveraged for scientific discovery. Current attempts to apply AI in scientific research still fall well short of expectations: the tools remain overly manual, slow to adapt, and costly, leaving scientists with far less support than anticipated. Dr. Ji argued that most existing models either discover nothing or produce results of little use, and introduced principles and techniques designed to make LLMs genuinely science-friendly. Such techniques are necessary because current LLMs do not follow the standard research pipeline: discoveries emerge from multi-step workflows rather than from directly outputting text. To illustrate this point, Dr. Ji noted that out-of-the-box LLMs struggle to understand molecules for a simple reason: the textual nomenclature can make related atoms appear distant and distant atoms appear related. Her team's work mitigates this by operating on a graph structure instead.
Another keynote, by Dr. Hannaneh Hajishirzi, addressed open LLM practices. Competing with proprietary models demands more than raw compute power; it demands efficient, scientifically grounded pipelines. She described how her team is exploring novel paradigms in data selection, model architecture, and training to build truly open LLMs, referencing works such as OLMo and the Tulu variants. Dr. Hajishirzi went on to explain that this openness has enabled others to build on the work, inspiring additional scientific publications.
Information Extraction and Retrieval
General research directions in extraction and retrieval focused on enhancing RAG systems, developing efficient and robust retrieval mechanisms (especially for long contexts and specific entities), and leveraging multi-agent and knowledge graph (KG) approaches.
Here are some standouts that illustrate the evolving research trends present at the conference.
- ACC-RAG (Adaptive Context Compression for Retrieval-Augmented Generation) utilizes “smart skimming” to dynamically adjust the amount of information processed from retrieved documents based on query complexity, mirroring human reading behavior to enhance efficiency. This method preprocesses documents into multi-granular embeddings offline and uses a context selector to decide how many compressed embeddings are needed online, stopping once sufficient context is gathered.
- RTE-GMoE introduces a graph-based mixture-of-experts framework that integrates entity recognition, relation extraction, and triplet extraction, enabling mutual reinforcement between subtasks rather than the typical cascading-error pipeline.
- Database-Augmented Query Representation for Information Retrieval enhances retrieval by enriching user queries with structured metadata from relational databases and encoding them through graph-based set representations, leading to significant improvement in short-query and entity-centric retrieval scenarios.
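ACC-RAG's online selection step can be pictured with a minimal sketch. The ranking heuristic, sufficiency estimate, and threshold below are illustrative stand-ins of our own, not the paper's actual selector:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_context(query_emb, chunk_embs, threshold=0.9, max_chunks=8):
    # Rank compressed chunk embeddings by similarity to the query, then add
    # them one by one until an estimated "sufficiency" score crosses the
    # threshold -- mimicking ACC-RAG's early stopping behavior.
    ranked = sorted(range(len(chunk_embs)),
                    key=lambda i: cosine(query_emb, chunk_embs[i]),
                    reverse=True)
    selected, coverage = [], 0.0
    for idx in ranked:
        if coverage >= threshold or len(selected) >= max_chunks:
            break
        selected.append(idx)
        # toy sufficiency estimate with diminishing returns
        coverage += max(cosine(query_emb, chunk_embs[idx]), 0.0) * (1.0 - coverage)
    return selected
```

Simple queries terminate after one or two chunks, while harder ones keep pulling in context up to the cap, which is the efficiency win the paper targets.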
In addition to these, the rise of agent-based systems is increasingly evident in works like AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis. AgentMaster demonstrates how a modular multi-agent architecture can dynamically coordinate reasoning, query decomposition, tool invocation, and routing for multimodal retrieval and answer generation. While not necessarily novel, it underscores the importance of multi-agent systems and their prevalence across all areas of LM tasks.
Interpretability
The focus of the presented works shifted from simply devising new explanation methods to understanding the internal mechanics of large language models. For example, in Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks, the authors showed that randomly zeroing out up to half of the embedding dimensions across six encoders and 26 tasks yields less than a 10% drop in performance. This finding suggests that large embedding spaces are significantly over-parameterized. In The Role of Outgoing Connection Heterogeneity in Feedforward Layers of Large Language Models, the authors probed feed-forward (MLP) layers of LLMs and discovered that neurons with diverse outgoing connection strengths mattered far more than uniform ones. In the paper, they introduced a fine-tuning loss to exploit this, improving performance and enabling structured pruning of neurons. Another paper, Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers, presents a novel probe that decodes numeric values from input embeddings with near-perfect accuracy, showing that even though language models make arithmetic mistakes, their underlying numeric representations are highly precise, and aligning to that structure helps reduce errors. Together, these works reflect a trend toward deep transparency in model internals.
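The dimension-dropping finding is easy to reproduce in spirit. The toy setup below (random embeddings with synthetic query-document pairs) is ours, not the paper's six-encoder benchmark, but it shows the same qualitative effect:

```python
import numpy as np

def mask_dimensions(embs, drop_frac=0.5, seed=0):
    # Zero out a random subset of embedding dimensions, using the same
    # mask for queries and documents, as in the paper's ablation.
    rng = np.random.default_rng(seed)
    keep = rng.random(embs.shape[1]) >= drop_frac
    return embs * keep

def retrieval_accuracy(queries, docs):
    # Fraction of queries whose nearest document (by cosine similarity)
    # is the correct one, assuming query i matches document i.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return float(((q @ d.T).argmax(axis=1) == np.arange(len(queries))).mean())
```

On high-dimensional embeddings, halving the dimensions barely moves nearest-neighbor accuracy, consistent with the over-parameterization claim.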
A collection of research focused on understanding LLMs even better. The authors of Large Language Models Do Multi-Label Classification Differently investigated how LLMs perform multi-label classification. They found that LLMs' next-token probability rankings do not correlate with human classifications for the second most probable token and beyond, and hypothesized that pretraining on next-token prediction is the reason for this behavior.
Other noteworthy themes at this edition of the conference included uncertainty, with the conference also hosting the second iteration of the workshop dedicated to uncertainty quantification. The work Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models employed attention mechanisms to quantify uncertainty within LLMs. The authors of A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs proposed training an uncertainty quantification head concurrently with the text generation head. Model training was another significant area of focus: Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning introduced a method to mitigate catastrophic forgetting through selective pruning of parameter updates.
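Conceptually, a "head to question" is just a second, small prediction head reading the same hidden states as the generation head. A minimal sketch, where the parameters `W` and `b` are hypothetical stand-ins for the pre-trained uncertainty head:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def uncertainty_head(hidden_states, W, b):
    # A second, linear head on the decoder's hidden states that outputs a
    # per-token hallucination probability alongside the usual LM head.
    # In the paper this head is trained concurrently with generation;
    # here W and b are simply given.
    return sigmoid(hidden_states @ W + b)
```

Because the head shares the backbone's representations, it adds almost no inference cost on top of normal decoding.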
Multimodality and Language Grounding to Vision
While work on multimodality spans many types of media such as images, video, and audio, the most relevant insights for text-centered NLP come from studying how vision and language interact. For example, Can VLMs Recall Factual Associations From Visual References? demonstrates that when models must rely on an image rather than a textual name, their ability to recall facts drops dramatically, highlighting weak visual grounding. Similarly, Mixed Signals: Decoding VLMs’ Reasoning and Underlying Bias in Vision-Language Conflict reveals that these models often prioritize one modality over the other, exposing biases in how they combine visual and textual cues. While the multimodal landscape is rapidly expanding to include richer media types, these works offer important food for thought when integrating multimodal signals into text-centric systems, especially around grounding, balance, and reliability.
Computational Social Science
In this track, researchers dug into how language models behave in social, economic, and decision-making contexts. The paper Evaluating and Aligning Human Economic Risk Preferences in LLMs assesses whether LLMs adapt their risk-seeking or risk-averse behavior to different personas and proposes alignment methods to better match human economic rationality. Such work underscores an emerging trend: it is important to view LLMs as social and economic agents whose choices and biases matter in real-world deployments.
AI Agents and Planning
The conference featured numerous contributions concerning AI/LLM agents.
The authors of Tool Preferences in Agentic LLMs are Unreliable by Faghih et al. found that LLM agents’ tool selection processes are unreliable, especially when choosing between tools with similar functionality descriptions. This paper demonstrates that strategically editing descriptions (e.g., adding cues or claiming active maintenance) could boost tool usage drastically. This bias underscores the need for a more reliable foundation for tool selection, such as grounding on historical usage data.
Within the domain of planning, one particular study A Good Plan is Hard to Find by Balepur et al. suggested that aligning models with user preference for planning may not be the optimal approach, as users’ notions of the most effective path to a solution may prove inaccurate.
On dialogue planning, One Planner To Guide Them All! by Dao et al. proposed PADPP, an adaptive policy planner that adjusts dialogue policy without retraining the underlying model; a single training run allows it to adapt to multiple goal-oriented dialogue objectives.
Select-Then-Decompose by Liu et al. systematically categorized LLM task decomposition behaviors across different dimensions including decomposition-first vs. interleaved execution, implicit vs. explicit LLM invocation, and DAG vs. linear plan structure. By analyzing representative models across these dimensions, the study confirmed a performance-cost dilemma and revealed that task characteristics determine the optimal decomposition strategy. The authors also proposed “Select-Then-Decompose,” an adaptive strategy that dynamically selects the most suitable decomposition approach.
Human-AI
Mysore et al. explored how people interact with large language models during real-world writing tasks via large-scale analysis of actual conversation data. They identified common human-LLM collaboration patterns, such as restating requests, while noting that providing explicit feedback is rare. They also showed that user behavior differs by intent—for example, requesting multiple outputs for brainstorming or adding lengthy text to personalize responses. Users treated LLMs as “thought partners,” rather than merely as assistants.
AI Chatbots as Professional Service Agents by Li et al. proposed LAPI, a framework that gives LLM agents a well-defined professional identity. Its task planning is grounded in health behavior theory, while an entropy-based mechanism ensures the agent’s responses follow the pragmatic rules expected of real healthcare professionals. This approach represents a step forward in transforming generic chatbots into specialized, professional AI assistants.
RL and Training
Among works on RL and training, some focused specifically on the attention mechanism. KLAAD by Kim et al. proposed an attention-based debiasing framework that aligns attention distributions between stereotypical and anti-stereotypical sentence pairs to reduce social bias in generative language models. The authors demonstrated that this method improves fairness metrics across several models and benchmarks.
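The core of such attention alignment can be expressed as a divergence between the two attention maps. The symmetric KL below is a simplified stand-in for KLAAD's actual objective, assuming each row of the maps is already a normalized attention distribution:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_alignment_loss(attn_stereo, attn_anti, eps=1e-12):
    # Symmetric KL divergence between the attention maps produced for a
    # stereotypical sentence and its anti-stereotypical counterpart.
    # Rows are attention distributions (each sums to 1); minimizing the
    # loss pushes the model to attend to both variants the same way.
    p = np.clip(attn_stereo, eps, None)
    q = np.clip(attn_anti, eps, None)
    kl_pq = (p * np.log(p / q)).sum(axis=-1).mean()
    kl_qp = (q * np.log(q / p)).sum(axis=-1).mean()
    return 0.5 * (kl_pq + kl_qp)
```

In training, a term like this would be added to the usual language-modeling loss so that debiasing does not come at the cost of fluency.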
Similarly, ATTUN by Bussotti et al. focused on improving the explainability and robustness of fact-checking models by indirectly refining attention through an additional module.
Model unlearning remains a popular topic, as evidenced by several papers presented on it at the conference. OBLIVIATE by Xu et al. introduced a masked loss for the forget set, together with distillation and world-fact losses for the retain set. Because the aggressive forget loss can provoke catastrophic forgetting, the two retain-set losses serve to counterbalance it.
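The interplay of these losses can be sketched as follows. The exact formulations here are illustrative simplifications of our own, not OBLIVIATE's actual objectives:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_forget_loss(logits, forget_ids):
    # Penalize the (log-)probability mass assigned to forget-set tokens;
    # minimizing this drives their probability toward zero.
    probs = softmax(logits)
    return float(np.log(probs[..., forget_ids].sum(axis=-1) + 1e-12).mean())

def retain_distill_loss(student_logits, teacher_logits):
    # KL(teacher || student) on retain-set examples, preserving the
    # original model's behavior and counterbalancing the forget loss.
    p, q = softmax(teacher_logits), softmax(student_logits)
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1).mean())
```

A training step would minimize a weighted sum of the forget term and the retain terms, with the weights controlling how aggressively the model forgets.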
LLM Safety
Finally, some works focused on LLM behavior, with safety a prominent subtopic. Ding et al. presented TombRaider, a dual-agent jailbreak framework that circumvents safety mechanisms by leveraging historical knowledge about harmful figures and actions: an inspector agent locates such references, and an attacker agent then prompts the model for detailed harmful information. Across six state-of-the-art LLMs, the method achieves nearly 100% attack success on undefended models and remains highly effective even against safety-enhanced ones. Similarly, DAMON by Zhang et al. proposed a multi-turn jailbreak attack framework that uses Monte Carlo Tree Search to dynamically explore conversational paths and identify sub-instruction sequences that guide the LLM toward answering the original harmful question.
Research Direction & Our Contributions/Aspirations
EMNLP shows growing interest in multi-agent systems, and Blue, our open-source framework for agentic workflows, aligns with that growth area. Improving the trustworthiness of systems likewise corresponds with our recent work. These areas and related topics, such as reasoning and human-AI collaboration, are key domains that Megagon Labs continues to investigate and promote in its leading-edge NLP research.
Written by Moin Amin-Naseri, Jean-Flavien Bussotti, and Hannah Kim