EMNLP 2024 Highlights

EMNLP 2024 took place in Miami, USA, from November 12–16, with Megagon Labs as a sponsor. A team of our researchers actively participated in the conference, presenting their latest work. This year’s theme track was “Efficiency in Model Algorithms, Training, and Inference,” offering a platform for researchers to address critical challenges in optimizing model efficiency. Key areas of focus included quantization, data requirements, and model size. The conference attracted 4,100 participants and featured an impressive 3,100 papers across various tracks, including the main conference, findings, industry, demos, and workshops.

With the continued rise of large language models (LLMs), the most prominent topics at the conference included natural language processing (NLP) applications, resources and evaluation, and model interpretability and analysis. A comprehensive report from the program chairs is available for readers seeking additional insights. In this blog post, researchers at Megagon Labs highlight noteworthy research in areas of interest such as agentic systems, AI safety, and human-centered AI.

Agentic Systems

Interest in building LLM-powered agents has grown rapidly within the AI community. With their improved language understanding, generation, and reasoning capabilities, such agentic systems are proving valuable in a variety of end-to-end applications.

At EMNLP, several works used LLM-powered agentic systems to tackle tasks such as hallucination detection, cryptocurrency trading, no-code development and debugging tools, and harmful meme detection, among others. A few works highlighted the merits of human-agent collaboration for complex tasks such as reasoning and planning, while others proposed fully autonomous, synchronized agentic workflows. Other research demonstrated a toolkit that supports recursive multi-agent systems, in which agents flexibly delegate tasks to other agents. A large focus was also on improving LLM-based agentic systems through iterative or self-learning, as evidenced by the following papers: Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement; LLM-Based Offline Learning for Embodied Agents via Consistency-Guided Reward Ensemble; and METAREFLECTION: Learning Instructions for Language Agents using Past Reflections.

A tutorial on language agents broadly covered the prospects and risks of such systems. The tutorial highlighted how language agents differ fundamentally from previous AI agents through their unique ability to use language for both thought and communication. This dual use of language enables them to tackle complex tasks by leveraging intermediate reasoning, planning multiple steps ahead, and orchestrating different tools despite their heterogeneous interfaces. The tutorial also emphasized important challenges around grounding (ensuring agents can properly interpret language in specific contexts and environments) and the critical need for robust tool augmentation that minimizes unintended behaviors. While these capabilities create exciting opportunities for autonomous systems, the tutorial underscored key risks around hallucination, bias, and transparency that need careful consideration as language agents become more prevalent in real-world applications. An emerging focus is also on building multimodal agents, as proposed by OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents (Best Demo Paper Award), which integrates speech-to-text, emotion detection, and LLMs into a user-customizable agentic workflow.

Bayes in the GenAI Era

Keynote by Tom Griffiths. The rapid evolution of artificial intelligence has been fueled by breakthroughs in neural architectures, driving the development of intelligent systems. Yet, as these systems become more sophisticated, researchers face a new challenge: understanding how they operate. In his keynote talk titled “Bayes in the age of intelligent machines,” Tom Griffiths explored the role of abstract principles of intelligence, such as Bayes’ rule, in modern AI development.

In his talk, Tom introduced a novel perspective on Bayes’ rule, emphasizing its application as an abstract framework for understanding how agents solve problems. Drawing parallels with its role in cognitive science, Bayes’ rule acts as a tool to analyze and interpret the behavior of AI systems. This approach is especially relevant as we increasingly encounter intelligent systems whose inner workings remain opaque, making the task of an AI researcher more similar to that of a cognitive scientist studying human thought.

QUITE Benchmark. Schrader et al. (2024) aim to advance Bayesian reasoning in generative AI through a novel question-answering dataset, QUITE. Bayesian reasoning, which operates over uncertain premises and observations to infer probabilities, is a longstanding challenge in artificial intelligence. Existing datasets often simplify the problem by restricting tasks to binary random variables or constrained textual formats. Schrader et al. (2024) address these limitations with QUITE, a dataset designed to emulate real-world scenarios requiring nuanced probabilistic reasoning.

QUITE includes tasks involving three types of Bayesian reasoning: causal, evidential, and explaining-away. These tasks require models to estimate probabilities based on natural language premises, evidence, and questions. The dataset incorporates categorical random variables and complex interdependencies, moving beyond earlier datasets that relied on binary variables or textual ranking tasks.
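To make the three reasoning types concrete, here is a small, self-contained illustration in the classic rain/sprinkler/wet-grass style. The network structure and all probabilities are our own made-up values for exposition, not drawn from the QUITE dataset, and QUITE's actual tasks are posed in natural language over categorical variables rather than as explicit networks like this.

```python
from itertools import product

# Tiny Bayesian network: Rain -> WetGrass <- Sprinkler
# All priors and conditional probabilities are invented for illustration.
P_RAIN = 0.2
P_SPRINKLER = 0.5
# P(wet | rain, sprinkler)
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.05}

def joint(rain, sprinkler, wet):
    """Joint probability of one full assignment of the three variables."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    pw = P_WET[(rain, sprinkler)]
    return p * (pw if wet else 1 - pw)

def prob(query, evidence):
    """P(query | evidence) via exhaustive enumeration of all worlds."""
    num = den = 0.0
    for r, s, w in product([True, False], repeat=3):
        world = {"rain": r, "sprinkler": s, "wet": w}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(r, s, w)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

# Causal: reason from cause to effect.
causal = prob({"wet": True}, {"rain": True})
# Evidential: reason from an observed effect back to a cause.
evidential = prob({"rain": True}, {"wet": True})
# Explaining-away: observing an alternative cause (the sprinkler)
# lowers the belief that rain explains the wet grass.
explained_away = prob({"rain": True}, {"wet": True, "sprinkler": True})
assert explained_away < evidential
```

The explaining-away case is the subtle one: conditioning on the sprinkler reduces the posterior probability of rain even though the two causes are independent a priori, and it is exactly this kind of non-monotonic update that purely associative models tend to get wrong.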

Experimental results show that logic-based models outperform large language models across all reasoning types in QUITE, highlighting the limitations of current generative AI systems in probabilistic reasoning.

AI Safety

With the rapid advancement and growing prevalence of LLMs, ensuring their safe and ethical development is paramount. At EMNLP 2024, many studies addressed this imperative, tackling issues ranging from technical vulnerabilities to the ethical implications of widespread adoption.

One significant area of exploration was the stability of LLMs’ core components. Research on under-trained tokens, or “glitch tokens,” introduced effective methods to identify and mitigate these problematic tokens, which can lead to unintended behaviors. Other work on LLMs’ ability to detect and recover from “silent” tool failures proposed strategies for integrating tools more reliably across tasks like calculations and planning.

Another important theme was aligning LLM outputs with societal norms and values. Studies demonstrated that preference tuning using English data alone can dramatically reduce toxicity in multilingual LLMs, ensuring safer outputs across 17 languages. Other research examined age bias in LLMs, uncovering a tendency to align more closely with younger demographics and emphasizing the need for equitable value representation.

Human-centered AI

This year’s EMNLP continued to showcase exciting advancements in human-centered AI, with researchers exploring diverse areas such as enhanced LLM evaluation, the integration of multimodal human feedback, and the development of more nuanced conversational interfaces. While we can’t cover every noteworthy contribution, we’d like to highlight some particularly interesting work that caught our attention.

The paper “Global Reward to Local Rewards: Multimodal-Guided Decomposition for Improving Dialogue Agents” tackles the challenge of training dialogue agents using reinforcement learning. Traditionally, these agents receive a single reward at the end of a conversation, which makes it difficult to learn nuanced behaviors. This paper proposes a method to decompose this global reward into local, turn-level rewards by leveraging multimodal cues like facial expressions and gestures. This approach reflects a growing interest in incorporating human preference and feedback in different modalities to enhance language understanding and generation.
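The core idea of turning one end-of-dialogue reward into turn-level credit can be sketched as follows. This is our own simplified illustration, not the paper's actual method: we simply apportion the global reward in proportion to a hypothetical per-turn affect score (e.g., from a facial-expression or gesture classifier), whereas the paper's decomposition is learned from multimodal cues.

```python
def decompose_reward(global_reward, affect_scores):
    """Split a single dialogue-level reward into per-turn rewards.

    affect_scores: hypothetical per-turn signals in [0, 1] from a
    multimodal feedback model. Turns with stronger positive affect
    receive a larger share, and the shares sum to the global reward.
    """
    total = sum(affect_scores)
    if total == 0:  # no multimodal signal: fall back to an even split
        n = len(affect_scores)
        return [global_reward / n] * n
    return [global_reward * a / total for a in affect_scores]

# A 4-turn dialogue that earned a global reward of 1.0;
# turns 2 and 4 drew visibly positive reactions from the user.
local_rewards = decompose_reward(1.0, [0.1, 0.4, 0.1, 0.4])
assert abs(sum(local_rewards) - 1.0) < 1e-9
```

Even this crude proportional split shows why local rewards help: a reinforcement-learning agent can now attribute credit to the specific turns that elicited positive reactions instead of spreading one delayed signal uniformly across the whole conversation.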

The next paper, “Beyond Reference: Evaluating High Quality Translations Better than Human References,” addresses the limitations of traditional translation evaluation metrics that rely on human-generated references. As machine translation systems surpass human performance in certain scenarios, these metrics become inadequate. The paper introduces a residual scoring metric that treats the reference as “neutral” rather than “perfect” and considers the relative quality between the reference and candidates.

Finally, “Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers” focuses on improving conversational question-answering systems. The authors propose a novel query rewriting technique that adapts to the specific retriever used in the system. By aligning the rewriter with the retriever’s preferences, they achieve better retrieval performance without relying on extensive annotations.

Conclusion

EMNLP 2024 proved to be an exceptional gathering, showcasing groundbreaking advancements and fostering meaningful discussions in the ever-evolving field of NLP. We extend our gratitude to the organizers and participants for making this conference a resounding success. Inspired by the presentations and interactions, the Megagon Labs team has curated this blog post to spotlight key developments in agentic systems, AI safety, and human-centered AI. These trends underscore the innovative efforts researchers are pursuing to enhance the efficiency and impact of NLP technologies. To stay updated on the latest in NLP, machine learning, and AI, connect with us on LinkedIn, X, or Facebook.
