COLM 2025 Highlights & Research Direction

Explore the key takeaways from COLM 2025, including breakthroughs in Reasoning & RL, Multimodal LLMs, and Retrieval & Embedding, as highlighted by Megagon Labs research scientists and engineers.

COLM 2025 touched down in Montréal, Canada, on October 7–10, 2025, bringing together a rapidly expanding community and a record 418 accepted papers. The program felt like a snapshot of where the field is heading next: reinforcement learning for reasoning and self-improvement, genuine multimodal models that use pixels rather than just describe them, and a renewed focus on data curation, pretraining dynamics, and live, drifting evaluations. Alongside these, we saw strong pushes on retrieval and embedding geometry, efficiency and compression, controlled generation, and new benchmarks that measure what actually matters. In this blog post, researchers at Megagon Labs highlight these developments and the papers that shaped them at COLM.

Best Papers (spotlights and awards)

Several research clusters emerged among the top papers at COLM 2025. Reinforcement learning and self-improvement received significant attention, with work examining biases in RL optimization methods and identifying cognitive behaviors essential for effective self-improvement. Multimodality was another major focus, with papers investigating the interpretability of multimodal embedding spaces, diagnosing why VLMs underutilize their visual representations, and exploring how language models align with brain regions representing cross-modal concepts. Data curation and pretraining emerged as a third theme, including investigations into how models learn factual knowledge, the origins of cognitive biases in pretraining, and novel multilingual data processing pipelines.

Beyond these clusters, the top papers reflected remarkable diversity across the field, touching on model efficiency and compression, novel evaluation methodologies and benchmarks, controlled generation techniques, and the geometric structure of embedding spaces.

Retrieval and Embedding

The COLM 2025 conference showcased significant advances in embedding and retrieval techniques, reflecting a strong trend towards enhancing the synergy between large language models and retrieval systems. Several papers focused on creating dedicated embedding models for retrieval tasks, such as ReasonIR, which used a novel synthetic data generation method to create a model specialized in retrieval for reasoning-based tasks, and CodeXEmbed, which achieved state-of-the-art results in multilingual code retrieval through novel contrastive learning.
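Contrastive training of this kind typically optimizes an in-batch InfoNCE objective: each query is pulled toward its paired document while every other document in the batch serves as a negative. As an illustration of the mechanics only (not CodeXEmbed's exact recipe; the temperature and batch setup here are assumptions), a minimal NumPy sketch:

```python
import numpy as np

def info_nce_loss(queries, docs, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss: the positive for query i is
    docs[i]; all other docs in the batch act as negatives."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = (q @ d.T) / temperature                 # (B, B) scaled cosine similarities
    # log-softmax over each row; the diagonal holds the positive pairs
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy on the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss_random = info_nce_loss(q, rng.normal(size=(4, 8)))   # unrelated pairs
loss_aligned = info_nce_loss(q, q)                        # perfectly matched pairs
print(loss_aligned < loss_random)
```

Driving this loss down forces matched query–document pairs to dominate the similarity matrix's diagonal, which is why matched pairs score a much lower loss than random ones.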

Another major research thrust focused on understanding the geometric and semantic structure of embedding spaces. Work on vision-language models used sparse autoencoders to uncover stable, sparse linear “concept” directions within their embedding spaces, revealing shared cross-modal semantics. Complementing this, research on language model geometry demonstrated that models within the same family share both global and local geometric structures in their token embeddings and proposed a method to translate steering vectors across families.
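For readers unfamiliar with the technique, a sparse autoencoder maps model activations into an overcomplete dictionary through a sparsity-inducing nonlinearity, so each active unit becomes a candidate linear concept direction. A toy forward pass with random, untrained weights (dimensions and bias value are illustrative assumptions, not any paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 16, 64           # overcomplete dictionary of concept candidates

W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.full(d_dict, -0.05)     # negative bias pushes codes toward sparsity
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))

def sae_forward(x):
    """Encode an activation vector into a sparse nonnegative code, then
    reconstruct it. Training would minimize reconstruction error plus an
    L1 penalty on z; each active unit then corresponds to a candidate
    concept direction (a row of W_dec)."""
    z = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU keeps only a subset active
    return z, z @ W_dec

x = rng.normal(size=(1, d_model))  # stand-in for a model embedding
z, x_hat = sae_forward(x)
print(z.shape, float(np.mean(z > 0)))
```

After training on real embeddings, inspecting which inputs activate each dictionary unit is how such work identifies interpretable cross-modal concepts.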

Reasoning and RL

Reasoning with RL at COLM is less about chasing a single reward and more about shaping how models think. As mentioned earlier, the momentum here is clear—this line of work even earned a Best Paper award. Beyond that, a careful look at R1-Zero–style training dissects GRPO’s optimization quirks and offers fixes while still posting strong results, clarifying where pure-RL gains come from—and where they crack. In parallel, L1/LCPO introduces length-controlled policy optimization: you train models to target a desired “thinking length,” allowing you to trade compute for accuracy and even keep short-reasoning policies competitive. Together, these papers elevate questions like “how long to think?” and “when to decide?” to first-class RL objectives rather than happy accidents.
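The core idea behind length-controlled optimization can be sketched with a toy reward that scores correctness minus a penalty for deviating from a requested token budget. This is a simplified illustration; the penalty shape and coefficient below are assumptions, not the paper's exact objective:

```python
def length_controlled_reward(correct, num_tokens, target_tokens, alpha=1e-3):
    """Correctness reward minus a penalty proportional to the deviation
    from the requested reasoning-length budget. alpha sets how strongly
    the policy is pushed to respect the budget."""
    return float(correct) - alpha * abs(num_tokens - target_tokens)

# A correct but budget-busting trace is outscored by a correct,
# on-budget one, so the policy learns to hit the requested length:
on_budget = length_controlled_reward(True, 500, target_tokens=512)
overshoot = length_controlled_reward(True, 4000, target_tokens=512)
print(on_budget > overshoot)  # True
```

Because the budget is part of the prompt at training time, the same policy can later be asked to "think" for 512 or 4,000 tokens, trading compute for accuracy on demand.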

A second thread focuses on the “data and feedback economy” behind RL for reasoning. Off-Policy Corrected Reward Modeling (OCRM) counters the off-policy drift common in RLHF by using importance weighting, tightening reward fidelity without collecting new labels, and improving downstream policies. Active Exploration turns preference querying into an active contextual dueling bandit problem, boosting label efficiency in both online and offline alignment. And a synthetic-data + multi-step RL pipeline trains step-decomposed trajectories for tool use and multi-hop QA, yielding broad gains and cross-task transfer—evidence that multi-step alignment can scale with less human supervision. Despite the tremendous progress in RL and reasoning showcased at COLM, many open problems remain, including building judge-resistant rewards for stepwise reasoning, stabilizing long-horizon RL in language, controlling test-time compute with quality guarantees, transferring step-policies across domains and tools, and establishing standardized multi-step evaluations that discourage reward hacking.
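The importance-weighting idea behind off-policy correction can be sketched as a Bradley–Terry preference loss where each pair is reweighted by the ratio of its probability under the current policy versus the stale policy that generated the labels. A schematic sketch under those assumptions, not OCRM's full estimator (the clipping value is an illustrative choice):

```python
import numpy as np

def weighted_pair_loss(r_chosen, r_rejected, logp_new, logp_old, clip=5.0):
    """Bradley–Terry preference loss with importance weights pi_new/pi_old
    (clipped to bound variance), steering the reward model toward the
    current policy's distribution instead of the data-collection policy's."""
    w = np.minimum(np.exp(logp_new - logp_old), clip)   # importance weights
    nll = np.log1p(np.exp(-(r_chosen - r_rejected)))    # -log sigmoid(reward gap)
    return float(np.mean(w * nll))

# Pairs that are likelier under the current policy get upweighted:
base = weighted_pair_loss(np.array([1.0]), np.array([0.0]),
                          logp_new=np.array([0.0]), logp_old=np.array([0.0]))
shifted = weighted_pair_loss(np.array([1.0]), np.array([0.0]),
                             logp_new=np.array([1.0]), logp_old=np.array([0.0]))
print(shifted > base)  # True: weight e^1 vs 1
```

The appeal is practical: the same preference labels are reused, and only the weights change as the policy drifts, so reward fidelity improves without new annotation.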

Multimodal LLMs

How VLMs actually use (or ignore) pixels was another standout theme. As noted earlier, this isn’t just an anecdotal observation—the trend itself earned a Best Paper award at COLM. Complementing that view from the cognitive side, new evidence links LM predictions to brain areas that represent concepts consistently across modalities, suggesting internal representations closer to cross-modal meaning than to surface text patterns alone. Meanwhile, another paper argued that true multimodal in-context learning needs mechanisms that actively reallocate attention toward visual tokens; their DARA method and a purpose-built TrueMICL set show sizable gains when models must actually use the image context rather than just parrot text patterns. Together, these papers push beyond leaderboard bumps to ask whether models are genuinely integrating vision and language, and to instrument that integration with better probes, training signals, and evaluation design. 

The other half of the story is evaluation, alignment, and efficiency. MAC proposes a live benchmark built from thousands of recent journal covers (Nature/Science/Cell), probing visual–textual scientific reasoning as both the literature and models evolve; they also show a lightweight inference trick (DAD) that improves scores, underscoring how brittle cross-modal reasoning still is in the wild. For alignment, VaPR curates hard-negative preference data that avoids superficial style/length cues and measurably boosts VLM reasoning across diverse suites—useful signal if you care about robust visual grounding rather than yes/no bias. And efficiency isn’t just a “nice to have”: SmolVLM demonstrates sub-1GB inference footprints and competitive image/video understanding with 256M–2.2B-parameter models, pointing to practical, edge-ready multimodality. Despite tremendous progress in multimodal LLMs at COLM, several open problems remain: closing the vision-usage gap on vision-centric tasks (measuring and enforcing reliance on pixels over priors); scalable, live evaluations that track real-world drift (with versioned, regularly refreshed test sets); preference data that truly stresses visual reasoning, not style (counterfactual, image-dependent pairs); efficient architectures that keep accuracy under tight memory/latency budgets (adaptive routing and pruning); robust multimodal safety and jailbreak resistance; and broader modality/temporal grounding (audio/video with causal, time-aware reasoning).

Keynotes

COLM’s plenary lineup paints a clear picture of where language modeling is headed and why it matters. Luke Zettlemoyer makes the case for “mixed-modal” models that can fluidly generate text and images together, arguing we should move past “tokenize everything” and toward hybrid transformer–diffusion systems (like Chameleon and Transfusion) that unlock richer multimodal reasoning and tool use. His talk builds on the vision he shared in his ACL 2025 keynote earlier this year, which focused more on training data—a pair of talks well worth revisiting. Tom Griffiths brings a cognitive-science lens, showing how human limits can actually predict where today’s LLMs and VLMs will stumble—turning those “jagged edges” into testable hypotheses. We also encourage readers to revisit his EMNLP 2024 keynote, which nicely complements his COLM 2025 presentation. And Nicholas Carlini zooms in on risk, placing everyday harms, near-term misuse, and longer-term existential worries on a single mitigation continuum, and urging the community to tackle under-explored defenses. Taken together, these keynotes push for universal multimodal models, theory-guided evaluation, and risk-aware progress—not as competing goals, but as a combined roadmap. 

The program then widens to science and society. Shirley Ho introduces “Polymathic AI,” a foundation-model effort built for scientific data (not just text), designed to learn from heterogeneous, high-dynamic-range datasets and to carry useful priors for concepts like measurement, causality, and wave-like behavior—backed by new large-scale corpora such as the “MultiModal Universe” and “The Well.” Gillian Hadfield reframes alignment as a social problem, advocating norm-competent agents, reinforcement learning for discursive justification, and jury-style oversight that mirrors real institutions. Finally, an “Open LLMs in the Reasoning Era” panel—moderated by Sasha Rush with Azalia Mirhoseini, Junyang Lin, and Eric Wallace—asks how open models will compete (and collaborate) on reasoning, tooling, and benchmarks. The through-line is practical and optimistic: build models that reason across modalities, test them with human-grounded theory, govern them with real-world norms, and make them genuinely useful for science and society.

Conclusion

COLM 2025 was a standout event, rich with new ideas and lively debate across the fast-moving NLP landscape. We’re grateful to the organizers and attendees for an exceptional meeting. In this post, the Megagon Labs team highlights what stood out to us: advances in Retrieval and Embedding, Reasoning and RL, and Multimodal LLMs. Taken together, these threads point to how researchers are pushing for more capable, efficient, and useful NLP systems. For more updates on NLP, machine learning, and AI from our team, connect with us on LinkedIn and X.
