LLM & NLP

Breakthroughs in LLMs have shifted NLP from task-specific methods to a generalized, data-driven approach, revolutionizing both research and applications. Modern LLMs are increasingly being integrated with external tools, such as search engines, APIs, or symbolic reasoning systems, to tackle complex tasks that require specialized knowledge. However, their growing adoption has highlighted challenges in fairness, controllability, transparency, and explainability, qualities that are especially critical in domains such as HR, legal, finance, and healthcare.

At Megagon Labs, we strive to harness the potential of LLMs while addressing these limitations. Our research focuses on three key areas: 

  1. Understanding LLM Behavior and Limitations: Investigating how LLMs perform and the challenges they face in real-world production use cases.
  2. Advancing LLM Capabilities: Developing novel systems, hybrid neuro-symbolic approaches, and domain-specific innovations to enhance LLM performance.
  3. Robust Evaluation Methods: Creating effective methods to assess LLMs on complex, real-world tasks, ensuring their reliability and effectiveness in diverse applications.

Through these efforts, we aim to improve the quality, consistency, fairness, and truthfulness of AI solutions tailored for HR and related domains, driving impactful progress in both research and practical applications. Our work spans fundamental research, applied projects, and open-source contributions, ensuring that our innovations make a meaningful impact both within and beyond the lab.

Highlighted Projects

We benchmark retrieval-augmented LLMs to understand when retrieval enhances their performance and when it hinders it. Our insights contribute to the development of reliable, retrieval-augmented, language-model-based QA systems.

We investigate LLMs’ sensitivity in multiple-choice question answering, a task commonly used to study the reasoning and fact-retrieval capabilities of LLMs.

AmbigNLG

Addressing ambiguity in natural language generation (NLG) instructions by identifying unclear specifications and refining them for better output quality.

Less Is More

An innovative approach, “Extract-then-Evaluate,” for evaluating long-document summaries with LLMs that not only significantly reduces evaluation costs but also aligns more closely with human evaluations.

Related Publications

ACL
2026
Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of reasoning processes). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Codes and data are available at: https://github.com/megagonlabs/TIM.
ICLR
2026
As language models gain access to external tools via structured function calls, they become increasingly capable of solving complex, multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) provide insufficient control over factors such as the number of functions accessible, task complexity, and input size, and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG, where nodes are function calls and an edge between nodes represents one function consuming the output of another. Given a set of external function schemas, initial variable values, and a target variable, models must compose the correct call sequence to compute the target variable. FuncBenchGen allows users to precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding data leakage. We apply our FuncBenchGen framework to evaluate seven LLMs on tool-use tasks of varying difficulty. Reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming other models. Performance declines sharply as dependency depth increases. Furthermore, connected irrelevant functions prove especially difficult to handle. We find that strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., improving the success rate from 62.5% to 81.3% for GPT-5.
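To make the setup concrete, here is a minimal sketch of the core idea: tool use as traversal over a hidden function-dependency DAG, where the model sees only function schemas and must compose the call sequence that computes a target variable. This is an illustration under our own assumptions, not the released FuncBenchGen code; all function names and difficulty parameters below are hypothetical.

```python
# Hypothetical sketch of the FuncBenchGen idea (not the released code):
# cast tool use as traversal over a hidden function-dependency DAG.
import random

def make_task(n_funcs: int = 6, seed: int = 0):
    """Generate a synthetic tool-use task over a hidden dependency DAG."""
    rng = random.Random(seed)
    # Edge (i, j) with i < j means f_j consumes the output of f_i,
    # so index order is already a valid topological order.
    deps = {j: [i for i in range(j) if rng.random() < 0.5] for j in range(n_funcs)}
    # The model sees only the schemas (names and argument lists),
    # never the wiring of the underlying DAG.
    schemas = [
        {"name": f"f_{j}", "args": [f"out_f_{i}" for i in deps[j]] or ["x0"]}
        for j in range(n_funcs)
    ]
    target = f"out_f_{n_funcs - 1}"  # variable the model must compute
    return schemas, deps, target

def reference_solution(deps, target_idx):
    """Ground truth: the calls needed for the target, in executable order."""
    needed, stack = set(), [target_idx]
    while stack:
        j = stack.pop()
        if j not in needed:
            needed.add(j)
            stack.extend(deps[j])
    return [f"f_{j}" for j in sorted(needed)]

schemas, deps, target = make_task()
print(target, "requires:", reference_solution(deps, len(schemas) - 1))
```

Difficulty knobs such as graph size, dependency depth, and distractor functions would correspond here to `n_funcs`, the edge probability, and extra functions never needed for the target.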
SIGIR
2025
Businesses are increasingly overwhelmed by inquiries related to their services or products. Relying on human agents to handle inquiries via email results in higher costs and delayed responses, contributing to customer dissatisfaction. In response to these challenges, this pilot study leverages advancements in Large Language Models (LLMs) by proposing a fully automated method for generating a knowledge graph from unstructured data in help pages, which is then utilized to power a fully automated dialogue management system. By transitioning to a chat-based approach, our method aims to handle ambiguous, incomplete, or nonspecific inquiries more effectively and enhance customer satisfaction with tailored, natural responses. We also implement explicit safeguards to improve intent identification and prevent response hallucinations. We validate our proposal in the hotel industry, demonstrating that our knowledge-graph-based AI agent outperforms the baseline Retrieval-Augmented Generation (RAG) model in accuracy while facilitating more natural and coherent dialogues.
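As a toy illustration of the safeguard idea, an agent grounded in knowledge-graph triples can answer only what the graph supports and escalate otherwise, limiting hallucinated responses. This is a hypothetical sketch, not the paper's system; the triples and matching logic below are ours (the paper extracts the graph automatically from help pages).

```python
# Hypothetical toy, not the paper's system: answer inquiries only from
# knowledge-graph triples and escalate when nothing matches, as a
# simple guard against hallucinated responses.
TRIPLES = [
    ("check-in", "starts at", "3:00 PM"),
    ("parking", "costs", "$20 per night"),
]

def answer(inquiry: str) -> str:
    for subject, relation, obj in TRIPLES:
        if subject in inquiry.lower():
            return f"Our {subject} {relation} {obj}."
    return "I'm not sure about that; let me connect you with a human agent."

print(answer("What time is check-in?"))  # -> Our check-in starts at 3:00 PM.
```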
IWPT
2025
Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara
Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, in which universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, together with a simplified CoNLL-U-like output format. Our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.
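A rough sketch of how such a step-by-step prompt might be assembled is shown below; the paper's exact wording, tag inventory, and output format may differ, and `build_parse_prompt` is a name we made up for illustration.

```python
# Hypothetical prompt builder illustrating the step-by-step strategy:
# POS tagging first, then heads and labels, in a simplified
# CoNLL-U-like table. The paper's actual prompt and format may differ.
def build_parse_prompt(tokens: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}\t{tok}" for i, tok in enumerate(tokens))
    return (
        "Parse the sentence in two steps.\n"
        "Step 1: assign a Universal POS tag to every token.\n"
        "Step 2: for every token, give its head index (0 = root) "
        "and dependency label.\n"
        "Answer with tab-separated lines: ID, FORM, UPOS, HEAD, DEPREL.\n\n"
        f"Tokens:\n{numbered}\n"
    )

print(build_parse_prompt(["She", "reads", "books"]))
```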
DATAI Workshop - VLDB
2024
Chenjie Li, Dan Zhang, Jin Wang
Detecting semantic types of columns in data lake tables is an important application. A key bottleneck in semantic type detection is the availability of human annotation due to the inherent complexity of data lakes. In this paper, we propose using programmatic weak supervision to assist in annotating the training data for semantic type detection by leveraging labeling functions. One challenge in this process is the difficulty of manually writing labeling functions due to the large volume and low quality of the data lake table datasets. To address this issue, we explore employing Large Language Models (LLMs) for labeling function generation and introduce several prompt engineering strategies for this purpose. We conduct experiments on real-world web table datasets. Based on the initial results, we perform extensive analysis and provide empirical insights and future directions for researchers in this field.
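The following sketch shows the weak-supervision pattern the paper builds on: labeling functions vote on a column's semantic type, and their votes are aggregated into a training label. Here the functions are hand-written and aggregation is a simple majority vote; in the paper the labeling functions are generated by LLMs, and all names below are illustrative.

```python
# Minimal weak-supervision sketch for column-type detection: labeling
# functions vote (or abstain) per column; votes are majority-aggregated.
# Hand-written LFs shown for illustration; the paper generates LFs with LLMs.
from collections import Counter

ABSTAIN = None

def lf_looks_like_year(values):
    # Vote "year" if most cells are 4-digit numbers in a plausible range.
    hits = sum(v.isdigit() and 1800 <= int(v) <= 2100 for v in values)
    return "year" if hits >= len(values) * 0.8 else ABSTAIN

def lf_looks_like_email(values):
    hits = sum("@" in v and "." in v.split("@")[-1] for v in values)
    return "email" if hits >= len(values) * 0.8 else ABSTAIN

def weak_label(values, lfs):
    """Aggregate non-abstaining votes by simple majority."""
    votes = [lab for lf in lfs if (lab := lf(values)) is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

column = ["1999", "2004", "2021"]
print(weak_label(column, [lf_looks_like_year, lf_looks_like_email]))  # year
```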
EACL
2024
Large Language Models (LLMs) have shown promising performance in summary evaluation tasks, yet they face challenges such as high computational costs and the Lost-in-the-Middle problem where important information in the middle of long documents is often overlooked. To address these issues, this paper introduces a novel approach, Extract-then-Evaluate, which involves extracting key sentences from a long source document and then evaluating the summary by prompting LLMs. The results reveal that the proposed method not only significantly reduces evaluation costs but also exhibits a higher correlation with human evaluations. Furthermore, we provide practical recommendations for optimal document length and sentence extraction methods, contributing to the development of cost-effective yet more accurate methods for LLM-based text generation evaluation.
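Below is a minimal sketch of the Extract-then-Evaluate recipe, assuming a simple word-overlap heuristic for sentence extraction; the paper studies several extraction methods and document-length settings, and the prompt wording here is ours.

```python
# Rough sketch of Extract-then-Evaluate: shrink a long source document
# to its most summary-relevant sentences, then prompt an LLM to score
# the summary against that extract. Extraction heuristic and prompt
# are placeholders, not the paper's exact configuration.
def extract_key_sentences(document: str, summary: str, k: int = 10) -> str:
    """Keep the k source sentences sharing the most words with the summary."""
    summary_words = set(summary.lower().split())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    scored = sorted(
        sentences,
        key=lambda s: len(summary_words & set(s.lower().split())),
        reverse=True,
    )
    return ". ".join(scored[:k])

def build_eval_prompt(document: str, summary: str) -> str:
    extract = extract_key_sentences(document, summary)
    return (
        "Rate the summary's consistency with the source on a 1-5 scale.\n\n"
        f"Source (extracted sentences):\n{extract}\n\nSummary:\n{summary}\n"
    )

doc = "The hotel opened in 1999. It has 200 rooms. Breakfast is free."
summ = "A 1999 hotel with 200 rooms."
print(build_eval_prompt(doc, summ))
```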
Blog Posts

6 Min Read
November 20, 2025
Explore the key takeaways from COLM 2025, including breakthroughs in Reasoning & RL, Multimodal LLMs, and Retrieval & Embedding, as highlighted by Megagon Labs research scientists and engineers.
6 Min Read
November 7, 2025
“Mixed Signals” exposes hidden biases in VLMs, with major implications for healthcare, RAG systems, and AI safety.
5 Min Read
September 17, 2025
We share Megagon Labs’ key takeaways from ACL 2025 — highlighting the trends, debates, and breakthroughs shaping the future of NLP, agentic AI, and trustworthy evaluation.