Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM) and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use: the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Code and data are available at: https://github.com/megagonlabs/TIM.
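As a rough illustration of the kind of realignment the abstract describes (the paper's actual training recipe is not given here), the sketch below builds preference pairs in which the chosen solution integrates interpreter output as supporting evidence and the rejected one substitutes it for reasoning; the data schema and the `reasoning_score` judge are hypothetical.

```python
# Illustrative sketch (not the paper's implementation): building preference
# pairs that reward tool-as-evidence reasoning over tool-as-answer shortcuts.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str    # competition-level problem, with the Code Interpreter enabled
    chosen: str    # solution that justifies each step and cites tool output as evidence
    rejected: str  # solution that copies the interpreter result without justification

def build_pairs(problems, solutions_per_problem, reasoning_score):
    """Pair sampled tool-using solutions for the same problem by reasoning quality.

    `reasoning_score` is a placeholder for any judge of reasoning coherence
    (e.g., a pairwise LLM evaluator); it is not specified by the abstract.
    """
    pairs = []
    for problem, candidates in zip(problems, solutions_per_problem):
        ranked = sorted(candidates, key=reasoning_score, reverse=True)
        if len(ranked) >= 2 and reasoning_score(ranked[0]) > reasoning_score(ranked[-1]):
            pairs.append(PreferencePair(problem, ranked[0], ranked[-1]))
    return pairs
```

Pairs of this form could then be fed to any standard preference-optimization objective; the abstract does not commit to a specific one.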
Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.
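For intuition only, here is a minimal sketch of what a planner-defined subtask verification function (VF) might look like in Python; the subtask schema and passing criteria below are invented for illustration and are not taken from VeriMAP.

```python
import re

def vf_extract_dates(subtask_output: dict) -> tuple[bool, str]:
    """Hypothetical VF. Passing criteria (natural-language form): the subtask
    must return a non-empty list of ISO-8601 date strings under the key 'dates'."""
    dates = subtask_output.get("dates")
    if not isinstance(dates, list) or not dates:
        return False, "expected a non-empty list under 'dates'"
    iso = re.compile(r"^\d{4}-\d{2}-\d{2}$")
    bad = [d for d in dates if not isinstance(d, str) or not iso.match(d)]
    if bad:
        return False, f"non ISO-8601 entries: {bad}"
    return True, "ok"
```

A failed check of this kind is the sort of signal that would drive the iterative refinement the abstract refers to, without requiring external labels.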
Understanding user intent is essential for effective planning in conversational assistants, particularly those powered by large language models (LLMs) coordinating multiple agents. However, real-world dialogues are often ambiguous, underspecified, or dynamic, making intent detection a persistent challenge. Traditional classification-based approaches struggle to generalize in open-ended settings, leading to brittle interpretations and poor downstream planning. We propose RECAP (REwriting Conversations for Agent Planning), a new benchmark designed to evaluate and advance intent rewriting: reframing user-agent dialogues into concise representations of user goals. RECAP captures diverse challenges such as ambiguity, intent drift, vagueness, and mixed-goal conversations. Alongside the dataset, we introduce an LLM-based evaluator that assesses planning utility given the rewritten intent. Using RECAP, we develop a prompt-based rewriting approach that outperforms baselines in terms of plan preference. We further demonstrate that fine-tuning two DPO-based rewriters yields additional utility gains. Our results highlight intent rewriting as a critical and tractable component for improving agentic planning in open-domain dialogue systems.
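As a rough sketch of what prompt-based intent rewriting could look like in practice (the actual RECAP prompts and model interface are not given in the abstract; `call_llm` is a hypothetical stand-in for any chat-completion client):

```python
# Hypothetical rewriting prompt; RECAP's real prompt wording is not shown here.
REWRITE_PROMPT = """You are given a user-agent conversation. Rewrite it as a
single, self-contained statement of the user's current goal. Resolve
ambiguity, drop abandoned goals (intent drift), and keep all stated constraints.

Conversation:
{dialogue}

Rewritten intent:"""

def rewrite_intent(dialogue: str, call_llm) -> str:
    """Return a concise goal representation to hand to the downstream planner."""
    return call_llm(REWRITE_PROMPT.format(dialogue=dialogue)).strip()
```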
As language models gain access to external tools via structured function calls, they become increasingly capable of solving complex, multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) provide insufficient control over factors such as the number of accessible functions, task complexity, and input size, and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG, where nodes are function calls and an edge between nodes represents one function consuming the output of another. Given a set of external function schemas, initial variable values, and a target variable, models must compose the correct call sequence to compute the target variable. FuncBenchGen allows users to precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding data leakage. We apply FuncBenchGen to evaluate seven LLMs on tool-use tasks of varying difficulty. Reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming the others. Performance declines sharply as dependency depth increases, and connected irrelevant functions prove especially difficult to handle. We find that strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., improving GPT-5's success rate from 62.5% to 81.3%.
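To make the DAG abstraction concrete, the snippet below shows a tiny reference solver for this kind of task: a target variable is computed by calling functions whose arguments are either initial variables or the outputs of other functions. The schema and example are illustrative, not the released FuncBenchGen generator.

```python
# Illustrative reference solver for a function-dependency DAG task.
def resolve(target: str, functions: dict, values: dict):
    """Compute `target` by recursively calling functions over the hidden DAG.

    `functions` maps a variable name to (callable, [argument variable names]);
    `values` holds the initial variable values and caches computed ones.
    """
    if target in values:                      # initial variable, no call needed
        return values[target]
    fn, arg_names = functions[target]
    args = [resolve(a, functions, values) for a in arg_names]  # visit parents first
    values[target] = fn(*args)                # cache so shared parents run once
    return values[target]

# Tiny example: target 'c' depends on 'b', which depends on the initial value 'a'.
functions = {
    "b": (lambda a: a + 1, ["a"]),
    "c": (lambda b: b * 2, ["b"]),
}
print(resolve("c", functions, {"a": 3}))      # -> 8
```

The mitigation described at the end of the abstract would, roughly, amount to re-serializing the current `values` state into the prompt at every step so the model does not have to track it implicitly.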
Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce RePairTQA, a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality. Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries. Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.
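For illustration, a row-verbalization step of the kind the paired-table setup implies could be as simple as the following; the template is an assumption, since the abstract does not describe the pipeline's exact wording.

```python
# Minimal sketch: same cell content, different representation.
def verbalize_row(header: list[str], row: list[str]) -> str:
    """Turn one structured row into a sentence-like semi-structured field."""
    return "; ".join(f"{col} is {val}" for col, val in zip(header, row)) + "."

header = ["player", "team", "points"]
row = ["A. Smith", "Lakers", "31"]
print(verbalize_row(header, row))
# -> "player is A. Smith; team is Lakers; points is 31."
```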
Skill mapping is a key task in the Human Resources domain. It consists of identifying ontology-defined skills in job texts. Among the most successful approaches to skill mapping, bi-encoders offer efficient inference but struggle with fine-grained skill distinctions, particularly under limited supervision. While accurate, cross-encoder and LLM-based reranking approaches are computationally expensive and usually infeasible to adopt in real-world scenarios. We propose kNNBE, a hybrid inference method that augments bi-encoder similarity scores with k-nearest labeled sentences drawn from a synthetic memory bank. kNNBE improves both prediction accuracy and generalization to unseen skills while retaining high throughput. Extensive experiments on three benchmark datasets show that kNNBE rivals state-of-the-art rerankers in accuracy while being orders of magnitude faster.
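A hedged sketch of the hybrid scoring idea follows: interpolate the bi-encoder similarity with a k-nearest-neighbor vote over the labeled memory bank. The interpolation weight, the voting kernel, and the normalization are assumptions not specified in the abstract.

```python
import numpy as np

def knn_be_scores(query_emb, skill_embs, bank_embs, bank_labels, k=8, lam=0.5):
    """Return a combined score per skill for one query sentence embedding.

    skill_embs:  (S, d) normalized skill-name embeddings from the bi-encoder
    bank_embs:   (M, d) normalized embeddings of labeled memory-bank sentences
    bank_labels: (M,) integer array giving the skill index of each bank sentence
    lam:         interpolation weight (assumed, not from the paper)
    """
    be = skill_embs @ query_emb                  # bi-encoder cosine scores (S,)
    sims = bank_embs @ query_emb                 # similarity to memory bank (M,)
    top = np.argsort(-sims)[:k]                  # k nearest labeled sentences
    knn = np.zeros_like(be)
    np.add.at(knn, bank_labels[top], sims[top])  # similarity-weighted votes per skill
    if knn.max() > 0:
        knn = knn / knn.max()                    # rescale votes to [0, 1]
    return lam * be + (1 - lam) * knn            # interpolated skill scores
```

Only the memory-bank lookup is added on top of the bi-encoder pass at inference time, which is consistent with the claimed throughput advantage over rerankers.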