We’re excited to share that our paper on Tool-Induced Myopia, also known as TIM, has been accepted to the ACL 2026 main conference.
Tool Use Can Improve Accuracy Without Improving AI Reasoning
Tool-augmented LLMs can boost accuracy by calling external tools, such as code interpreters. But these gains don’t usually reflect improved reasoning.
Our study shows a surprising trade-off: while tool use improves final-answer accuracy by up to 19.3 points, it comes at the cost of degraded reasoning quality. Models often rely on tool outputs as substitutes for reasoning, producing responses that appear correct but lack coherent justification. In some cases, non-tool LLMs win up to 41.5% more often than tool-augmented models in pairwise comparisons of their reasoning.
Introducing PyMath
We introduce PyMath, a benchmark of 1,679 competition-level math problems, and a multi-dimensional evaluation suite to quantify the TIM effect. We also propose a framework to realign tool use, so models treat tools as evidence rather than substitutes, which improves both model accuracy and reasoning depth.
As LLM systems become increasingly agentic and tool-driven, understanding how tools affect reasoning is critical.
Read the Paper
To dive deeper into the research, explore our GitHub repo and the paper.
Research Authors: Farima Fatahi Bayat, Pouya Pezeshkpour, and Estevam Hruschka
Frequently Asked Questions
Does using tools make AI less reliable at reasoning?
Research shows that giving large language models access to external tools like code interpreters can actually hurt their reasoning ability, even when it improves accuracy. A study accepted to ACL 2026 found that tool use boosted final-answer accuracy by up to 19.3 points, but non-tool models won up to 41.5% more often in pairwise comparisons of reasoning processes. The issue is that models start leaning on tool outputs as shortcuts instead of working through problems step by step, producing answers that look correct but lack solid justification.
Why do LLMs get the right answer but explain it poorly?
When large language models have access to tools such as calculators, code execution, and search, they tend to offload the hard thinking to those tools. The result is a correct answer with weak or incoherent reasoning behind it, a pattern researchers call Tool-Induced Myopia. It happens because models learn to treat tool outputs as the answer itself rather than as supporting evidence to reason from. This is especially important as AI systems become more agentic, since a system that can't explain its decisions is harder to trust and debug.
Can AI tool use be improved without sacrificing reasoning quality?
Yes. The key is designing systems where tools support a model's reasoning rather than replace it. A realignment framework proposed in a recent ACL 2026 paper trains models to treat tool results as evidence within a broader chain of thought, not as a substitute for thinking. When tool use is structured this way, both accuracy and reasoning depth improve. Developers building tool-augmented AI should evaluate reasoning quality alongside accuracy, rather than merely measuring whether the final answer is correct.
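As a rough illustration of reporting both measures side by side, here is a minimal Python sketch. It is not the paper's evaluation suite: the EvalRecord type, its field names, the two helper functions, and the four sample records are all hypothetical, and it assumes you already have per-problem correctness labels plus a pairwise judgment of which model's reasoning was preferred.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRecord:
    """One problem's results (hypothetical schema, not from the paper)."""
    problem_id: str
    tool_correct: bool         # tool-augmented model's final answer was correct
    no_tool_correct: bool      # non-tool model's final answer was correct
    reasoning_preference: str  # pairwise judge verdict: "tool", "no_tool", or "tie"


def accuracy_gain_points(records: List[EvalRecord]) -> float:
    """Final-answer accuracy of the tool model minus the non-tool model, in points."""
    n = len(records)
    tool_acc = sum(r.tool_correct for r in records) / n
    no_tool_acc = sum(r.no_tool_correct for r in records) / n
    return 100.0 * (tool_acc - no_tool_acc)


def reasoning_win_rates(records: List[EvalRecord]) -> Dict[str, float]:
    """Percentage of problems on which each side's reasoning was preferred."""
    n = len(records)
    return {
        label: 100.0 * sum(r.reasoning_preference == label for r in records) / n
        for label in ("tool", "no_tool", "tie")
    }


# Made-up records for illustration only; these are not results from the paper.
records = [
    EvalRecord("p1", tool_correct=True, no_tool_correct=True, reasoning_preference="no_tool"),
    EvalRecord("p2", tool_correct=True, no_tool_correct=True, reasoning_preference="no_tool"),
    EvalRecord("p3", tool_correct=True, no_tool_correct=False, reasoning_preference="tool"),
    EvalRecord("p4", tool_correct=False, no_tool_correct=False, reasoning_preference="tie"),
]

gain = accuracy_gain_points(records)
win_rates = reasoning_win_rates(records)
print(f"Accuracy gain from tools: {gain:+.1f} points")
print(f"Reasoning win rates (%): {win_rates}")
```

In this toy data the tool-augmented model gains 25 accuracy points while its reasoning is preferred on only a quarter of the problems, which is the kind of gap an accuracy-only report would hide.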