Tool-augmented Language Models (TaLMs)
can invoke external tools to solve problems
beyond their parametric capacity. However,
it remains unclear whether these tool-enabled
gains reflect trustworthy reasoning. Focusing
on the Code Interpreter tool, we show that even
when tools are selected and executed correctly,
TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct
but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and
study it using PYMATH, a benchmark of 1,679
competition-level mathematical problems for
which Python code is helpful but not sufficient.
We further develop a multi-dimensional evaluation suite to quantify reasoning degradation
in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs
achieve up to a 19.3 percentage point gain in
final-answer accuracy, their reasoning behavior
consistently deteriorates (e.g., non-tool LLMs
win up to 41.5% more often in pairwise comparisons of reasoning process). This degradation
intensifies with tool use; the more frequently a
model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity),
with TIM present in ~55% of high-risk cases.
Finally, we propose a preference-optimization-based framework that realigns TaLMs to use
tools as assistive evidence, improving both
final-answer accuracy and reasoning depth under tool use. Code and data are available at:
https://github.com/megagonlabs/TIM.