Giving large language models access to external tools such as code interpreters can hurt the quality of their reasoning even as it improves their accuracy. A study accepted to ACL 2026 found that tool-augmented LLMs boosted final-answer accuracy by up to 19.3 points, yet non-tool models won up to 41.5% more often in pairwise comparisons of reasoning processes. The problem is that models start leaning on tool outputs as shortcuts instead of working through problems step by step, producing answers that look correct but lack solid justification.