Forcing an elite AI to show its work actually makes its final answer worse.
High-performing models suffer accuracy drops when forced to write out their reasoning in the conversation history. While Chain-of-Thought prompting helps weaker models catch up, the most capable systems treat their own articulated reasoning as a distraction: the extra text becomes context noise that degrades the final answer. This challenges the common belief that structured reasoning is always a net positive for AI, and it suggests that developers building complex workflows should stop forcing top-tier models to think out loud when they want the best results.
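A minimal sketch of the ablation this recommendation implies: pose the same tasks twice, once with forced written reasoning and once without, and compare accuracy. The `ask` callable, the prompt wording, and the substring-match scoring below are illustrative assumptions, not details from the paper.

```python
from typing import Callable, Iterable, Tuple

def accuracy(
    ask: Callable[[str], str],          # any chat-completion client
    tasks: Iterable[Tuple[str, str]],   # (question, expected-answer) pairs
    force_reasoning: bool,
) -> float:
    """Score the same task set with or without forced written reasoning."""
    tasks = list(tasks)
    correct = 0
    for question, expected in tasks:
        if force_reasoning:
            # Articulated condition: reasoning text enters the context window.
            prompt = f"{question}\n\nThink step by step, then give the final answer."
        else:
            # Direct condition: answer without writing out any reasoning.
            prompt = f"{question}\n\nGive only the final answer."
        if expected in ask(prompt):
            correct += 1
    return correct / len(tasks) if tasks else 0.0

# Usage sketch; ask_model would wrap whatever chat API you use:
# with_cot    = accuracy(ask_model, tasks, force_reasoning=True)
# without_cot = accuracy(ask_model, tasks, force_reasoning=False)
```

On the article's account, the `force_reasoning=True` condition should win for weaker models and lose for the strongest ones.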
Structured Reasoning in LLM Optimization Agents: Scaffolding, Not Regularization
SSRN · 6655539
LLM-based optimization agents increasingly produce structured reasoning artifacts (hypothesis summaries, causal models, prediction logs) that persist across iterations. The assumption is that forcing articulation regularizes reasoning, as the self-explanation effect suggests it does for human learners. We test this assumption using SynthOracle, a family of synthetic multi-objective optimization oracles with known causal structure that enables separate measurement of optimization quality and reasoning quality.
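The abstract is the only description of SynthOracle available here, so the following is a hedged reconstruction of the idea rather than the paper's actual interface: an oracle whose causal structure is fixed and known in advance, so an agent's stated causal model can be graded separately from the quality of the points it proposes. Every name, objective, and functional form below is an illustrative assumption.

```python
import math

# Ground-truth causal graph: which inputs actually drive each objective.
CAUSAL_GRAPH = {"yield": {"x1", "x2"}, "cost": {"x2", "x3"}}

def evaluate(x1: float, x2: float, x3: float) -> dict:
    """A two-objective oracle with known structure (x3 never affects yield)."""
    return {
        "yield": math.exp(-((x1 - 0.3) ** 2 + (x2 - 0.7) ** 2)),
        "cost": 0.5 * x2 + 0.5 * x3,
    }

def optimization_score(proposal: dict) -> float:
    """Optimization quality: yield at the proposed point (the optimum is known
    by construction, at x1 = 0.3, x2 = 0.7)."""
    return evaluate(**proposal)["yield"]

def reasoning_score(claimed_graph: dict) -> float:
    """Reasoning quality: mean Jaccard overlap between the agent's claimed
    causal edges and the ground-truth graph, scored per objective."""
    scores = []
    for objective, true_parents in CAUSAL_GRAPH.items():
        claimed = set(claimed_graph.get(objective, set()))
        union = claimed | true_parents
        scores.append(len(claimed & true_parents) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```

Under this setup an agent can score high on optimization while misattributing causality, which is exactly the separation between optimization quality and reasoning quality that the oracle family is built to expose.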