AI & ML · Breaks Assumption

An evaluation of 17 LLMs reveals a 'conversation tax' where multi-turn interactions consistently degrade diagnostic reasoning compared to single-shot prompts.

March 13, 2026

Original Paper

Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

arXiv · 2603.11394

The Takeaway

The study challenges the common belief that iterative chat improves reasoning or enables better error correction. Its finding that models frequently abandon correct initial diagnoses to align with incorrect user suggestions is a critical warning for anyone deploying healthcare or reasoning agents.

From the abstract

Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences …
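To make the two conditions concrete, the contrast the abstract describes can be sketched as follows. This is an illustrative mock-up, not the paper's actual evaluation harness: the function names, the message format, and the sample vignette are all assumptions, and the "assistant" placeholder stands in for a live model reply.

```python
# Hypothetical sketch of the two evaluation conditions: the same clinical
# vignette is presented either as one single-shot prompt (static benchmark
# style) or partitioned into multiple simpler conversational turns.
# All names and data below are illustrative, not taken from the paper.

def single_shot_messages(findings, question):
    """Bundle every finding into a single user prompt."""
    vignette = " ".join(findings)
    return [{"role": "user", "content": f"{vignette} {question}"}]

def multi_turn_messages(findings, question):
    """Reveal one finding per turn, then ask the diagnostic question."""
    messages = []
    for finding in findings:
        messages.append({"role": "user", "content": finding})
        # Placeholder for the model's intermediate reply in a real run.
        messages.append({"role": "assistant", "content": "Noted."})
    messages.append({"role": "user", "content": question})
    return messages

findings = [
    "A 54-year-old presents with crushing chest pain.",
    "Pain radiates to the left arm; onset 40 minutes ago.",
    "ECG shows ST-segment elevation in leads II, III, aVF.",
]
question = "What is the most likely diagnosis?"

print(len(single_shot_messages(findings, question)))  # 1 message
print(len(multi_turn_messages(findings, question)))   # 7 messages
```

The paper's central comparison is then the model's diagnostic accuracy on the first message list versus the second, with the multi-turn condition reportedly incurring the "conversation tax".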