AI & ML · Breaks Assumption

Shows that standard 'wisdom' techniques like Chain-of-Thought and few-shot prompting can actually degrade performance in specialized medical LLMs.

March 30, 2026

Original Paper

When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models

Binesh Sadanandan, Vahid Behzadan

arXiv · 2603.25960

The Takeaway

The study reveals a sharp disconnect between general-purpose and domain-specific LLM behavior, showing accuracy drops of up to 11.9% when using common prompting tricks. Practitioners in specialized fields should pivot toward log-probability 'cloze' scoring, which significantly outperformed all generative prompting strategies.
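To make the recommendation concrete, here is a minimal sketch of log-probability 'cloze' scoring: instead of asking the model to generate an answer, each candidate option is scored by the summed log-probability the model assigns to its tokens, and the highest-scoring option wins. The function names and the toy `token_logprob` table below are illustrative assumptions, not the paper's implementation; in practice the per-token log-probabilities would come from a real model such as MedGemma.

```python
def cloze_score(prompt, option, token_logprob):
    """Score an answer option by summing the model's log-probability
    for each of its tokens, conditioned on the prompt (cloze scoring)."""
    total, context = 0.0, prompt
    for tok in option.split():
        total += token_logprob(context, tok)
        context += " " + tok
    return total

def cloze_select(question, options, token_logprob):
    """Pick the option with the highest total log-probability."""
    prompt = f"Question: {question}\nAnswer:"
    return max(options, key=lambda o: cloze_score(prompt, o, token_logprob))

# Toy stand-in for a real model's per-token log-probabilities (hypothetical).
TOY_TABLE = {"aspirin": -0.2, "placebo": -2.5, "surgery": -3.1}
toy_logprob = lambda ctx, tok: TOY_TABLE.get(tok, -5.0)

best = cloze_select("First-line antiplatelet drug?",
                    ["aspirin", "placebo", "surgery"], toy_logprob)
print(best)  # "aspirin" under the toy table
```

Because every option is scored against the same prompt, this approach sidesteps the format sensitivity that plagues generative answers: there is no free-form output to parse, and the comparison is made directly in the model's probability space.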

From the abstract

Large Language Models (LLMs) are increasingly deployed in medical settings, yet their sensitivity to prompt formatting remains poorly characterized. We evaluate MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests. Our experiments reveal several concerning findings. Chain-of-Thought (CoT) prompting decreases accuracy by 5.7% compared to direct answering. Few-shot examples degrade performance by 11.9% while increasin