AI & ML · Breaks Assumption

Re-evaluating high-profile medical AI safety claims reveals that reported triage failures were artifacts of the 'exam-style' evaluation format rather than model incapacity.

March 13, 2026

Original Paper

Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

David Fraile Navarro, Farah Magrabi, Enrico Coiera

arXiv · 2603.11413

The Takeaway

This is a significant methodological correction to the AI safety literature. It shows that forced-choice (A/B/C/D) evaluation scaffolds can mask correct model reasoning and produce misleading safety conclusions, arguing for a shift toward naturalistic testing in high-stakes domains.
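To make the contrast concrete, here is a minimal sketch of the two prompt formats at issue. The wording of both prompts is a hypothetical illustration, not the benchmark's actual text:

```python
# Hypothetical triage scenario illustrating the two evaluation formats.
# Neither prompt is taken from the paper's benchmark.

# Exam-style scaffold: forces a bare letter, suppresses explanation
# and clarifying questions.
EXAM_STYLE_PROMPT = (
    "A patient reports crushing chest pain radiating to the left arm.\n"
    "Reply with a single letter only. Do not ask questions or explain.\n"
    "A) Self-care at home\n"
    "B) See a GP within a week\n"
    "C) Urgent care within 24 hours\n"
    "D) Call emergency services now"
)

# Naturalistic format: how a consumer would actually phrase the query,
# leaving the model free to ask clarifying questions and justify urgency.
NATURALISTIC_PROMPT = (
    "I'm having crushing chest pain that's spreading to my left arm. "
    "What should I do?"
)
```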

From the abstract

Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol (forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions) that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17…
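A minimal sketch of how a strict forced-choice grader can score a clinically correct free-text answer as an under-triage failure. The regex and example responses are assumptions for illustration, not the study's actual scoring code:

```python
import re

def grade_forced_choice(response: str, correct: str = "D") -> bool:
    """Exam-style scoring: accept only a bare A/B/C/D letter."""
    match = re.fullmatch(r"\s*([ABCD])[).]?\s*", response)
    return bool(match) and match.group(1) == correct

# A bare letter passes.
assert grade_forced_choice("D") is True

# A clinically correct naturalistic answer fails the same grader and
# would be recorded as an under-triage error.
naturalistic_answer = (
    "This could be a heart attack. Call emergency services (option D) "
    "right away; do not drive yourself to the hospital."
)
assert grade_forced_choice(naturalistic_answer) is False
```

Under scoring like this, the model's escalation is correct but invisible to the benchmark, which is the kind of artifact the authors identify.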