AI & ML Paradigm Challenge

A standard industry trick used to make medical AI more accurate is actually making it up to 30% worse.

April 14, 2026

Original Paper

I Can't Believe TTA Is Not Better: When Test-Time Augmentation Hurts Medical Image Classification

Daniel Nobrega Medeiros

arXiv · 2604.09697

The Takeaway

Test-Time Augmentation (TTA) is a 'best practice' routinely deployed in production systems, but this study finds that it can backfire on medical image classification benchmarks. If the result carries over to clinical deployments, some medical AI models that rely on TTA may be less reliable than previously thought.
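To make the mechanism concrete, here is a minimal sketch of TTA in Python with NumPy: the model is run on several augmented copies of one test image and the class probabilities are averaged. Everything in it is an illustrative assumption rather than the paper's setup: `toy_predict_proba` is a hypothetical stand-in for a trained model, the four augmentations are a common but arbitrary choice, and the 4x4 image is synthetic. The toy classifier is deliberately sensitive to left/right orientation, so it shows one way averaging can dilute a correct prediction when an augmentation changes the evidence the model relies on.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def toy_predict_proba(image):
    """Hypothetical 2-class model (not from the paper).

    Class 0 means "left half brighter", class 1 means "right half brighter".
    """
    w = image.shape[1]
    left = image[:, : w // 2].mean()
    right = image[:, w // 2:].mean()
    return softmax(np.array([left - right, right - left]) * 5.0)


# Assumed augmentation set: identity, two flips, one 90-degree rotation.
AUGMENTATIONS = [
    lambda x: x,
    np.fliplr,
    np.flipud,
    lambda x: np.rot90(x, 1),
]


def tta_predict(predict_proba, image, augmentations):
    """Average class probabilities over augmented copies of one image."""
    probs = np.stack([predict_proba(aug(image)) for aug in augmentations])
    return probs.mean(axis=0)


# Synthetic test image whose left half is bright, so the true class is 0.
image = np.zeros((4, 4))
image[:, :2] = 1.0

single = toy_predict_proba(image)
tta = tta_predict(toy_predict_proba, image, AUGMENTATIONS)

print("single-pass probabilities:", single)
print("TTA-averaged probabilities:", tta)
```

In this toy case the horizontal flip reverses the model's evidence and the rotation removes it, so the averaged probability for the correct class ends up lower than the single-pass probability. Whether a real model behaves this way depends on how it was trained and which augmentations are used, which is the question the paper studies empirically.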

From the abstract

Test-time augmentation (TTA)--aggregating predictions over multiple augmented copies of a test input--is widely assumed to improve classification accuracy, particularly in medical imaging where it is routinely deployed in production systems and competition solutions. We present a systematic empirical study challenging this assumption across three MedMNIST v2 benchmarks and four architectures spanning three orders of magnitude in parameter count (21K to 11M). Our principal finding is that TTA wit