AI & ML Nature Is Weird

Small medical AI models will give you a different answer to the same question 97% of the time, revealing a massive 'safety gap.'

April 15, 2026

Original Paper

Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework

arXiv · 2604.10535

The Takeaway

This evaluation of small open LLMs reveals extreme inconsistency in medical QA, with self-agreement reaching as low as 0.20. Even at low temperatures, nearly every output was unique. This is a terrifying prospect for medical use: an AI that gives a different diagnosis every time you ask. It shows that while small models are getting 'smarter,' they aren't getting more 'stable.' For developers, this means that small models are not yet ready for high-stakes healthcare without heavy ensemble methods or massive stability improvements. Reliability is the new bottleneck, not knowledge.
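To make the self-agreement number concrete: one simple way to score consistency (a sketch, not necessarily the paper's exact metric) is to query the model several times with the same prompt and compute the fraction of answer pairs that match after light normalization. A score of 1.0 means every repeat agrees; scores near 0.20 mean most repeats differ.

```python
from itertools import combinations

def self_agreement(answers):
    """Fraction of answer pairs that match exactly after lowercasing and
    collapsing whitespace. 1.0 = perfectly consistent; values near 0 mean
    almost every repeated sample is different."""
    norm = [" ".join(a.lower().split()) for a in answers]
    pairs = list(combinations(norm, 2))
    if not pairs:  # fewer than two samples: trivially consistent
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical example: five repeated queries, only two matching responses.
samples = ["Lisinopril", "lisinopril", "Amlodipine", "Losartan", "Metoprolol"]
print(self_agreement(samples))  # → 0.1
```

Exact-match agreement is the strictest variant; a production evaluation would likely use a softer semantic-equivalence check, since two differently worded answers can still be clinically identical.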

From the abstract

Incorporating large language models (LLMs) in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness.