A standard speed optimization for AI models can make them give different answers than their slower, unoptimized counterparts.
April 20, 2026
Original Paper
The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
arXiv · 2604.15409
The Takeaway
KV caching in FP16 precision introduces a systematic numerical divergence that produced a 100 percent token mismatch across the tested models. Practitioners have long treated caching as a free lunch that affects only speed, never output. This research shows that storing and reusing previous computations changes the generated text itself: the cached and cache-free paths accumulate floating-point additions in different orders, and because FP16 addition is non-associative, small rounding differences in the cache compound until the model decodes a different token and then follows a completely different path. Practitioners must now choose between maximum inference speed and exact fidelity to the outputs their models would produce without the optimization.
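The mechanism is easy to see in miniature. The sketch below is a toy, not the paper's experiment: it accumulates one made-up "logit" in two different orders under FP16 and pits it against a rival token, showing how a single reordered addition can flip a greedy decoding choice. The values and the cached/cache-free labels are illustrative assumptions only.

```python
import numpy as np

# FP16 addition is non-associative: near 1024 the spacing between
# representable values is 1.0, so a trailing 0.5 can vanish.
terms = [np.float16(1024.0), np.float16(0.5), np.float16(0.5)]

# Left-to-right accumulation (standing in for one execution path):
# 1024 + 0.5 rounds back to 1024, and so does the next addition.
logit_a = (terms[0] + terms[1]) + terms[2]   # -> 1024.0

# Grouping the small terms first (standing in for the other path):
# 0.5 + 0.5 = 1.0 survives, and 1024 + 1 = 1025 is representable.
logit_b = terms[0] + (terms[1] + terms[2])   # -> 1025.0

# Against a rival token with logit exactly 1024.0, the two orderings
# decode different tokens (np.argmax breaks the tie toward index 0).
rival = np.float16(1024.0)
token_a = int(np.argmax([rival, logit_a]))   # tie at 1024 -> index 0
token_b = int(np.argmax([rival, logit_b]))   # 1025 > 1024 -> index 1
print(token_a, token_b)                      # prints: 0 1
```

Once one token differs, every subsequent step conditions on a different prefix, which is how a one-ulp rounding difference grows into a completely different continuation.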
From the abstract
KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: cache-ON and cache-OFF execution paths employ different floating-point accumulation orderings which, due to FP16 non-associativity, produce a deterministic divergence in decoded token sequences. Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we obse