A standard speed optimization for AI models can make them give different answers than their slower, unoptimized counterparts.
April 20, 2026
Original Paper
The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
arXiv · 2604.15409
The Takeaway
KV caching in FP16 precision introduces a systematic numerical divergence that produced a 100 percent token mismatch across the tested models. Practitioners have long treated caching as a free lunch that affects only speed, never output. This research shows that storing and reusing previous computations changes the generated text itself: the cached and cache-free paths accumulate floating-point additions in different orders, and because FP16 addition is non-associative, small rounding differences in the cache compound until the model decodes a different token and then follows a completely different path. Practitioners must now choose between maximum inference speed and exact fidelity to the outputs their models would produce without the optimization.
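The mechanism is easy to see in miniature. The sketch below is a toy, not the paper's experiment: it accumulates one made-up "logit" in two different orders under FP16 and pits it against a rival token, showing how a single reordered addition can flip a greedy decoding choice. The values and the cached/cache-free labels are illustrative assumptions only.

```python
import numpy as np

# FP16 addition is non-associative: near 1024 the spacing between
# representable values is 1.0, so a trailing 0.5 can vanish.
terms = [np.float16(1024.0), np.float16(0.5), np.float16(0.5)]

# Left-to-right accumulation (standing in for one execution path):
# 1024 + 0.5 rounds back to 1024, and so does the next addition.
logit_a = (terms[0] + terms[1]) + terms[2]   # -> 1024.0

# Grouping the small terms first (standing in for the other path):
# 0.5 + 0.5 = 1.0 survives, and 1024 + 1 = 1025 is representable.
logit_b = terms[0] + (terms[1] + terms[2])   # -> 1025.0

# Against a rival token with logit exactly 1024.0, the two orderings
# decode different tokens (np.argmax breaks the tie toward index 0).
rival = np.float16(1024.0)
token_a = int(np.argmax([rival, logit_a]))   # tie at 1024 -> index 0
token_b = int(np.argmax([rival, logit_b]))   # 1025 > 1024 -> index 1
print(token_a, token_b)                      # prints: 0 1
```

Once one token differs, every subsequent step conditions on a different prefix, which is how a one-ulp rounding difference grows into a completely different continuation.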
From the abstract
KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: cache-ON and cache-OFF execution paths employ different floating-point accumulation orderings which, due to FP16 non-associativity, produce a deterministic divergence in decoded token sequences. Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we obse