Provides the first controlled study of Silent Data Corruption (SDC) in GPUs and its catastrophic impact on LLM pretraining stability.
April 2, 2026
Original Paper
Exploring Silent Data Corruption as a Reliability Challenge in LLM Training
arXiv · 2604.00726
The Takeaway
The paper demonstrates that hardware-induced faults can mimic benign numerical noise while still causing persistent parameter divergence and loss spikes. The proposed lightweight detection method lets infrastructure teams mitigate SDC-induced failures without the overhead of full hardware redundancy.
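The paper does not publish its detector here, but one common lightweight approach to catching SDC-induced loss spikes is a running-statistics anomaly check on the training loss. The sketch below is a hypothetical illustration of that idea (the class name, window size, and z-score threshold are all assumptions, not the paper's method):

```python
from collections import deque
import math

class SpikeDetector:
    """Hypothetical sketch: flag training-loss values that deviate
    sharply from a running window, as a cheap stand-in for a
    lightweight SDC detector. Not the paper's actual algorithm."""

    def __init__(self, window=64, z_thresh=6.0, warmup=8):
        self.window = deque(maxlen=window)  # recent loss values
        self.z_thresh = z_thresh            # z-score cutoff (assumed value)
        self.warmup = warmup                # steps before checks begin

    def check(self, value):
        # Only test once enough history has accumulated.
        if len(self.window) >= self.warmup:
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(var) or 1e-12   # guard against zero variance
            if abs(value - mean) / std > self.z_thresh:
                return True  # likely spike; keep it out of the window stats
        self.window.append(value)
        return False
```

In practice such a check would gate a rollback to the last known-good checkpoint rather than just log a warning.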
From the abstract
As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining.
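Because SDC bypasses system-level checks, detection has to happen in software. One classic lightweight technique (not necessarily the one used in the paper) is Freivalds' randomized check, which verifies a matrix product in O(n²) work instead of recomputing it in O(n³). A minimal pure-Python sketch, with all function names being illustrative:

```python
import random

def matmul(A, B):
    """Plain O(n^3) matrix product on lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def matvec(M, v):
    """Matrix-vector product."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def freivalds_check(A, B, C, v=None, tol=1e-6):
    """Probabilistically verify C == A @ B by comparing
    A @ (B @ v) against C @ v for a random 0/1 vector v.
    A corrupted entry in C is caught with probability >= 1/2 per trial."""
    if v is None:
        v = [random.choice([0.0, 1.0]) for _ in range(len(C[0]))]
    lhs = matvec(A, matvec(B, v))
    rhs = matvec(C, v)
    return all(abs(x - y) <= tol for x, y in zip(lhs, rhs))
```

Running the check with several independent random vectors drives the false-pass probability down exponentially, which is why checks of this family are attractive when full computational redundancy is too expensive.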