SeriesFusion
Science, curated & edited by AI

A tiny window of just 100 training steps determines whether a Transformer learns to reason or simply memorizes its homework.

Neural network training is not a slow, steady climb toward intelligence but a series of sharp tipping points. Within a narrow span of steps, a model commits either to a general reasoning rule or to brute-force memorization. Miss that window by even a small margin and the result behaves like a lookup table rather than a logical engine. The paper argues that the timing of complexity control matters as much as the data or the architecture itself, which suggests engineers may be wasting enormous compute by failing to nudge models toward reasoning during these fleeting moments.
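To make the idea concrete, here is a minimal PyTorch sketch of time-dependent complexity control: weight decay is held high during an early window of steps and relaxed once the window closes. The window length, decay values, toy model, and placeholder loss are illustrative assumptions, not the paper's actual experimental setup.

```python
import torch
from torch import nn

# Illustrative hyperparameters (assumptions, not the paper's values).
CRITICAL_WINDOW = 100   # the ~100-step window highlighted in the article
DECAY_IN_WINDOW = 1e-2  # strong complexity penalty early in training
DECAY_AFTER = 1e-4      # relaxed penalty once the window closes

model = nn.Linear(32, 32)  # stand-in for a Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              weight_decay=DECAY_IN_WINDOW)

for step in range(1000):
    x = torch.randn(64, 32)
    loss = model(x).pow(2).mean()  # placeholder loss for the sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Once the critical window closes, relax the complexity penalty.
    if step == CRITICAL_WINDOW:
        for group in optimizer.param_groups:
            group["weight_decay"] = DECAY_AFTER
```

The design choice this sketch illustrates is that complexity control becomes a schedule rather than a single static hyperparameter: the penalty is concentrated in the early steps where, per the paper, the reasoning-versus-memorization decision is made.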

Original Paper

Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

Sarwan Ali

arXiv  ·  2605.04396

Recent work has shown that Transformers' compositional generalization is governed by complexity control (initialization scale and weight decay), which steers training toward low-complexity reasoning solutions rather than high-complexity memorization. Existing analyses, however, treat complexity control as a single static hyperparameter choice, leaving open when during training this control is actually decisive. We show that the memorization-versus-reasoning fate of a Transformer is decided within a brief critical window, on the order of 100 training steps.