Reveals that linearized attention never converges to its infinite-width NTK limit in practice, explaining its distinctive 'influence malleability' compared to standard networks.
March 16, 2026
Original Paper
Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics
arXiv · 2603.13085
The Takeaway
The paper challenges the conventional use of kernel frameworks to explain attention, showing that linearized attention's failure to converge to its NTK limit is at once the source of its expressive power and of its specific vulnerability to training-time adversarial attacks.
From the abstract
Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. Using a linearized attention mechanism with an exact correspondence to a data-dependent, Gram-induced kernel, both empirical and theoretical analyses within the Neural Tangent Kernel (NTK) framework show that linearized attention does not converge to its infinite-width NTK limit, …
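To make "non-convergence to the NTK limit" concrete, here is a minimal sketch, not taken from the paper: a single linearized-attention layer whose attention scores are the data-dependent Gram matrix QKᵀ, plus an empirical NTK computed from the network Jacobian. The function names (`linear_attention`, `empirical_ntk`), layer shapes, and training setup are illustrative assumptions. The sketch trains the layer for a few gradient steps and measures how far the empirical kernel drifts from its value at initialization; in the infinite-width NTK limit this drift would vanish, whereas a finite-width linearized-attention layer typically keeps it well above zero.

```python
# Illustrative sketch only; not the paper's implementation.
import jax
import jax.numpy as jnp

def linear_attention(params, X):
    # Linearized attention: no softmax, scores are the Gram matrix of the
    # projected inputs (a data-dependent, Gram-induced kernel).
    Wq, Wk, Wv, w_out = params
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / X.shape[-1]   # data-dependent Gram matrix
    H = scores @ V                   # linear-attention output
    return H @ w_out                 # scalar readout per token

def empirical_ntk(params, X):
    # Empirical NTK: Theta_ij = <d f(x_i)/d theta, d f(x_j)/d theta>,
    # assembled from the per-parameter Jacobians of the network output.
    jac = jax.jacobian(lambda p: linear_attention(p, X))(params)
    J = jnp.concatenate(
        [leaf.reshape(X.shape[0], -1) for leaf in jax.tree_util.tree_leaves(jac)],
        axis=1,
    )
    return J @ J.T

key = jax.random.PRNGKey(0)
n, d = 8, 16
keys = jax.random.split(key, 6)
X = jax.random.normal(keys[0], (n, d))
y = jax.random.normal(keys[1], (n,))
params = [jax.random.normal(k, (d, d)) / jnp.sqrt(d) for k in keys[2:5]] \
       + [jax.random.normal(keys[5], (d,)) / jnp.sqrt(d)]

loss = lambda p: jnp.mean((linear_attention(p, X) - y) ** 2)
grad_loss = jax.grad(loss)

ntk0 = empirical_ntk(params, X)      # kernel at initialization
lr = 1e-2
for _ in range(200):                 # a few plain gradient-descent steps
    g = grad_loss(params)
    params = [p - lr * gp for p, gp in zip(params, g)]

ntk_t = empirical_ntk(params, X)     # kernel after training
drift = jnp.linalg.norm(ntk_t - ntk0) / jnp.linalg.norm(ntk0)
# In the NTK regime this ratio would be near zero; for this finite-width
# linearized-attention layer it typically stays of order one.
print(f"relative NTK drift after training: {float(drift):.3f}")
```

The printed relative drift is the quantity to watch: a constant kernel (the infinite-width NTK limit) would keep it near zero, while a non-trivial value indicates that the data-dependent kernel itself moves during training, which is the behavior the paper ties to influence malleability.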