AI & ML Nature Is Weird

Shrinking your LLM to make it faster can actually make it slower if the new 'shape' of the math upsets your GPU.

April 15, 2026

Original Paper

Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

arXiv · 2604.09595

The Takeaway

The paper identifies 'dimensional misalignment,' a counterintuitive phenomenon in which smaller, compressed models run slower than larger ones on real hardware. We usually assume that fewer parameters mean faster execution, but the paper shows that tensor shapes incompatible with GPU execution patterns create substantial overhead. Model compression, then, isn't just about reducing weights; it's about hardware-aware geometry. This forces a change in how we design 'mobile' or 'edge' models: you can no longer simply prune a model and expect it to run faster without checking the hardware's preferred dimensions.
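To make the 'hardware's math preferences' concrete, here is a minimal sketch of one common mitigation: zero-padding a compressed weight matrix so its dimensions land on GPU-friendly multiples. The multiple of 64 and the helper names are illustrative assumptions, not from the paper; the paper's own remedies may differ.

```python
import numpy as np

def pad_dim(n: int, multiple: int = 64) -> int:
    # Round a tensor dimension up to the nearest multiple.
    # GPU GEMM kernels tend to run fastest when dimensions are
    # multiples of 8/16/64; 64 here is an illustrative assumption.
    return ((n + multiple - 1) // multiple) * multiple

def pad_weight(w: np.ndarray, multiple: int = 64) -> np.ndarray:
    # Zero-pad a 2-D weight matrix so both dims hit the multiple.
    # The extra zeros waste a little memory but keep the matmul
    # shape aligned with the GPU's preferred tile sizes.
    rows, cols = w.shape
    padded = np.zeros((pad_dim(rows, multiple), pad_dim(cols, multiple)),
                      dtype=w.dtype)
    padded[:rows, :cols] = w
    return padded

# A low-rank compression step might leave an irregular inner
# dimension like 3587, which a kernel can't tile cleanly:
w = np.random.rand(4096, 3587).astype(np.float32)
print(pad_weight(w).shape)  # (4096, 3648)
```

In practice you would pad once at load time and fold the padding into the model's layer definitions, so the per-inference cost is only the wasted zero rows/columns rather than a misaligned matmul on every token.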

From the abstract

Post-training compression reduces LLM parameter counts but often produces irregular tensor dimensions that degrade GPU performance -- a phenomenon we call 'dimensional misalignment.' We present a full-stack analysis tracing root causes at three levels: framework, library, and hardware. The key insight is that model inference becomes slower because the resulting dimensions are unfriendly to the GPU execution stack. For example, compressing Llama-3-8B with activation-aware singular value de