Introduces on-the-fly quantization that calibrates to individual prompts during inference, solving the 'domain shift' problem where standard quantization fails on unseen data.
March 25, 2026
Original Paper
TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly
arXiv · 2603.19296
The Takeaway
Standard LLM quantization methods (like GPTQ or AWQ) rely on a static calibration dataset; if the test-time data distribution differs significantly from it, performance collapses. TTQ performs activation-aware compression at runtime instead, so the low-bit weights stay matched to each prompt's activation statistics without any pre-defined calibration set.
From the abstract
To tackle the huge computational demand of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, since these methods rely heavily on calibration data, domain shift issues may arise on unseen downstream tasks. We propose a test-time quantization (TTQ) framework which compresses large models on the fly at inference time to resolve this issue. With an efficient online calibration, instant activation-aware quantization can adapt to every prompt.
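The core idea behind activation-aware quantization, calibrated online, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual method: it uses a generic AWQ-style trick (scale weight columns by the square root of per-channel activation salience measured on the current prompt, quantize, then fold the scales back) with int4 per-output-channel quantization. All function names and the toy dimensions are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int4(w):
    # Symmetric per-output-channel int4 quantization (range [-8, 7]),
    # returned in dequantized form for easy comparison.
    s = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / s), -8, 7) * s

def activation_aware_quantize(w, x_calib):
    # "Online calibration": measure per-input-channel activation salience
    # on the current prompt, scale salient weight columns up before
    # quantizing (so they get finer effective resolution), then fold the
    # scales back so the layer's function is unchanged up to quant error.
    a = np.abs(x_calib).mean(axis=0) + 1e-8   # salience from this prompt
    s = np.sqrt(a)                            # balanced (sqrt) scaling
    return quantize_int4(w * s[None, :]) / s[None, :]

# Toy linear layer where a few input channels carry outlier activations,
# as is typical for LLM hidden states.
d_in, d_out, n_tok = 64, 32, 16
w = rng.normal(size=(d_out, d_in))
x = rng.normal(size=(n_tok, d_in))
x[:, :4] *= 50.0                              # outlier channels

y_ref = x @ w.T
err_plain = np.linalg.norm(y_ref - x @ quantize_int4(w).T)
err_aware = np.linalg.norm(y_ref - x @ activation_aware_quantize(w, x).T)
print(err_plain, err_aware)  # activation-aware calibration should shrink the error
```

Because the calibration statistics come from the prompt itself rather than a fixed offline set, the scaling automatically tracks whatever activation distribution the test-time input produces, which is the property TTQ exploits.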