Introduces on-the-fly quantization that calibrates to individual prompts during inference, solving the 'domain shift' problem where standard quantization fails on unseen data.
March 25, 2026
Original Paper
TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly
arXiv · 2603.19296
The Takeaway
Standard LLM quantization methods (like GPTQ or AWQ) rely on a static calibration dataset; if the test-time data distribution differs significantly from it, performance collapses. TTQ performs activation-aware compression at runtime instead, so the low-bit weights stay matched to each prompt's activation statistics without any pre-defined calibration set.
From the abstract
To tackle the huge computational demand of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, since these methods rely heavily on calibration data, domain shift issues may arise on unseen downstream tasks. We propose a test-time quantization (TTQ) framework which compresses large models on the fly at inference time to resolve this issue. With an efficient online calibration, instant activation-aware quantization can adapt to every prompt.
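The core idea behind activation-aware quantization, calibrated online, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual method: it uses a generic AWQ-style trick (scale weight columns by the square root of per-channel activation salience measured on the current prompt, quantize, then fold the scales back) with int4 per-output-channel quantization. All function names and the toy dimensions are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int4(w):
    # Symmetric per-output-channel int4 quantization (range [-8, 7]),
    # returned in dequantized form for easy comparison.
    s = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / s), -8, 7) * s

def activation_aware_quantize(w, x_calib):
    # "Online calibration": measure per-input-channel activation salience
    # on the current prompt, scale salient weight columns up before
    # quantizing (so they get finer effective resolution), then fold the
    # scales back so the layer's function is unchanged up to quant error.
    a = np.abs(x_calib).mean(axis=0) + 1e-8   # salience from this prompt
    s = np.sqrt(a)                            # balanced (sqrt) scaling
    return quantize_int4(w * s[None, :]) / s[None, :]

# Toy linear layer where a few input channels carry outlier activations,
# as is typical for LLM hidden states.
d_in, d_out, n_tok = 64, 32, 16
w = rng.normal(size=(d_out, d_in))
x = rng.normal(size=(n_tok, d_in))
x[:, :4] *= 50.0                              # outlier channels

y_ref = x @ w.T
err_plain = np.linalg.norm(y_ref - x @ quantize_int4(w).T)
err_aware = np.linalg.norm(y_ref - x @ activation_aware_quantize(w, x).T)
print(err_plain, err_aware)  # activation-aware calibration should shrink the error
```

Because the calibration statistics come from the prompt itself rather than a fixed offline set, the scaling automatically tracks whatever activation distribution the test-time input produces, which is the property TTQ exploits.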