GlowQ introduces group-shared low-rank approximations to speed up quantized LLM inference by up to 37%.
March 27, 2026
Original Paper
GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs
arXiv · 2603.25385
The Takeaway
Unlike previous low-rank correction methods that add overhead to every layer, GlowQ reuses correction modules across groups of layers that share the same input. This yields a significant boost to throughput and time to first token (TTFT) while actually improving accuracy over standard 4-bit quantization.
From the abstract
Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used as a standard method for deploying large language models, but they often degrade accuracy at low-bit representations, e.g., 4 bits. Low-rank correction methods (e.g., LQER, QERA, ASER) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared…
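The sharing idea in the takeaway can be sketched numerically. This is an illustrative reconstruction, not GlowQ's actual algorithm: the naive 4-bit quantizer, the rank choice, and the joint-SVD factorization below are all assumptions. The point is the mechanism: layers that consume the same input (e.g. the q/k/v projections in a decoder block) can share the input-side factor of a low-rank error correction, so the correction's down-projection is computed once per group instead of once per layer.

```python
import numpy as np

rng = np.random.default_rng(0)


def quantize_4bit(W):
    """Naive symmetric 4-bit quantization (illustrative only)."""
    scale = np.abs(W).max() / 7.0
    return np.clip(np.round(W / scale), -8, 7) * scale  # dequantized


d, r = 64, 8
x = rng.standard_normal(d)

# Three layers that read the same input, e.g. q/k/v projections.
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
Wqs = [quantize_4bit(W) for W in Ws]

# Group-shared low-rank correction: stack the quantization errors of the
# group and factor them jointly, yielding one shared input-side factor
# B_shared and a small per-layer output-side factor A_i.
E_stack = np.vstack([W - Wq for W, Wq in zip(Ws, Wqs)])  # (3*d, d)
U, S, Vt = np.linalg.svd(E_stack, full_matrices=False)
B_shared = Vt[:r, :]                                     # (r, d), shared
As = np.split(U[:, :r] * S[:r], 3, axis=0)               # (d, r) each

# Inference: the down-projection B_shared @ x is computed once per group,
# then each layer applies only its cheap per-layer up-projection.
z = B_shared @ x
ys = [Wq @ x + A @ z for Wq, A in zip(Wqs, As)]

# The rank-r joint factorization strictly reduces the stacked error norm.
frob_before = np.linalg.norm(E_stack)
frob_after = np.linalg.norm(E_stack - np.vstack(As) @ B_shared)
```

Compared with per-layer correction, the group pays for one rank-r down-projection instead of three, which is where the latency saving would come from; the actual grouping, calibration data, and factorization in the paper may differ from this sketch.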