A new scheduling framework lets a single consumer GPU run exact attention over billion-token sequences that would otherwise require a multi-GPU cluster.
April 23, 2026
Original Paper
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
arXiv · 2604.20819
The Takeaway
Stream-CQSA tackles a key memory bottleneck that has limited the context window of AI models. By decomposing the attention operation into independent tasks, the system avoids the out-of-memory errors that plague long-document processing. Because the computation remains exact attention, there is no loss in accuracy; context length becomes bounded by compute time rather than by device memory, so arbitrarily long contexts are possible in principle on standard hardware. Practitioners would no longer need large server clusters to analyze entire libraries of data at once, which could broaden access to long-context AI in areas such as legal discovery and genomic sequencing.
From the abstract
The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention […]
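The paper's CQS Divide operation is not spelled out in the excerpt above, but the general principle it builds on, computing exact attention over independent key/value chunks so that peak memory is bounded by the chunk size rather than the sequence length, can be illustrated with a standard blockwise online-softmax sketch (the same idea used by FlashAttention-style kernels). The function names and chunk size below are illustrative, not from the paper:

```python
import numpy as np

def streaming_attention(q, k, v, chunk=128):
    """Exact attention computed chunk by chunk with an online softmax.

    Only one (len(q), chunk) score block is materialized at a time, so
    peak memory scales with the chunk size, not the full key length.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full(q.shape[0], -np.inf)   # running row-wise max of scores
    l = np.zeros(q.shape[0])           # running softmax denominator
    out = np.zeros_like(q)             # running unnormalized output
    for start in range(0, k.shape[0], chunk):
        kc, vc = k[start:start + chunk], v[start:start + chunk]
        s = (q @ kc.T) * scale                    # scores for this chunk only
        m_new = np.maximum(m, s.max(axis=1))
        alpha = np.exp(m - m_new)                 # rescale old accumulators
        p = np.exp(s - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ vc
        m = m_new
    return out / l[:, None]

def full_attention(q, k, v):
    """Reference: materializes the full (len(q), len(k)) score matrix."""
    s = (q @ k.T) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
assert np.allclose(streaming_attention(q, k, v), full_attention(q, k, v))
```

The chunked and full results match to floating-point tolerance, which is the sense in which such decompositions trade no accuracy for their memory savings; the paper's contribution lies in how the chunks are scheduled across workers via cyclic quorum sets, which this sketch does not model.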