AI & ML Practical Magic

A new scheduling framework allows a single consumer GPU to process billion-token sequences that would normally require a supercomputer.

April 23, 2026

Original Paper

Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

Yiming Bian, Joshua M. Akey

arXiv · 2604.20819

The Takeaway

Stream-CQSA tackles the memory wall that limits the context window of attention-based models. By decomposing exact self-attention into independent tasks that can be scheduled to fit available device memory, the system avoids the out-of-memory errors that plague long-document processing. Because the attention it computes is exact, there is no loss in accuracy. Practitioners would no longer need massive server clusters to analyze extremely long sequences on a single GPU, which could matter for applications from legal discovery to genomic sequencing.
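The core idea of processing attention piece by piece can be illustrated with a generic streaming (online-softmax) sketch. This is not the paper's CQS Divide operation, just a minimal NumPy example of how exact attention can be computed over key/value chunks without ever materializing the full N x N score matrix:

```python
import numpy as np

def streaming_attention(q, k, v, chunk=1024):
    """Exact softmax attention, computed over K/V chunks with an
    online softmax so the full (N x N) score matrix never exists
    in memory. Generic illustration, not the paper's scheduler."""
    n, d = q.shape
    m = np.full(n, -np.inf)           # running row-wise max of scores
    l = np.zeros(n)                   # running softmax denominator
    out = np.zeros_like(q)            # running weighted sum of values
    for start in range(0, k.shape[0], chunk):
        kc, vc = k[start:start + chunk], v[start:start + chunk]
        s = q @ kc.T / np.sqrt(d)     # scores for this chunk only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)     # rescale earlier accumulators
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ vc
        m = m_new
    return out / l[:, None]
```

Each chunk is an independent unit of work whose peak memory depends on the chunk size, not on the sequence length, which is the property that makes flexible scheduling possible.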

From the abstract

The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention …
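To put the quadratic cost in perspective, a back-of-the-envelope calculation (assuming fp16 scores at 2 bytes per element) shows why the full score matrix cannot be materialized at these sequence lengths:

```python
def score_matrix_bytes(n_tokens, bytes_per_elem=2):
    """Memory needed to materialize the full n x n attention-score
    matrix, assuming fp16 (2-byte) elements."""
    return n_tokens ** 2 * bytes_per_elem

# One million tokens: 2 TB just for scores -- beyond any single GPU.
print(score_matrix_bytes(1_000_000) / 1e12)       # → 2.0 (terabytes)
# One billion tokens: 2 exabytes.
print(score_matrix_bytes(1_000_000_000) / 1e18)   # → 2.0 (exabytes)
```

This is why methods that stream or decompose the computation, rather than shrinking the tensors, are the only route to exact attention at billion-token scale.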