Introduces Helium, a serving framework that treats agentic workflows as data query plans, eliminating redundant LLM calls and reusing KV caches across calls.
March 18, 2026
Original Paper
Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective
arXiv · 2603.16104
The Takeaway
Current serving systems optimize each call in isolation; Helium optimizes across the entire agentic loop, achieving a 1.56x speedup by proactively caching and scheduling the interdependent calls common in multi-agent systems.
From the abstract
Agentic workflows are composed of sequences of interdependent Large Language Model (LLM) calls, and they have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy from overlapping prompts and intermediate results due to speculative and parallel exploration. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies. This paper rethinks LLM and agent …
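To make the cross-call redundancy concrete: when several calls in an agentic loop extend the same prompt prefix (a shared system prompt, tool list, or plan), a prefix-aware cache only needs to prefill the suffix that differs. The sketch below is a toy illustration of that idea, not Helium's implementation; the class name, whitespace tokenization, and token counting as a proxy for prefill cost are all simplifying assumptions.

```python
class PrefixKVCache:
    """Toy prefix cache: calls that share a prompt prefix reuse its
    (simulated) KV state and only 'compute' the uncached suffix."""

    def __init__(self):
        self._cache = {}          # token-prefix tuple -> cached marker
        self.tokens_computed = 0  # proxy for total prefill work

    def prefill(self, prompt):
        """Return how many new tokens had to be computed for this call."""
        tokens = prompt.split()   # stand-in for real tokenization
        # Find the longest already-cached prefix of this prompt.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._cache:
                best = n
                break
        # Only the uncached suffix incurs prefill cost.
        new_tokens = len(tokens) - best
        self.tokens_computed += new_tokens
        # Cache every prefix of the new prompt for future calls.
        for n in range(best + 1, len(tokens) + 1):
            self._cache[tuple(tokens[:n])] = True
        return new_tokens


cache = PrefixKVCache()
# Two interdependent calls sharing a long common prefix:
first = cache.prefill("You are a planner agent. Task: step one")
second = cache.prefill("You are a planner agent. Task: step two")
# first == 8 (full prefill); second == 1 (only the differing suffix)
```

A per-call system like vanilla vLLM would prefill both prompts in full (16 token-units of work here); the cross-call view cuts that to 9. Helium generalizes this by planning the whole workflow ahead of time, so shared prefixes can be cached and scheduled proactively rather than discovered opportunistically.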