Introduces Helium, a serving framework that treats agentic workflows as data query plans, eliminating redundant LLM calls and reusing KV caches across calls.
March 18, 2026
Original Paper
Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective
arXiv · 2603.16104
The Takeaway
Current serving systems optimize each call in isolation; Helium optimizes across the entire agentic loop, achieving a 1.56x speedup by proactively caching and scheduling the interdependent calls common in multi-agent systems.
From the abstract
Agentic workflows are composed of sequences of interdependent Large Language Model (LLM) calls, and they have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy from overlapping prompts and intermediate results due to speculative and parallel exploration. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies. This paper rethinks LLM and agent …
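To make the cross-call redundancy concrete: when several calls in an agentic loop extend the same prompt prefix (a shared system prompt, tool list, or plan), a prefix-aware cache only needs to prefill the suffix that differs. The sketch below is a toy illustration of that idea, not Helium's implementation; the class name, whitespace tokenization, and token counting as a proxy for prefill cost are all simplifying assumptions.

```python
class PrefixKVCache:
    """Toy prefix cache: calls that share a prompt prefix reuse its
    (simulated) KV state and only 'compute' the uncached suffix."""

    def __init__(self):
        self._cache = {}          # token-prefix tuple -> cached marker
        self.tokens_computed = 0  # proxy for total prefill work

    def prefill(self, prompt):
        """Return how many new tokens had to be computed for this call."""
        tokens = prompt.split()   # stand-in for real tokenization
        # Find the longest already-cached prefix of this prompt.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._cache:
                best = n
                break
        # Only the uncached suffix incurs prefill cost.
        new_tokens = len(tokens) - best
        self.tokens_computed += new_tokens
        # Cache every prefix of the new prompt for future calls.
        for n in range(best + 1, len(tokens) + 1):
            self._cache[tuple(tokens[:n])] = True
        return new_tokens


cache = PrefixKVCache()
# Two interdependent calls sharing a long common prefix:
first = cache.prefill("You are a planner agent. Task: step one")
second = cache.prefill("You are a planner agent. Task: step two")
# first == 8 (full prefill); second == 1 (only the differing suffix)
```

A per-call system like vanilla vLLM would prefill both prompts in full (16 token-units of work here); the cross-call view cuts that to 9. Helium generalizes this by planning the whole workflow ahead of time, so shared prefixes can be cached and scheduled proactively rather than discovered opportunistically.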