A new inference paradigm cuts AI reasoning costs by an order of magnitude while making the models significantly faster at solving complex problems.
April 25, 2026
Original Paper
HARLM: Hierarchical Adaptive Recursive Language Models: Achieving Order-of-Magnitude Cost Reduction and Performance Gains Through Parallel Speculative Execution and Learned Context Routing
SSRN · 6397658
The Takeaway
Hierarchical Adaptive Recursive Language Models (HARLM) use parallel speculative execution to remove the latency bottleneck of recursive inference. The system routes context to sub-queries dynamically and explores candidate answers speculatively rather than one at a time, finding answers faster and more cheaply. The result is an order-of-magnitude reduction in the cost of running a high-reasoning model. Previous recursive architectures were too expensive for wide-scale deployment on complex tasks; this approach aims to make advanced AI reasoning cheap enough for everyday business applications.
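To make the parallelism concrete: the key change over sequential recursive inference is fanning sub-queries out over context chunks concurrently and keeping the first useful result. The sketch below illustrates that idea only; `stub_lm` and the chunk-matching logic are hypothetical stand-ins, not the paper's actual routing or model calls.

```python
from concurrent.futures import ThreadPoolExecutor

def stub_lm(query: str, context: str) -> str:
    # Hypothetical stand-in for a real language-model call:
    # "answers" only when the query term appears in the chunk.
    return f"answer({query})" if query in context else "MISS"

def recursive_answer(query, chunks, lm=stub_lm, max_workers=4):
    """Fan sub-queries out over context chunks in parallel instead of
    sequentially -- the latency fix the abstract describes (sketch only)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # All chunk-level sub-queries run concurrently.
        partials = list(pool.map(lambda chunk: lm(query, chunk), chunks))
    # Keep the first speculative result that is not a miss.
    hits = [p for p in partials if p != "MISS"]
    return hits[0] if hits else "no answer"

chunks = ["alpha beta", "gamma delta", "delta epsilon"]
print(recursive_answer("delta", chunks))  # → answer(delta)
```

With sequential execution, latency grows linearly with the number of chunks; here the wall-clock cost of a fan-out is roughly one model call, which is the source of the speedup the article claims.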
From the abstract
I introduce Hierarchical Adaptive Recursive Language Models (HARLM), a fundamentally reimagined inference paradigm that extends and dramatically improves upon Recursive Language Models (RLMs). While RLMs demonstrated that treating prompts as external environment variables enables processing of arbitrarily long contexts, they suffer from three critical limitations: (1) synchronous sequential execution creating latency bottlenecks, (2) fixed shallow recursion depth limiting expressiveness, and (3)