Speculative Actions: A Lossless Framework for Faster AI Agents
Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, Tianyi Peng
We introduce speculative actions, a lossless framework that uses faster models to predict likely actions, enabling multiple API calls to execute in parallel and thus yielding substantial acceleration.
Abstract
AI agents are increasingly deployed in complex, interactive environments, yet their runtime remains a major bottleneck for training, evaluation, and real-world use. Typical agent behavior unfolds sequentially, where each action requires an API call that can incur substantial latency. For example, a game of chess between two state-of-the-art agents can take hours. We introduce speculative actions, a lossless acceleration framework for general agentic systems. Inspired by speculative execution in microprocessors and speculative decoding in LLM inference, our method uses faster models to predict likely future actions and executes them in parallel, committing results only when the predictions match. We evaluate speculative actions across gaming, e-commerce, and web search environments, and additionally study a lossy extension in an operating systems setting. Across domains, we achieve up to 55% next-action prediction accuracy, which translates into substantial latency reductions. Finally, we present a cost-latency analysis that formalizes the tradeoff between speculative breadth and time savings. This analysis enables principled tuning and selective branch launching, ensuring that multi-branch speculation delivers practical speedups without prohibitive cost growth.
Speculative Actions accelerates agent systems by predicting and executing likely future actions in parallel.
- Lossless framework breaking sequential interaction loops through prediction and parallelization
- Treats every step (LLM call, tool invocation, MCP request) as an API call subject to prediction and parallelization
- Up to 55% next-action prediction accuracy translating to substantial latency reductions
- Cost-latency analysis formalizing tradeoffs between speculative breadth and time savings
- Speculative execution
- Action prediction
- Parallelization
- Multi-branch speculation
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- AI Agents
- Speculative Decoding
- Parallel Execution
- Agentic Serving
- Agentic Simulation
Related orals
TileLang: Bridge Programmability and Performance in Modern Neural Kernels
TileLang enables hardware-aware fused kernel programming with tile inference, achieving 5-6x speedups on recommendation workloads.
Probabilistic Kernel Function for Fast Angle Testing
Proposes probabilistic kernel functions for angle testing enabling efficient approximate nearest neighbor search.
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Generates minute-long high-resolution videos efficiently with linear attention and constant-memory KV cache for block autoregression.
Efficient Resource-Constrained Training of Transformers via Subspace Optimization
WASI applies subspace-based training to transformer models reducing memory by 62x and FLOPs by 2x while maintaining accuracy on edge devices.
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Analyzes low-precision flash attention training failure caused by low-rank representations and biased BF16 rounding errors.