OpenApps: Simulating Environment Variations to Measure UI Agent Reliability
Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim
We introduce a new environment, OpenApps, for generating thousands of versions of apps to test UI agent reliability.
Abstract
Reliability is key to realizing the promise of autonomous UI agents, multimodal agents that directly interact with the apps humans use, since users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, which can only shed light on whether or how often an agent completes a task within that specific environment. When deployed, however, agents are likely to encounter variations in app design and content that affect their ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we develop OpenApps, a lightweight open-source ecosystem of six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of thousands of versions of each app. Specifically, we run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents. We find that while standard reliability within a fixed app is relatively stable, reliability can vary drastically when measured across app variations: task success rates for many agents fluctuate by more than 50% across app versions. For example, Kimi-VL-3B's average success across all tasks ranges from 63% down to just 4% depending on the app version. We also find that agent behaviors such as looping or hallucinating actions can differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations.
OpenApps testbed reveals UI agent reliability varies drastically across app variations despite stable within-environment performance.
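The page does not show OpenApps' actual configuration API. As a purely illustrative sketch (all names, factors, and values below are hypothetical, not from the paper), generating many variants of an app can be thought of as sampling from a cross product of appearance and content factors:

```python
import itertools
import random

# Hypothetical configuration axes for one app; OpenApps' real factors
# and values may differ (illustrative only).
THEMES = ["light", "dark", "high-contrast"]
LAYOUTS = ["list", "grid"]
CONTENT_SEEDS = range(100)  # seeds that populate app content (contacts, events, ...)

def sample_app_configs(n, rng=None):
    """Draw n distinct app-variant configurations from the factor cross product."""
    rng = rng or random.Random(0)
    space = list(itertools.product(THEMES, LAYOUTS, CONTENT_SEEDS))
    picks = rng.sample(space, n)  # sampling without replacement keeps variants distinct
    return [{"theme": t, "layout": l, "content_seed": s} for t, l, s in picks]

configs = sample_app_configs(5)
print(len(configs))  # 5
```

Because each factor is varied independently here, this sketch also mirrors the paper's stated limitation that variation factors were studied one at a time rather than in interaction.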
- Flexible simulator for generating thousands of app versions with configurable appearance and content
- Over 10,000 independent evaluations studying reliability across 7 leading multimodal agents
- Demonstration that task success rates fluctuate >50% across app variations
- First systematic evaluation of agent reliability under varying environment configurations
- UI agent evaluation
- Environment simulation
- Reliability measurement
- Multimodal agent benchmarking
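The reliability fluctuation highlighted above can be summarized as the spread between an agent's best and worst success rates across app variants. A minimal sketch (the per-variant numbers are made up for illustration, except that the 63% and 4% endpoints echo the Kimi-VL-3B example from the abstract):

```python
# Hypothetical per-app-variant success rates for a single agent.
per_variant_success = {
    "variant_a": 0.63,
    "variant_b": 0.41,
    "variant_c": 0.04,
}

rates = per_variant_success.values()
spread = max(rates) - min(rates)  # fluctuation across app variations
print(f"spread = {spread:.0%}")  # spread = 59%
```

A spread above 0.50 corresponds to the paper's claim that many agents' task success rates fluctuate by more than 50% across app variations, even when within-app reliability looks stable.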
Limitations
- Focus on simple tasks requiring few steps that do not represent full real-world complexity
- Each app variation factor studied independently without considering interactions
- Focus on autonomous agents without human validation or interaction
Future directions
- Extend task set to include more complex and longer-horizon tasks
- Study interactions between multiple app variation factors
- Use OpenApps for scaling agent training pipelines via RL fine-tuning
- Explore sim-to-real transfer and generalization across app variations
Author keywords
- reinforcement learning
- agents
- environment
- reliability
Related orals
Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning
SparseRL leverages deep RL and pretrained models to generate high-performance CUDA code for sparse matrix operations.
Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
DECS framework reduces reasoning model overthinking by decoupling necessary from redundant tokens via curriculum scheduling.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses RL-trained memory modules to enable LLMs to extrapolate from 8K to 3.5M token contexts with minimal performance degradation.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
DiffusionNFT enables efficient online reinforcement learning for diffusion models via forward process optimization with up to 25x efficiency gains.
Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
Hyperparameter Trajectory Inference uses conditional Lagrangian optimal transport to reconstruct neural network outputs across hyperparameter spectra without expensive retraining.