ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang
Abstract
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research.
ScaleCUA scales open-source computer use agents with a cross-platform dataset and a dual-loop data pipeline.
- Large-scale multi-platform dataset spanning 6 operating systems and 3 task domains
- Dual-loop data pipeline uniting automated agents with human experts for annotation (see the sketch after this list)
- Flexible inference paradigms for scalable agent framework integration
- Sets new state-of-the-art results on MMBench-GUI L1-Hard, OSWorld-G, and WebArena-Lite-v2
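The dual-loop idea, as summarized above, can be pictured as an inner automated loop (agent rollout plus verification) and an outer human loop (expert correction of failed trajectories), with both feeding a growing verified dataset. The sketch below is a minimal, hypothetical illustration under those assumptions; names such as `DualLoopPipeline`, `agent.rollout`, `verifier.check`, and `annotators.correct` are illustrative and not the paper's actual API.

```python
from dataclasses import dataclass, field

# Minimal, hypothetical sketch of a dual-loop data pipeline:
# an automated agent proposes GUI trajectories, a verifier filters them,
# and rejected ones are routed to human experts; accepted and corrected
# trajectories feed the next round of training and collection.
# All names here are illustrative, not the paper's actual interface.

@dataclass
class Trajectory:
    task: str
    steps: list          # sequence of (screenshot, action) pairs
    verified: bool = False

@dataclass
class DualLoopPipeline:
    dataset: list = field(default_factory=list)

    def agent_loop(self, tasks, agent, verifier):
        """Inner loop: automated rollout and verification."""
        pending_review = []
        for task in tasks:
            traj = agent.rollout(task)        # VLM agent executes the task in a GUI environment
            if verifier.check(traj):          # e.g., rule-based or model-based success check
                traj.verified = True
                self.dataset.append(traj)
            else:
                pending_review.append(traj)   # hand off to human experts
        return pending_review

    def human_loop(self, pending, annotators):
        """Outer loop: human experts correct or re-demonstrate failed trajectories."""
        for traj in pending:
            fixed = annotators.correct(traj)  # manual correction or re-annotation
            if fixed is not None:
                fixed.verified = True
                self.dataset.append(fixed)

    def run_round(self, tasks, agent, verifier, annotators):
        pending = self.agent_loop(tasks, agent, verifier)
        self.human_loop(pending, annotators)
        return self.dataset                   # used to retrain the agent for the next round
```

In this reading, each round enlarges the verified dataset and the retrained agent seeds the next round of collection; the limitations listed below note that this self-improving aspect is not yet fully explored.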
- Vision-language models
- Closed-loop data pipeline
- Multi-platform training
- Computer use agents
- WebArena-Lite-v2
- ScreenSpot-Pro
- MMBench-GUI L1-Hard
- OSWorld-G
Limitations
- Automatic data collection with iterative refinement in a self-improving loop remains insufficiently explored (from the paper)
- Advanced agentic techniques such as reflection and reinforcement learning are not employed, though they would likely improve performance (from the paper)
- The flat history design cannot fully capture long-term dependencies (from the paper)
Future directions
Authors did not state explicit future directions.
Author keywords
- GUI Agent
- GUI Data Pipeline
- Computer Use
- Open Source
Related orals
Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching
MASK aligns semantic knowledge between images and text using word embeddings as bridges to match out-of-distribution words in unpaired matching.
VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Presents VibeVoice for zero-shot expressive long-form multi-speaker podcast generation using next-token diffusion.
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
UALM is a unified audio language model that handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason for cross-modal generative reasoning.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed uses learnable meta tokens with matryoshka training to enable test-time scaling for multimodal retrieval, balancing quality and efficiency.
BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
BioX-Bridge enables parameter-efficient cross-modal knowledge transfer across biosignals using lightweight prototype-based bridge networks between foundation models.