ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang
Abstract
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research.
ScaleCUA scales open-source computer use agents with a cross-platform dataset and a dual-loop data pipeline.
- Large-scale multi-platform dataset spanning 6 operating systems and 3 task domains
- Dual-loop data pipeline uniting automated agents with human experts for annotation (see the sketch after this list)
- Flexible inference paradigms for scalable agent framework integration
- Sets new state-of-the-art results on MMBench-GUI L1-Hard, OSWorld-G, and WebArena-Lite-v2
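The dual-loop idea, as summarized above, can be pictured as an inner automated loop (agent rollout plus verification) and an outer human loop (expert correction of failed trajectories), with both feeding a growing verified dataset. The sketch below is a minimal, hypothetical illustration under those assumptions; names such as `DualLoopPipeline`, `agent.rollout`, `verifier.check`, and `annotators.correct` are illustrative and not the paper's actual API.

```python
from dataclasses import dataclass, field

# Minimal, hypothetical sketch of a dual-loop data pipeline:
# an automated agent proposes GUI trajectories, a verifier filters them,
# and rejected ones are routed to human experts; accepted and corrected
# trajectories feed the next round of training and collection.
# All names here are illustrative, not the paper's actual interface.

@dataclass
class Trajectory:
    task: str
    steps: list          # sequence of (screenshot, action) pairs
    verified: bool = False

@dataclass
class DualLoopPipeline:
    dataset: list = field(default_factory=list)

    def agent_loop(self, tasks, agent, verifier):
        """Inner loop: automated rollout and verification."""
        pending_review = []
        for task in tasks:
            traj = agent.rollout(task)        # VLM agent executes the task in a GUI environment
            if verifier.check(traj):          # e.g., rule-based or model-based success check
                traj.verified = True
                self.dataset.append(traj)
            else:
                pending_review.append(traj)   # hand off to human experts
        return pending_review

    def human_loop(self, pending, annotators):
        """Outer loop: human experts correct or re-demonstrate failed trajectories."""
        for traj in pending:
            fixed = annotators.correct(traj)  # manual correction or re-annotation
            if fixed is not None:
                fixed.verified = True
                self.dataset.append(fixed)

    def run_round(self, tasks, agent, verifier, annotators):
        pending = self.agent_loop(tasks, agent, verifier)
        self.human_loop(pending, annotators)
        return self.dataset                   # used to retrain the agent for the next round
```

In this reading, each round enlarges the verified dataset and the retrained agent seeds the next round of collection; the limitations listed below note that this self-improving aspect is not yet fully explored.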
- Vision-language models
- Closed-loop data pipeline
- Multi-platform training
- Computer use agents
- WebArena-Lite-v2
- ScreenSpot-Pro
- MMBench-GUI L1-Hard
- OSWorld-G
Limitations
- Automatic data collection with iterative refinement in a self-improving loop remains insufficiently explored (from the paper)
- Advanced agentic techniques such as reflection and reinforcement learning are not employed, though they would likely improve performance (from the paper)
- The flat history design cannot fully capture long-term dependencies (from the paper)
Future directions
Authors did not state explicit future directions.
Author keywords
- GUI Agent
- GUI Data Pipeline
- Computer Use
- Open Source
Related orals
Multimodal Aligned Semantic Knowledge for Unpaired Image-text Matching
MASK aligns semantic knowledge between images and text using word embeddings as bridges to match out-of-distribution words in unpaired matching.
VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Presents VibeVoice for zero-shot expressive long-form multi-speaker podcast generation using next-token diffusion.
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
UALM is a unified audio language model that handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason for cross-modal generative reasoning.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed uses learnable meta tokens with matryoshka training to enable test-time scaling for multimodal retrieval, balancing quality and efficiency.
BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
BioX-Bridge enables parameter-efficient cross-modal knowledge transfer across biosignals using lightweight prototype-based bridge networks between foundation models.