ICLR 2026 Orals

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang

Multimodal & Speech · Fri, Apr 24 · 10:30 AM–10:40 AM · Amphitheater · Avg rating: 6.80 (6–10)

Abstract

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research.

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

ScaleCUA scales open-source computer use agents with a cross-platform dataset and a dual-loop data pipeline.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Large-scale multi-platform dataset spanning 6 operating systems and 3 task domains
  • Dual-loop data pipeline uniting automated agents with human experts for annotation
  • Flexible inference paradigms for scalable agent framework integration
  • Sets new state-of-the-art results on MMBench-GUI L1-Hard (94.4%), OSWorld-G (60.6%), and WebArena-Lite-v2 (47.4%)
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Vision-language models
  • Closed-loop data pipeline
  • Multi-platform training
  • Computer use agents
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • WebArena-Lite-v2
  • ScreenSpot-Pro
  • MMBench-GUI L1-Hard
  • OSWorld-G
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Automatic data collection with iterative refinement in a self-improving loop is insufficiently explored
    from the paper
  • Advanced agentic techniques like reflection and RL not employed but likely to improve performance
    from the paper
  • Flat history design cannot fully capture long-term dependencies
    from the paper
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • GUI Agent
  • GUI Data Pipeline
  • Computer Use
  • Open Source
