Computer-use agents face a fundamental limitation: they rely exclusively on primitive GUI actions (click, type, scroll), creating brittle execution chains prone to cascading failures. While API-driven agents harness rich capabilities through structured interfaces and tools, computer-use agents remain constrained to low-level visual interactions. We present UltraCUA, a foundation model that transcends this limitation through a hybrid action space that seamlessly unifies primitive GUI operations with high-level tool execution. Our innovation rests on four critical advances. First, an automated pipeline extracts and scales tool capabilities from software documentation and code repositories. Second, a synthetic data engine produces over 17,000 verifiable tasks capturing real-world computer-use complexity. Third, comprehensive hybrid-action trajectory collection incorporates both GUI primitives and strategic tool calls. Fourth, a two-stage training methodology combines supervised fine-tuning with online reinforcement learning, enabling intelligent action selection between GUI and API. Evaluation of our 7B and 32B UltraCUA models reveals transformative performance gains. On OSWorld, UltraCUA achieves a 22% relative improvement on average while executing 11% faster than existing approaches. Cross-domain validation on WindowsAgentArena demonstrates robust generalization with a 21.7% success rate, surpassing Windows-trained baselines. The hybrid action paradigm proves essential, reducing error propagation while improving execution efficiency. This work establishes a scalable paradigm bridging primitive GUI interactions and high-level tool intelligence, enabling more resilient and adaptable computer-use agents for diverse environments and complex real-world tasks.
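To make the hybrid action idea concrete, the sketch below shows one way an action space could unify primitive GUI operations with structured tool calls. It is a minimal illustration under our own assumptions, not UltraCUA's actual interface; the class names, fields, and the `calendar.create_event` tool are hypothetical.

```python
# Minimal sketch of a hybrid action space: at each step the agent emits either a
# low-level GUI primitive or a high-level tool call. All names here are
# illustrative assumptions, not the paper's actual API.
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class GUIAction:
    """A primitive screen-level operation such as click, type, or scroll."""
    kind: str                      # e.g. "click", "type", "scroll"
    x: Optional[int] = None        # pointer coordinates, if applicable
    y: Optional[int] = None
    text: Optional[str] = None     # payload for "type" actions


@dataclass
class ToolCall:
    """A high-level programmatic invocation of a registered tool."""
    name: str                      # hypothetical tool name, e.g. "calendar.create_event"
    arguments: dict = field(default_factory=dict)


# The policy selects from the union of both action families at every step.
HybridAction = Union[GUIAction, ToolCall]


def execute(action: HybridAction) -> str:
    """Dispatch a hybrid action to the matching executor (stubbed out here)."""
    if isinstance(action, ToolCall):
        return f"tool:{action.name}({action.arguments})"
    return f"gui:{action.kind}@({action.x},{action.y})"


if __name__ == "__main__":
    # One step may be a single click; the next may be one tool call that
    # replaces a long, error-prone chain of GUI primitives.
    print(execute(GUIAction(kind="click", x=120, y=340)))
    print(execute(ToolCall(name="calendar.create_event",
                           arguments={"title": "standup", "time": "09:00"})))
```

In this framing, the intelligent action selection described in the abstract amounts to the policy choosing, per step, which branch of the union to emit, so a single tool call can shortcut many brittle GUI steps while GUI primitives remain available as a fallback.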