Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x, y), which precludes free-form, closed-loop trajectories (e.g., dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model to serve as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training Data and a Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g., PowerPoint, Adobe Premiere Pro) and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g., Operator scores 13.27, and the best, Gemini-2.5-CUA, reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.
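To make design (ii) concrete, the sketch below illustrates the general recipe behind flow-based action generation: a lightweight action expert trained with conditional flow matching to map visual features to incremental cursor deltas, then sampled with a few Euler integration steps at inference time. This is a minimal illustration under assumed names and dimensions (`ActionExpert`, `obs_dim`, the linear interpolation path, the step count), not the actual ShowUI-$π$ implementation.

```python
# Minimal conditional flow-matching sketch for drag-action generation.
# All names, dimensions, and hyperparameters here are illustrative
# assumptions, not the ShowUI-pi codebase.
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Lightweight MLP predicting a velocity field over 2-D cursor deltas."""
    def __init__(self, obs_dim=512, act_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, noisy_act, t):
        # obs: (B, obs_dim) visual features; noisy_act: (B, 2); t: (B, 1).
        return self.net(torch.cat([obs, noisy_act, t], dim=-1))

def flow_matching_loss(model, obs, target_act):
    """Linear-path flow matching: regress the constant velocity
    (target - noise) along the interpolant a_t = (1 - t) * noise + t * target."""
    noise = torch.randn_like(target_act)
    t = torch.rand(target_act.size(0), 1)
    a_t = (1 - t) * noise + t * target_act
    v_pred = model(obs, a_t, t)
    return nn.functional.mse_loss(v_pred, target_act - noise)

@torch.no_grad()
def sample_action(model, obs, steps=10):
    """Euler integration from noise to a cursor delta (dx, dy)."""
    a = torch.randn(obs.size(0), 2)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.size(0), 1), i * dt)
        a = a + dt * model(obs, a, t)
    return a  # incremental cursor adjustment for the current control tick
```

In a closed loop of this kind, the agent would re-encode the screen at each control tick and sample the next delta, which is what allows continuous, on-the-fly adjustment during a drag rather than committing to a single (x, y) click in advance.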