Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal-conditioned policies often struggle with long-horizon manipulation due to their reliance on single-step action prediction without explicit modeling of task progress. We propose Act2Goal, a general goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long-horizon structure. To translate this visual plan into robust execution, we introduce Multi-Scale Temporal Hashing (MSTH), which decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances. Act2Goal achieves strong zero-shot generalization to novel objects, spatial layouts, and environments. We further enable reward-free online adaptation through hindsight goal relabeling with LoRA-based finetuning, allowing rapid autonomous improvement without external supervision. Real-robot experiments demonstrate that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction, validating that goal-conditioned world models with multi-scale temporal control provide the structured guidance necessary for robust long-horizon manipulation. Project page: https://act2goal.github.io/
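To picture the MSTH decomposition, the following minimal sketch selects frame indices from an imagined trajectory: a dense, consecutive proximal window plus sparse distal anchors running out to the goal frame. The abstract only specifies dense-proximal/sparse-distal; the geometric spacing, the function name `msth_select`, and the window sizes below are our assumptions, not the paper's implementation.

```python
import numpy as np

def msth_select(num_frames: int, num_proximal: int = 4, num_distal: int = 4) -> np.ndarray:
    """Select frame indices from an imagined trajectory of length num_frames:
    a dense proximal window for fine-grained closed-loop control, plus sparse
    distal anchors (geometrically spaced here, an assumption) up to the goal frame."""
    proximal = np.arange(min(num_proximal, num_frames))        # frames t+1 .. t+k, dense
    distal = np.geomspace(max(num_proximal, 1), num_frames - 1,
                          num=num_distal).astype(int)          # sparse anchors toward the goal
    distal = np.clip(distal, 0, num_frames - 1)
    return np.unique(np.concatenate([proximal, distal]))       # e.g. num_frames=24 -> [0 1 2 3 4 7 12 23]
```

Under this scheme the proximal indices change every control step while the distal anchors drift slowly, which matches the abstract's division of labor: local reactivity from the dense frames, global task consistency from the sparse ones.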
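The coupling to motor control can be illustrated as learned action queries cross-attending to the selected MSTH frame tokens. The sketch below is a generic rendition of that pattern in plain PyTorch; the dimensions, query count, and class name are hypothetical, since the abstract states only that the coupling is end-to-end cross-attention.

```python
import torch
import torch.nn as nn

class GoalConditionedActor(nn.Module):
    """Learned action queries cross-attend to MSTH frame tokens (proximal + distal),
    so each predicted action can read both local and global plan context."""
    def __init__(self, dim: int = 256, action_dim: int = 7, num_queries: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, action_dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_selected_frames, dim), one token per MSTH frame
        q = self.queries.expand(frame_tokens.size(0), -1, -1)
        ctx, _ = self.cross_attn(q, frame_tokens, frame_tokens)
        return self.head(ctx)                     # (batch, num_queries, action_dim)
```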
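The reward-free adaptation loop combines two standard ideas: hindsight relabeling treats the final observation a rollout actually reached as its goal, so every autonomous trajectory (success or failure) becomes valid goal-conditioned supervision, and LoRA restricts finetuning to small low-rank adapters. The sketch below is a generic illustration of both, assuming a simple (observations, actions) trajectory format and plain PyTorch; the names and the adapter placement are ours, as the abstract gives no implementation details.

```python
import torch
import torch.nn as nn

def hindsight_relabel(observations, actions):
    """Reward-free relabeling: take the final observation actually reached as
    the goal, turning any rollout into a goal-conditioned training example."""
    goal = observations[-1]
    return [(obs, goal, act) for obs, act in zip(observations[:-1], actions)]

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update (LoRA).
    Only A and B are optimized online, keeping the finetuning footprint
    small enough for minutes-scale autonomous improvement."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # pretrained weights stay frozen
        self.A = nn.Parameter(torch.zeros(base.out_features, rank))   # zero-init: update starts at 0
        self.B = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)
```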