Manipulation in cluttered environments is challenging due to spatial dependencies among objects, where an improper manipulation order can cause collisions or blocked access. Existing approaches often overlook these spatial relationships, limiting their flexibility and scalability. To address these limitations, we propose OrderMind, a unified spatial-aware manipulation ordering framework that directly learns object manipulation priorities based on spatial context. Our architecture integrates a spatial context encoder with a temporal priority structuring module. We construct a spatial graph using k-Nearest Neighbors to aggregate geometric information from the local layout, and encode both object-object and object-manipulator interactions to support accurate manipulation ordering in real time. To generate physically and semantically plausible supervision signals, we introduce a spatial prior labeling method that guides a vision-language model to produce reasonable manipulation orders for distillation. We evaluate OrderMind on our Manipulation Ordering Benchmark, comprising 163,222 samples of varying difficulty. Extensive experiments in both simulation and real-world environments demonstrate that our method significantly outperforms prior approaches in effectiveness and efficiency, enabling robust manipulation in cluttered scenes.
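To make the graph-construction step concrete, the following is a minimal sketch of building a k-Nearest-Neighbors graph over object centroids, the structure over which the spatial context encoder described above would aggregate geometric information. The function name, the use of 2D centroids, and the plain Euclidean distance metric are illustrative assumptions, not the paper's implementation.

```python
import math

def knn_spatial_graph(centroids, k=2):
    """Illustrative sketch: connect each object to its k nearest
    neighbors by Euclidean distance. The paper's encoder would then
    aggregate geometric features along these edges; here we only
    return the neighbor indices themselves (hypothetical helper,
    not the authors' code)."""
    n = len(centroids)
    graph = {}
    for i in range(n):
        # Distance from object i to every other object.
        dists = [(math.dist(centroids[i], centroids[j]), j)
                 for j in range(n) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Toy tabletop layout: three object centroids (x, y in metres).
centroids = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
print(knn_spatial_graph(centroids, k=1))  # → {0: [1], 1: [0], 2: [1]}
```

In practice the neighbor count k trades off local detail against context breadth: a small k keeps edges to immediately adjacent (and thus potentially blocking) objects, which is the spatial dependency the ordering model must respect.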