Action chunking is a widely adopted approach in Learning from Demonstration (LfD). By modeling multi-step action chunks rather than single-step actions, action chunking significantly enhances the ability to model human expert policies. However, the reduced decision frequency limits the use of recent observations and degrades reactivity, which is particularly evident in poor adaptation to sensor noise and dynamic environmental changes. Existing efforts to address this issue have largely traded reactivity against decision consistency rather than achieving both. To address this limitation, we propose a novel algorithm, Temporal Action Selector (TAS), which caches predicted action chunks from multiple timesteps and dynamically selects the optimal action through a lightweight selector network. TAS achieves balanced optimization across three critical dimensions: reactivity, decision consistency, and motion coherence. Experiments across multiple tasks with diverse base policies show that TAS significantly improves success rates, yielding absolute gains of up to 73.3%. Furthermore, using TAS as a base policy for residual reinforcement learning (RL) substantially improves training efficiency and raises the performance ceiling. Experiments in simulation and on physical robots confirm the method's efficacy.
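As a rough illustration of the caching-and-selection idea described above, the minimal PyTorch sketch below keeps the last few predicted action chunks in a buffer and uses a small scorer network to pick one candidate action for the current timestep. The class names (`ActionChunkCache`, `TemporalActionSelector`), cache size, and scoring scheme are illustrative assumptions, not the paper's actual architecture or training procedure.

```python
# Sketch of chunk caching plus lightweight action selection.
# All names, shapes, and the MLP scorer are assumptions for illustration.
from collections import deque

import torch
import torch.nn as nn


class ActionChunkCache:
    """Keeps the last `max_chunks` predicted action chunks.

    Each chunk has shape (horizon, action_dim) and was predicted at some
    past timestep; candidates_at(t) gathers every cached action that
    targets the current timestep t.
    """

    def __init__(self, max_chunks: int = 4):
        self.buffer = deque(maxlen=max_chunks)  # (start_step, chunk) pairs

    def add(self, start_step: int, chunk: torch.Tensor) -> None:
        self.buffer.append((start_step, chunk))

    def candidates_at(self, t: int) -> torch.Tensor:
        # Collect the action each cached chunk proposes for timestep t.
        actions = [
            chunk[t - start]
            for start, chunk in self.buffer
            if 0 <= t - start < chunk.shape[0]
        ]
        return torch.stack(actions)  # (num_candidates, action_dim)


class TemporalActionSelector(nn.Module):
    """Hypothetical lightweight scorer over cached candidate actions."""

    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        # Score every candidate against the current observation and
        # return the highest-scoring action.
        obs_rep = obs.expand(candidates.shape[0], -1)
        scores = self.scorer(torch.cat([obs_rep, candidates], dim=-1)).squeeze(-1)
        return candidates[scores.argmax()]
```

At execution time, the base policy would periodically add a fresh chunk to the cache, and at every control step the selector would choose among the actions that overlapping chunks propose for that step; how the selector is trained is beyond the scope of this sketch.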