Task-oriented dexterous grasping has broad application prospects in robotic manipulation and human-object interaction. However, most existing methods still struggle to generalize across diverse objects and task instructions because they rely heavily on costly labeled data to ensure task-specific semantic alignment. In this study, we propose \textbf{ZeroDexGrasp}, a zero-shot task-oriented dexterous grasp synthesis framework that integrates Multimodal Large Language Models (MLLMs) with grasp refinement to generate human-like grasp poses well aligned with specific task objectives and object affordances. Specifically, ZeroDexGrasp employs prompt-based multi-stage semantic reasoning to infer initial grasp configurations and object contact information from task and object semantics, and then applies contact-guided grasp optimization to refine these poses for physical feasibility and task alignment. Experimental results demonstrate that ZeroDexGrasp achieves high-quality zero-shot dexterous grasping on diverse unseen object categories and under complex task requirements, advancing toward more generalizable and intelligent robotic grasping.
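To make the contact-guided optimization stage concrete, the minimal sketch below (not the authors' implementation; all function and variable names are hypothetical) illustrates one way such refinement can be realized: fingertip positions are pulled toward MLLM-proposed contact points by gradient descent, while a point-cloud clearance term serves as a crude proxy for a penetration penalty.

```python
# Hypothetical sketch of contact-guided grasp refinement. All names
# (refine_grasp, contact_targets, ...) are illustrative, not the paper's API.
import torch

def refine_grasp(contact_targets, object_points, init_palm, init_offsets,
                 steps=200, lr=1e-2, w_contact=1.0, w_pen=10.0):
    """Refine a simplified fingertip-parameterized grasp by gradient descent:
    pull fingertips toward MLLM-proposed contact points while penalizing
    fingertips that sink below a small clearance from the object surface
    (a crude point-cloud proxy for penetration)."""
    palm = init_palm.clone().requires_grad_(True)        # (3,) palm translation
    offsets = init_offsets.clone().requires_grad_(True)  # (F, 3) fingertip offsets
    opt = torch.optim.Adam([palm, offsets], lr=lr)
    eps = 0.005                                          # clearance threshold (m)
    for _ in range(steps):
        opt.zero_grad()
        tips = palm + offsets                            # (F, 3) fingertip positions
        # Contact alignment: squared distance to the proposed contact points.
        e_contact = ((tips - contact_targets) ** 2).sum()
        # Penetration proxy: penalize fingertips closer than eps
        # to their nearest object surface point.
        nearest = torch.cdist(tips, object_points).min(dim=1).values  # (F,)
        e_pen = torch.relu(eps - nearest).sum()
        loss = w_contact * e_contact + w_pen * e_pen
        loss.backward()
        opt.step()
    return palm.detach(), offsets.detach()

# Example: three fingertips refined toward contacts on a synthetic point cloud.
contacts = torch.tensor([[ 0.02, 0.00, 0.10],
                         [-0.02, 0.00, 0.10],
                         [ 0.00, 0.03, 0.08]])
obj = 0.1 * torch.rand(2048, 3)                          # stand-in object surface
palm0 = torch.zeros(3)
off0 = contacts + 0.02 * torch.randn_like(contacts)      # perturbed initialization
palm, off = refine_grasp(contacts, obj, palm0, off0)
```

A full system would optimize dexterous-hand joint angles through forward kinematics and include additional task-alignment terms; this sketch conveys only the energy-minimization structure of the refinement step.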