The recently released SAM 3 and SAM 3D introduce significant advances over their predecessor, SAM 2, particularly through the integration of language-based segmentation and enhanced 3D perception. SAM 3 supports zero-shot segmentation across a range of prompt types, including point, bounding box, and language prompts, allowing more flexible and intuitive interaction with the model. In this empirical evaluation, we assess SAM 3 in robot-assisted surgery: we benchmark its zero-shot segmentation with point and bounding box prompts, examine its effectiveness in dynamic video tracking, and evaluate its newly introduced language-prompt segmentation. Although language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. We also investigate SAM 3D's depth reconstruction abilities, demonstrating that it can process surgical scene data and reconstruct 3D anatomical structures from 2D images. In comprehensive tests on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts. Zero-shot evaluations of SAM 3D on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, but also reveal remaining limitations in complex, highly dynamic surgical scenes.