The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos, capable of storing object-aware memories and propagating them across time through its memory blocks. While SAM2 excels at video object segmentation, producing dense segmentation masks from prompts, extending it to dense Video Semantic Segmentation (VSS) poses challenges due to the need for spatial accuracy, temporal consistency, and the ability to track multiple objects with complex boundaries and varying scales. This paper explores the extension of SAM2 to VSS, focusing on two primary approaches and highlighting firsthand observations and common challenges encountered along the way. The first approach uses SAM2 to extract individual objects as masks from a given image, with a segmentation network employed in parallel to generate and refine initial predictions. The second approach uses the predicted masks to extract per-object feature vectors, which are fed into a simple classification network; the resulting class labels are then combined with the masks to produce the final segmentation. Our experiments suggest that leveraging SAM2 improves overall VSS performance, primarily owing to its precise prediction of object boundaries.
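To make the second approach concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of how SAM2-predicted object masks could be used to pool one feature vector per object, classify it with a small network, and paint the class labels back into the masks to form a semantic map. All names here (`MaskClassifier`, `masks_to_semantic_map`, the feature dimensions) are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn

class MaskClassifier(nn.Module):
    """Small MLP that assigns a semantic class to each mask-pooled feature vector (assumed design)."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)

def masks_to_semantic_map(feature_map: torch.Tensor,
                          masks: torch.Tensor,
                          classifier: MaskClassifier) -> torch.Tensor:
    """Pool one feature vector per mask, classify it, and write the label into the mask's pixels.

    feature_map: (C, H, W) dense features from a segmentation backbone.
    masks:       (N, H, W) binary masks for the N objects proposed by SAM2.
    Returns an (H, W) map of predicted class indices (0 = unassigned).
    """
    C, H, W = feature_map.shape
    semantic = torch.zeros(H, W, dtype=torch.long)
    if masks.numel() == 0:
        return semantic
    # Average-pool the backbone features inside each mask to get one vector per object.
    flat_feats = feature_map.reshape(C, -1)          # (C, H*W)
    flat_masks = masks.reshape(masks.shape[0], -1)   # (N, H*W), values in {0, 1}
    area = flat_masks.sum(dim=1, keepdim=True).clamp(min=1)
    pooled = (flat_masks.float() @ flat_feats.t()) / area   # (N, C)
    # Classify each pooled vector and paint its label back into the corresponding mask.
    labels = classifier(pooled).argmax(dim=1)        # (N,)
    for mask, label in zip(masks, labels):
        semantic[mask.bool()] = label.item()
    return semantic

# Toy usage with random tensors standing in for real backbone features and SAM2 masks.
if __name__ == "__main__":
    classifier = MaskClassifier(feat_dim=64, num_classes=19)
    feats = torch.randn(64, 128, 128)
    masks = torch.rand(5, 128, 128) > 0.8
    print(masks_to_semantic_map(feats, masks, classifier).shape)  # torch.Size([128, 128])
```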