Few-shot semantic segmentation has attracted growing interest for its ability to generalize to novel object categories from only a few annotated samples. To address data scarcity, recent methods incorporate multiple foundation models to improve feature transferability and segmentation performance. However, they often rely on dual-branch architectures that combine pre-trained encoders to exploit complementary strengths, a design that limits flexibility and efficiency. This raises a fundamental question: can we build a unified model that integrates knowledge from different foundation architectures? Doing so is challenging because class-agnostic segmentation capabilities and fine-grained discriminative representations are misaligned. To this end, we present UINO-FSS, a novel framework built on the key observation that early-stage DINOv2 features exhibit distribution consistency with SAM's output embeddings. This consistency allows the knowledge of both models to be integrated into a single-encoder architecture via coarse-to-fine multimodal distillation. Our segmenter consists of three core components: a bottleneck adapter for embedding alignment, a meta-visual prompt generator that leverages dense similarity volumes and semantic embeddings, and a mask decoder. Using hierarchical cross-model distillation, we transfer SAM's knowledge into the segmenter, which is further enhanced by Mamba-based 4D correlation mining on support-query pairs. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ show that UINO-FSS achieves new state-of-the-art results under the 1-shot setting, with mIoU of 80.6 (+3.8%) on PASCAL-5$^i$ and 64.5 (+4.1%) on COCO-20$^i$, demonstrating the effectiveness of our unified approach.
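To make the notion of a dense similarity volume over a support-query pair concrete, the following is a minimal NumPy sketch of a 4D cosine-correlation volume. It is an illustration under stated assumptions, not the UINO-FSS implementation: the function name `correlation_volume`, the optional support-mask filtering, and the clamping of negative similarities to zero are hypothetical choices borrowed from common few-shot segmentation practice, and the abstract does not specify how the volume is actually built or how the Mamba-based mining consumes it.

```python
import numpy as np


def correlation_volume(query_feat, support_feat, support_mask=None, eps=1e-8):
    """Dense cosine-similarity volume between two feature maps (illustrative only).

    query_feat:   array of shape (C, Hq, Wq)
    support_feat: array of shape (C, Hs, Ws)
    support_mask: optional binary array of shape (Hs, Ws); background
                  support positions are zeroed out (an assumption, not
                  something stated in the abstract)
    returns:      array of shape (Hq, Wq, Hs, Ws)
    """
    if support_mask is not None:
        # Keep only foreground support features.
        support_feat = support_feat * support_mask[None]

    # L2-normalize along the channel axis so dot products become cosines.
    q = query_feat / (np.linalg.norm(query_feat, axis=0, keepdims=True) + eps)
    s = support_feat / (np.linalg.norm(support_feat, axis=0, keepdims=True) + eps)

    # Pair every query position (h, w) with every support position (i, j).
    corr = np.einsum("chw,cij->hwij", q, s)

    # Assumed design choice: suppress negative similarities.
    return np.clip(corr, 0.0, None)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=(8, 4, 5))    # toy query features
    support = rng.normal(size=(8, 3, 6))  # toy support features
    volume = correlation_volume(query, support)
    print(volume.shape)
```

Each entry of the resulting volume scores how similar one query location is to one support location, which is the kind of pairwise structure that a 4D correlation module would then aggregate.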