Unsupervised video segmentation is a challenging computer vision task, especially due to the lack of supervisory signals and the complexity of visual scenes. To overcome this challenge, state-of-the-art models based on slot attention often rely on large and computationally expensive neural architectures. To address this issue, we propose a simple knowledge distillation framework that effectively transfers object-centric representations to a lightweight student. The proposed framework, called SlotMatch, aligns corresponding teacher and student slots via cosine similarity, requiring no additional distillation objectives or auxiliary supervision. The simplicity of SlotMatch is supported by theoretical and empirical evidence, both indicating that integrating additional losses is redundant. We conduct experiments on three datasets to compare the state-of-the-art teacher model, SlotContrast, with our distilled student. The results show that our student based on SlotMatch matches and even outperforms its teacher, while using 3.6x fewer parameters and running up to 2.7x faster. Moreover, our student surpasses all other state-of-the-art unsupervised video segmentation models.
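To make the alignment objective concrete, the snippet below is a minimal sketch of a cosine-similarity slot-matching loss in PyTorch. The function name `slotmatch_loss` and its assumptions are illustrative, not taken from the paper: it presumes the teacher and student produce the same number of slots in corresponding order and with the same slot dimension (a lightweight student with a smaller slot dimension would additionally need a learned projection before the comparison).

```python
import torch
import torch.nn.functional as F

def slotmatch_loss(student_slots: torch.Tensor,
                   teacher_slots: torch.Tensor) -> torch.Tensor:
    """Align corresponding student and teacher slots via cosine similarity.

    Both tensors are assumed to have shape (batch, num_slots, slot_dim),
    with slot i of the student corresponding to slot i of the teacher.
    This is an illustrative sketch, not the paper's reference code.
    """
    # Stop gradients through the teacher: only the student is trained.
    teacher_slots = teacher_slots.detach()
    # Cosine similarity along the feature dimension -> (batch, num_slots).
    sim = F.cosine_similarity(student_slots, teacher_slots, dim=-1)
    # 1 - cos is zero when each student slot is perfectly aligned
    # with its teacher counterpart; average over batch and slots.
    return (1.0 - sim).mean()
```

Because the loss depends only on slot directions, it distills the teacher's object-centric structure without forcing the student to reproduce slot magnitudes, which is consistent with the claim that no additional distillation objectives are needed.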