CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses. Recent studies show that insufficient training, caused by the lack of large-scale sign language datasets, is the main bottleneck for SLR. Most SLR works therefore adopt pretrained visual modules and follow one of two mainstream solutions. Multi-stream architectures extend multi-cue visual features, yielding the current SOTA performance, but they require complex designs and may introduce noise. Alternatively, advanced single-cue SLR frameworks that use explicit cross-modal alignment between the visual and textual modalities are simple and effective, and potentially competitive with multi-cue frameworks. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Building on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) that carries pretrained contextual knowledge, and we introduce the complete pretrained language module. The VAE implicitly aligns the visual and textual modalities while benefiting from pretrained contextual knowledge in the role of the traditional contextual module. Meanwhile, a contrastive cross-modal alignment algorithm is proposed to further enhance the explicit consistency constraints. Extensive experiments on the two most popular public datasets, PHOENIX-2014 and PHOENIX-2014T, demonstrate that the proposed SLR framework not only consistently outperforms existing single-cue methods but also outperforms SOTA multi-cue methods.
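The abstract does not specify the form of the contrastive cross-modal alignment objective. As an illustration only, the sketch below shows one common way to impose such a consistency constraint: a symmetric InfoNCE-style loss over paired visual and textual feature vectors. This is an assumed, generic formulation and not necessarily the algorithm used in CVT-SLR; the function names, the `temperature` value, and the toy features are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def contrastive_alignment_loss(visual, textual, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not the paper's exact loss).

    visual[i] and textual[i] are treated as a matched pair; every other
    pairing in the batch acts as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in textual]
    n = len(v)
    # Cosine similarity matrix scaled by the temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Visual-to-text direction: each row should peak on its diagonal entry.
    v2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-visual direction: each column should peak on its diagonal entry.
    t2v = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


# Toy features (hypothetical): two visual vectors and two textual vectors.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_alignment_loss(visual_feats, aligned_text)
loss_swapped = contrastive_alignment_loss(visual_feats, swapped_text)
```

In this toy setting the loss is near zero when each visual vector matches its own textual vector and large when the pairs are swapped, which is the behavior an explicit consistency constraint is meant to encourage.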
