Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, until recently it had traditionally been studied in isolation, under the assumption that the video of a single speaking face matches the audio; selecting the active speaker at inference time, when multiple people are on screen, was set aside as a separate problem. As an alternative, recent work has proposed addressing the two problems simultaneously with an attention mechanism, baking speaker selection directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face, even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Second, we show under closer scrutiny that an end-to-end model performs at least as well as a considerably larger two-step system that uses a hard decision boundary, under various noise conditions and numbers of parallel face tracks.
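To make the idea of a fully differentiable speaker selection concrete, the following is a minimal NumPy sketch of attention over parallel face tracks, where per-frame audio features act as queries and per-track visual features as keys and values. This is only an illustration under assumed shapes and a scaled dot-product score; the function and variable names are hypothetical and not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_speaker_selection(audio_feats, face_track_feats):
    """Soft (differentiable) selection of the speaking face via attention.

    audio_feats:      (T, d)     per-frame audio encoder outputs (queries)
    face_track_feats: (N, T, d)  per-frame visual features for N parallel face tracks
    Returns fused visual features (T, d) and attention weights (T, N).
    """
    d = audio_feats.shape[-1]
    # Score each face track against the audio at every frame (scaled dot product).
    scores = np.einsum('td,ntd->tn', audio_feats, face_track_feats) / np.sqrt(d)
    # Softmax over tracks: a soft speaker-selection decision instead of a hard pick.
    weights = softmax(scores, axis=-1)
    # Weighted sum of track features feeds the downstream AV-ASR decoder.
    fused_visual = np.einsum('tn,ntd->td', weights, face_track_feats)
    return fused_visual, weights

# Toy usage: 3 candidate face tracks, 100 frames, 64-dim features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 64))
faces = rng.standard_normal((3, 100, 64))
fused, att = soft_speaker_selection(audio, faces)
print(fused.shape, att.shape)  # (100, 64) (100, 3)
```

Because the selection is a softmax rather than a hard decision boundary, the attention weights can be trained end-to-end from the ASR loss alone, which is how the audio-to-speaking-face association can emerge without explicit supervision.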