This paper presents the Speech Technology Center (STC) speaker recognition (SR) systems submitted to the VOiCES From a Distance challenge 2019. The challenge's SR task focuses on speaker recognition in single-channel distant/far-field audio under noisy conditions. In this work we investigate different deep neural network architectures for speaker embedding extraction to solve the task. We show that deep networks with residual frame-level connections outperform shallower architectures. A simple energy-based speech activity detector (SAD) and an automatic speech recognition (ASR) based SAD are investigated in this work. We also address the problem of data preparation for training robust embedding extractors. Reverberation for data augmentation was performed using an automatic room impulse response generator. In our systems we used a discriminatively trained cosine similarity metric learning model as the embedding backend, and a score normalization procedure was applied to each individual subsystem. Our final submitted systems were based on the fusion of different subsystems. The results obtained on the VOiCES development and evaluation sets demonstrate the effectiveness and robustness of the proposed systems when dealing with distant/far-field audio under noisy conditions.
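The abstract mentions a cosine similarity backend combined with score normalization. As a minimal sketch of these two generic components (not the authors' trained metric-learning model), the following shows plain cosine scoring between embeddings and symmetric score normalization (S-norm) against cohort score distributions; the function names and cohort handling are illustrative assumptions, not part of the paper.

```python
import numpy as np

def cosine_score(e1, e2):
    # Cosine similarity between two speaker embedding vectors.
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def s_norm(score, enroll_cohort_scores, test_cohort_scores):
    # Symmetric score normalization (S-norm): average of the raw score
    # z-normalized against the enrollment-side and test-side cohort
    # score distributions. Cohort arrays hold scores of each utterance
    # against a set of impostor cohort utterances.
    zn = (score - enroll_cohort_scores.mean()) / enroll_cohort_scores.std()
    tn = (score - test_cohort_scores.mean()) / test_cohort_scores.std()
    return 0.5 * (zn + tn)
```

In practice the normalized scores of each subsystem would then be combined in a score-level fusion, as the abstract describes for the final submissions.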