Public sources such as parliamentary session recordings and transcripts provide ever-growing material for the training and evaluation of automatic speech recognition (ASR) systems. In this paper, we publish and analyse the Finnish parliament ASR corpus, the largest publicly available collection of manually transcribed speech data for Finnish, with over 3000 hours of speech and 449 speakers, for whom it provides rich demographic metadata. The corpus builds on earlier initial work, and as a result it splits naturally into two training subsets from two time periods. Similarly, there are two official, corrected test sets covering different times, which sets up an ASR task with longitudinal distribution-shift characteristics. An official development set is also provided. We develop a complete Kaldi-based data preparation pipeline, together with hidden Markov model (HMM), hybrid deep neural network (HMM-DNN), and attention-based encoder-decoder (AED) ASR recipes. We set benchmarks on the official test sets as well as on multiple other recently used test sets. Both temporal corpus subsets are already large, and we observe that beyond their scale, ASR performance on the official test sets plateaus, whereas other domains benefit from added data. The HMM-DNN and AED approaches are compared in a carefully matched equal-data setting, with the HMM-DNN system consistently performing better. Finally, the variation in ASR accuracy is compared across the speaker categories available in the parliament metadata to detect potential biases based on factors such as gender, age, and education.
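As an illustration of the per-category comparison mentioned above, the following minimal sketch computes an error rate separately for each speaker group. It assumes word error rate (WER) as the accuracy measure, which the abstract does not specify. The function names, the group labels, and the toy utterances are all hypothetical and are not taken from the corpus or its recipes.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(utterances):
    """Return {group: WER}, where WER = total edits / total reference words.

    Assumption: each item is (group label, reference text, hypothesis text),
    with whitespace-separated words.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in utterances:
        ref, hyp = reference.split(), hypothesis.split()
        errors[group] += edit_distance(ref, hyp)
        words[group] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical toy data: (speaker category, reference, ASR hypothesis).
toy = [
    ("female", "arvoisa puhemies kiitos", "arvoisa puhemies kiitos"),
    ("male", "arvoisa puhemies kiitos", "arvoisa puhe mies kiitos"),
    ("male", "hyvät edustajat", "hyvät edustajat"),
]
result = wer_by_group(toy)
print(result)
```

Pooling edits and reference words within each group, rather than averaging per-utterance rates, keeps short utterances from dominating a group's score. A real comparison would also need to account for how much speech each category contributes.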