Automatic Speech Recognition (ASR) systems are evaluated using Word Error Rate (WER), which is computed by aligning the system's transcription with a ground-truth transcript and counting the word-level errors between them. This calculation, however, requires manual transcription of the speech signal to obtain the ground truth. Since transcribing audio is costly, Automatic WER Evaluation (e-WER) methods have been developed to predict the WER of a speech system using only the ASR transcription and features of the speech signal. Although WER is a continuous variable, previous work has shown that posing e-WER as a classification problem is more effective than posing it as regression. When converted to a classification setting, however, these approaches suffer from heavy class imbalance. In this paper, we propose a new balanced paradigm for e-WER in a classification setting. Within this paradigm, we also propose WER-BERT, a BERT-based architecture that incorporates speech features for e-WER. Furthermore, we introduce a distance loss function to account for the ordinal nature of e-WER classification. The proposed approach and paradigm are evaluated on the Librispeech dataset and on a commercial (black-box) ASR system, Google Cloud's Speech-to-Text API. The experimental results demonstrate that WER-BERT establishes a new state of the art in automatic WER estimation.
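For concreteness, the reference-based WER that e-WER methods aim to predict can be sketched as follows. This is a minimal illustration of the standard definition (word-level edit distance divided by the number of reference words), not the evaluation code used in this work; the function name `word_error_rate` and the whitespace tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length.

    Assumption for this sketch: words are obtained by whitespace splitting,
    with no text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, this quantity can exceed 1 when the hypothesis is much longer than the reference, which is one reason WER is treated as a continuous and unbounded target.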