Variation in speech is often represented and investigated using phonetic transcriptions, but transcribing speech is time-consuming and error-prone. To create reliable representations of speech independent of phonetic transcriptions, we investigate the extraction of acoustic embeddings from several self-supervised neural models. We use these representations to compute word-based pronunciation differences between non-native and native speakers of English, and evaluate these differences by comparing them with human native-likeness judgments. We show that Transformer-based speech representations lead to significant performance gains over the use of phonetic transcriptions, and find that feature-based use of Transformer models is most effective with one or more middle layers instead of the final layer. We also demonstrate that these neural speech representations not only capture segmental differences, but also intonational and durational differences that cannot be represented by a set of discrete symbols used in phonetic transcriptions.
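The abstract does not specify how word-based pronunciation differences are computed from embedding sequences. A common approach for comparing variable-length acoustic sequences is dynamic time warping (DTW) over per-frame embeddings; the sketch below illustrates that idea with a frame-wise Euclidean distance and length normalization. Both the DTW alignment and the distance function here are illustrative assumptions, not the paper's confirmed method.

```python
import numpy as np

def dtw_distance(a, b):
    """Length-normalized dynamic time warping cost between two embedding
    sequences a (n, d) and b (m, d), using Euclidean frame distance.
    A lower value indicates more similar pronunciations."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible alignment steps.
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Normalize by the sequence lengths so longer words do not
    # automatically receive larger pronunciation differences.
    return cost[n, m] / (n + m)

# Toy check: a sequence aligned with itself has zero cost.
x = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
print(dtw_distance(x, x))  # 0.0
```

In a setting like the one the abstract describes, `a` and `b` would be middle-layer Transformer embeddings extracted for the same word spoken by a non-native and a native speaker, and the resulting per-word distances could then be correlated with human native-likeness judgments.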