We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes the code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We fine-tune Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. On two real-world test sets, fine-tuning with Kunkado reduces WER from 44.47\% to 37.12\% on one and from 36.07\% to 32.33\% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.