We introduce an unsupervised approach for correcting highly imperfect speech transcriptions, based on a decision-level fusion of stemming and two-way phoneme pruning. Transcripts are acquired from videos by extracting the audio with the FFmpeg framework and converting it to text with the Google API. The benchmark LRW dataset contains 500 word categories, with 50 videos per class in mp4 format. Each video consists of 29 frames (1.16 s in total), and the target word appears in the middle of the video. Our approach aims to improve on the baseline accuracy of 9.34% using stemming, phoneme extraction, filtering, and pruning. Applying the stemming algorithm to the text transcripts raises word recognition accuracy to 23.34%. To convert words to phonemes, we use the Carnegie Mellon University (CMU) Pronouncing Dictionary, which provides a phonetic mapping of English words to their pronunciations. We propose a two-way phoneme pruning that comprises two non-sequential steps: 1) filtering and pruning the phonemes containing vowels and plosives; 2) filtering and pruning the phonemes containing vowels and fricatives. After obtaining the results of stemming and two-way phoneme pruning, we apply decision-level fusion, which improves the word recognition rate to up to 32.96%.
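The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `LEXICON` (hand-written ARPAbet entries in the style of the CMU dictionary), the crude `simple_stem` suffix stripper, the exact-match rule on pruned phoneme sequences, and the voting rule in `fuse` are all assumptions made for illustration.

```python
from collections import Counter

# ARPAbet phoneme classes, with stress digits removed.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}
PLOSIVES = {"P", "B", "T", "D", "K", "G"}
FRICATIVES = {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH"}

# Hypothetical toy lexicon; a real run would load the CMU dictionary instead.
LEXICON = {
    "about": ["AH0", "B", "AW1", "T"],
    "press": ["P", "R", "EH1", "S"],
    "pressing": ["P", "R", "EH1", "S", "IH0", "NG"],
    "spend": ["S", "P", "EH1", "N", "D"],
    "spent": ["S", "P", "EH1", "N", "T"],
}


def prune(phonemes, keep):
    """Drop stress digits, then keep only phonemes in the `keep` set."""
    stripped = (p.rstrip("012") for p in phonemes)
    return tuple(p for p in stripped if p in keep)


def match_by_pruning(word, classes, keep):
    """Classes whose pruned phoneme sequence equals that of `word` (assumed rule)."""
    if word not in LEXICON:
        return []
    key = prune(LEXICON[word], keep)
    if not key:
        return []
    return [c for c in classes if prune(LEXICON[c], keep) == key]


def simple_stem(word):
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if suffix == "s" and word.endswith("ss"):
            continue  # keep words such as "press" intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def match_by_stem(word, classes):
    """Classes that share a stem with `word`."""
    stem = simple_stem(word)
    return [c for c in classes if simple_stem(c) == stem]


def fuse(word, classes):
    """Decision-level fusion as a simple vote over the three paths (assumed rule)."""
    votes = Counter()
    paths = (
        match_by_stem(word, classes),
        match_by_pruning(word, classes, VOWELS | PLOSIVES),
        match_by_pruning(word, classes, VOWELS | FRICATIVES),
    )
    for candidates in paths:
        votes.update(candidates)
    # Ties resolve to the candidate that received a vote first.
    return votes.most_common(1)[0][0] if votes else None


CLASSES = ["about", "press", "spend"]
```

In this toy setting, `fuse("pressing", CLASSES)` recovers `"press"` through the stemming path alone, while `fuse("spent", CLASSES)` recovers `"spend"` only through the vowel-and-fricative path, which is the kind of complementary evidence that motivates fusing the three decisions.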