Phoneme-Based Speech Recognition Driven by Large Language Models and Sampling Marginalization

Recently, the Large Language Model-based Phoneme-to-Grapheme (LLM-P2G) method has shown strong performance in speech recognition and has emerged as a viable alternative to traditional weighted finite-state transducer (WFST) decoding. The framework balances recognition accuracy and system scalability by modeling recognition in two stages: phoneme prediction followed by text generation. However, the existing LLM-P2G is trained with the Top-K Marginalized (TKM) strategy, whose candidate phoneme sequences are generated by beam search; this leads to insufficient path diversity, low training efficiency, and high resource overhead. To address this, this paper proposes a Sampling-K Marginalized (SKM) training strategy, which generates candidate paths by random sampling instead of beam search, improving both marginalized modeling and training efficiency. Experiments on Polish and German datasets show that SKM further improves convergence speed and recognition performance while keeping model complexity unchanged. Comparative experiments with SpeechLLM, a speech recognition approach that couples a projector with a large language model, also show that SKM-driven LLM-P2G holds an advantage in both recognition accuracy and structural simplicity. The study confirms the practical value and application potential of the method for cross-lingual speech recognition systems.
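
The sampling-and-marginalization idea can be illustrated with a minimal sketch. The abstract does not give the exact objective or the form of the phoneme model, so everything below is an assumption made for illustration: the phoneme stage is taken to emit frame-wise posteriors with a CTC-style blank (index 0), K paths are drawn by independent per-frame sampling, and the marginal likelihood of the text is approximated by a plain Monte Carlo average of the per-path scores log P(text | phonemes). The names `sample_paths`, `collapse`, and `skm_loss` are hypothetical and not taken from the paper.

```python
import math
import random

BLANK = 0  # assumed CTC-style blank index (not specified in the abstract)


def collapse(frame_ids):
    """Merge repeated frame labels, then drop blanks (CTC-style collapse)."""
    out = []
    prev = None
    for s in frame_ids:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def sample_paths(posteriors, k, rng):
    """Draw k candidate phoneme sequences by sampling one label per frame.

    posteriors: list of per-frame probability lists over phoneme labels.
    This stands in for the beam search used to build Top-K candidates.
    """
    paths = []
    for _ in range(k):
        frame_ids = [rng.choices(range(len(p)), weights=p)[0] for p in posteriors]
        paths.append(collapse(frame_ids))
    return paths


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a non-empty list."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def skm_loss(log_likelihoods):
    """Negative log of the Monte Carlo marginal: -log((1/K) * sum_k P(y | h_k)).

    log_likelihoods: log P(text | sampled phoneme path k), one value per
    sample, as a P2G model would score them (assumed to be given here).
    """
    k = len(log_likelihoods)
    return -(logsumexp(log_likelihoods) - math.log(k))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy posteriors: 4 frames over labels {0: blank, 1, 2}.
    posteriors = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.6, 0.2, 0.2], [0.1, 0.1, 0.8]]
    for path in sample_paths(posteriors, k=4, rng=rng):
        print(path)
    # Placeholder P2G scores for two sampled paths.
    print(skm_loss([math.log(0.2), math.log(0.6)]))
```

Because each path is drawn independently from the posterior, low-probability hypotheses occasionally appear among the candidates, which is the kind of diversity a beam of near-identical top hypotheses lacks. In a real system the per-path scores would come from the LLM and the loss would be backpropagated through it; this sketch only shows the arithmetic.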
