The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, which are characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, achieving robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. To address architectural scalability, we explore how to efficiently expand model capacity while ensuring adaptive and reliable use of multimodal inputs, developing a framework that intelligently allocates computational resources based on input characteristics. Finally, at the system level, we present methods for expanding the system's functionality through modular integration with large-scale foundation models, leveraging their powerful cognitive and generative capabilities to maximize final recognition accuracy. By systematically providing solutions at each of these three levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system with high reliability in real-world applications.