Subtitles are essential for video accessibility and audience engagement. Modern Automatic Speech Recognition (ASR) systems, built on encoder-decoder neural architectures and trained on massive amounts of data, have progressively reduced transcription errors on standard benchmark datasets. However, their performance in real-world production environments, particularly for non-English content such as long-form Italian videos, remains largely unexplored. This paper presents a case study on developing a professional subtitling system for an Italian media company. To inform our system design, we evaluated four state-of-the-art ASR models (Whisper Large v2, AssemblyAI Universal, Parakeet TDT v3 0.6b, and WhisperX) on a 50-hour dataset of Italian television programs. The study highlights their strengths and limitations, benchmarking their output against that of professional human subtitlers. The findings indicate that, while current models cannot meet the media industry's accuracy requirements for fully autonomous operation, they can serve as highly effective tools for enhancing human productivity. We conclude that a human-in-the-loop (HITL) approach is crucial and present the production-grade, cloud-based infrastructure we designed to support this workflow.