Singing Timbre Popularity Assessment Based on Multimodal Large Foundation Model

Automated singing assessment is crucial for education and entertainment. However, existing systems face two fundamental limitations: they rely on reference tracks, which stifles creative expression, and they reduce complex performances to non-diagnostic scores based solely on pitch and rhythm. We advocate a shift from discriminative to descriptive evaluation and build a complete ecosystem for reference-free, multi-dimensional assessment. First, we introduce Sing-MD, a large-scale dataset annotated by experts across four dimensions: breath control, timbre quality, emotional expression, and vocal technique. Our analysis reveals significant annotation inconsistencies among experts, which challenges the validity of traditional accuracy-based metrics. Second, to address the memory limitations of Multimodal Large Language Models (MLLMs) when analyzing full-length songs, we propose VocalVerse, an efficient hybrid architecture that uses a lightweight acoustic encoder to model global performance features and long-term dependencies. Third, to address the shortcomings of automated metrics, we establish the H-TPR (Human-in-the-loop Tiered Perceptual Ranking) benchmark, which evaluates a model's ability to generate perceptually valid rankings rather than to predict noisy ground-truth scores.

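The abstract does not specify how H-TPR scores a ranking, so the following is only a minimal sketch of one way ranking-based evaluation against tiered human judgments could work. It assumes each performance has a human-assigned tier (a higher integer means better) and a scalar model score, and it measures the fraction of cross-tier pairs that the model orders the same way as the human tiers. The function name `tiered_pairwise_accuracy`, the tier encoding, and the handling of ties are all hypothetical choices made for illustration, not the benchmark's actual definition.

```python
from itertools import combinations


def tiered_pairwise_accuracy(model_scores, human_tiers):
    """Fraction of cross-tier pairs ordered consistently by the model.

    Hypothetical illustration only: a higher tier means a better performance,
    pairs within the same tier are skipped, and model ties count as errors.
    """
    if len(model_scores) != len(human_tiers):
        raise ValueError("model_scores and human_tiers must have equal length")

    agree = 0
    total = 0
    for i, j in combinations(range(len(model_scores)), 2):
        tier_diff = human_tiers[i] - human_tiers[j]
        if tier_diff == 0:
            continue  # same tier: humans express no preference for this pair
        total += 1
        score_diff = model_scores[i] - model_scores[j]
        # The pair is counted as correct when both differences share a sign.
        if score_diff * tier_diff > 0:
            agree += 1

    if total == 0:
        raise ValueError("no cross-tier pairs to compare")
    return agree / total


if __name__ == "__main__":
    # Three performances in tiers 3, 2, 1; the model swaps the top two.
    print(tiered_pairwise_accuracy([0.5, 0.6, 0.1], [3, 2, 1]))
```

Because only cross-tier pairs are compared, a measure of this kind does not penalize the model for disagreeing with fine-grained score differences that annotators themselves disagree on, which is the motivation the abstract gives for ranking-based evaluation.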