Large language models (LLMs) have advanced in text and vision, but their reasoning on audio remains limited. Most existing methods rely on dense audio embeddings, which are difficult to interpret and often fail on structured reasoning tasks. Caption-based approaches, introduced in recent benchmarks such as MMAU, improve performance by translating audio into text, yet they still depend on dense embeddings as input and offer little insight when models fail. We present SAR-LM, a symbolic audio reasoning pipeline that builds on this caption-based paradigm by converting audio into structured, human-readable features spanning speech, sound events, and music. These symbolic inputs support both reasoning and transparent error analysis, allowing failures to be traced to specific features. Across three benchmarks (MMAU, MMAR, and OmniBench), SAR-LM achieves competitive results while offering interpretability as its primary contribution.
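To make the symbolic-input idea concrete, the sketch below illustrates, under our own assumptions rather than the actual SAR-LM implementation, the kind of structured, human-readable feature record such a pipeline might pass to an LLM in place of dense audio embeddings. The class name, field names, and serialization format are hypothetical examples for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SymbolicAudioFeatures:
    """Hypothetical structured, human-readable description of an audio clip."""
    speech_transcript: str = ""                                          # e.g. ASR output
    sound_events: List[Dict[str, float]] = field(default_factory=list)  # e.g. {"label", "start", "end"}
    music_attributes: Dict[str, str] = field(default_factory=dict)      # e.g. {"tempo": "120 bpm"}

    def to_prompt(self) -> str:
        """Serialize the symbolic features into a text block an LLM can reason over."""
        events = "; ".join(
            f'{e["label"]} ({e["start"]:.1f}-{e["end"]:.1f}s)' for e in self.sound_events
        )
        music = ", ".join(f"{k}: {v}" for k, v in self.music_attributes.items())
        return (
            f"Speech: {self.speech_transcript or 'none'}\n"
            f"Sound events: {events or 'none'}\n"
            f"Music: {music or 'none'}"
        )


# Example usage: because the prompt is fully inspectable, a wrong answer can be
# traced back to a specific missing or incorrect feature rather than an opaque embedding.
features = SymbolicAudioFeatures(
    speech_transcript="Watch out, the train is coming!",
    sound_events=[{"label": "train_horn", "start": 0.5, "end": 2.0}],
)
print(features.to_prompt())
```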