Large Audio Language Models (LALMs) have emerged with strong performance across diverse audio understanding tasks and can be further enhanced by neural audio codecs. Transitioning from multi-layer residual vector quantizers to a single-layer quantizer has been shown to make decoding by downstream language models more efficient. However, the ability of a single codebook to capture fine-grained acoustic details remains limited, as the frequency-variant nature of one-dimensional (1D) tokenizers leads to redundancy. To address this issue, we propose MelTok, a two-dimensional (2D) tokenizer that effectively compresses the acoustic details of 44.1 kHz audio into a single codebook. The tokenizer encodes audio into a more compact representation than 1D tokenizers do. Furthermore, to recover audio from mel-spectrogram tokens, we propose a token-based vocoder. Both objective and subjective evaluations demonstrate that MelTok achieves quality comparable to multi-codebook codecs and outperforms existing state-of-the-art single-codebook neural codecs on high-fidelity audio reconstruction. By preserving acoustic details, MelTok offers a strong representation for downstream understanding tasks.
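To make the idea of 2D tokenization with a single codebook concrete, the following is a minimal illustrative sketch, not MelTok's actual architecture. It assumes a mel-spectrogram stored as a NumPy array of shape (n_mels, n_frames), splits it into non-overlapping time-frequency patches, and assigns each patch the index of its nearest entry in one codebook. The function names, the 4x4 patch size, and the random codebook are hypothetical choices for illustration; a real tokenizer would likely use a learned encoder and a trained codebook rather than raw patches.

```python
import numpy as np

def tokenize_patches(mel, codebook, patch=(4, 4)):
    """Map each non-overlapping time-frequency patch to its nearest codebook index."""
    ph, pw = patch
    n_mels, n_frames = mel.shape
    # Crop so the spectrogram divides evenly into patches (a simplification).
    H, W = n_mels // ph, n_frames // pw
    mel = mel[: H * ph, : W * pw]
    # Rearrange into one flat vector per patch: shape (H * W, ph * pw).
    patches = mel.reshape(H, ph, W, pw).transpose(0, 2, 1, 3).reshape(H * W, ph * pw)
    # Squared Euclidean distance from every patch to every codebook entry.
    dist = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # One token per patch, laid out on a 2D (frequency, time) grid.
    return dist.argmin(axis=1).reshape(H, W)

def detokenize(tokens, codebook, patch=(4, 4)):
    """Rebuild a (quantized) spectrogram by pasting codebook entries back in place."""
    ph, pw = patch
    H, W = tokens.shape
    patches = codebook[tokens.reshape(-1)].reshape(H, W, ph, pw)
    return patches.transpose(0, 2, 1, 3).reshape(H * ph, W * pw)

# Hypothetical setup: 16 codebook entries, each a flattened 4x4 patch.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 16))
true_tokens = rng.integers(0, 16, size=(5, 7))
mel = detokenize(true_tokens, codebook)       # synthetic (20, 28) "spectrogram"
tokens = tokenize_patches(mel, codebook)      # 2D token grid of shape (5, 7)
```

In this toy setup the spectrogram is built from codebook entries, so tokenization recovers the original token grid exactly; on real audio the reconstruction would only approximate the input, and a separate vocoder would be needed to turn the spectrogram back into a waveform.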