PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation

This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and therefore suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues with PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses a polyphonic note sequence of arbitrary length into a single compact 64-dimensional phrase-level latent with high reconstruction fidelity, which yields a well-structured latent space and enables efficient generative modeling. Built on this latent space, PhraseLDM generates an entire multitrack song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64 bpm), and produces complete songs with coherent local texture, idiomatic instrument patterns, and clear global structure. With only 45M parameters, our framework generates a full song within seconds while maintaining competitive musical quality and generation diversity. Together, these results show that phrase-level latent diffusion is an effective and scalable solution to long-sequence modeling in symbolic music generation. We hope this work encourages future symbolic music research to move beyond note-attribute tokens and to consider phrase-level units as a more effective and musically meaningful modeling target.
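The figures quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not the authors' code. It assumes 4/4 time, which the abstract does not state but which is consistent with 128 bars lasting 8 minutes at 64 bpm. The function names and the values for tracks, notes per bar, attributes per note, and bars per phrase are all hypothetical, chosen only to show why one latent per phrase is a much shorter sequence than note-attribute tokens.

```python
import math


def song_duration_minutes(bars: int, bpm: float, beats_per_bar: int = 4) -> float:
    """Duration in minutes. beats_per_bar=4 is an assumption (4/4 time)."""
    return bars * beats_per_bar / bpm


def note_token_count(bars: int, tracks: int, notes_per_bar: int, attrs_per_note: int) -> int:
    """Sequence length when every note attribute is its own token.

    All four arguments are hypothetical workload parameters.
    """
    return bars * tracks * notes_per_bar * attrs_per_note


def phrase_latent_count(bars: int, tracks: int, bars_per_phrase: int) -> int:
    """Sequence length with one latent vector per phrase per track.

    bars_per_phrase is hypothetical; the abstract does not define phrase length.
    """
    return tracks * math.ceil(bars / bars_per_phrase)


if __name__ == "__main__":
    bars, tracks = 128, 4  # 128 bars is from the abstract; 4 tracks is made up
    print("duration (min):", song_duration_minutes(bars, bpm=64))
    print("note-attribute tokens:", note_token_count(bars, tracks, notes_per_bar=8, attrs_per_note=4))
    print("phrase latents:", phrase_latent_count(bars, tracks, bars_per_phrase=1))
```

Under these made-up settings the phrase-level sequence is 32 times shorter than the token sequence. The real ratio depends on note density and on how the report defines a phrase.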
