Humans rely on multisensory integration to perceive spatial environments, with auditory cues enabling sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits progress in spatial audio generation and understanding. To address this gap, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio comprises four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.