We introduce Perception Encoder Audiovisual (PE-AV), a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV extends representations to audio and natively supports joint embeddings across the audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision that is consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding the single-domain limitations common in prior work. We train with ten pairwise contrastive objectives, showing that scaling the number of cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
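To make the multi-pair objective concrete, here is a minimal sketch of summing a symmetric InfoNCE loss over several modality/caption pairs. The function names, the temperature value, and the specific pair list are illustrative assumptions, not the paper's actual implementation; PE-AV trains over ten such pairs.

```python
# Sketch: one symmetric InfoNCE term per aligned modality pair, summed.
# All names (info_nce, multi_pair_loss, the pair list) are hypothetical.
import torch
import torch.nn.functional as F


def info_nce(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                      # (B, B) similarities
    targets = torch.arange(x.size(0), device=x.device)    # matched pairs on the diagonal
    # Average the two retrieval directions (x -> y and y -> x).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multi_pair_loss(emb: dict, pairs: list) -> torch.Tensor:
    """Sum the pairwise objective over every aligned modality/caption pair."""
    return sum(info_nce(emb[a], emb[b]) for a, b in pairs)


# Example with a subset of the cross-modality and caption-type pairs.
B, D = 32, 512
emb = {k: torch.randn(B, D) for k in
       ["audio", "video", "audio_caption", "video_caption"]}
pairs = [("audio", "video"),
         ("audio", "audio_caption"),
         ("video", "video_caption"),
         ("audio", "video_caption")]
loss = multi_pair_loss(emb, pairs)
```

Scaling the objective is then just extending `pairs`: each additional modality or caption type contributes new positive pairs without changing the per-pair loss.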
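For the frame-level objective behind PE-A-Frame, a sketch in the same spirit follows: per-frame audio embeddings are scored against text embeddings, so events can be localized in time. The tensor shapes and the max-over-frames aggregation are assumptions for illustration, not the paper's stated design.

```python
# Sketch: frame-level audio-text alignment (hypothetical shapes/pooling).
import torch
import torch.nn.functional as F


def frame_text_logits(frames: torch.Tensor,   # (B, T, D) per-frame audio embeddings
                      text: torch.Tensor,     # (B, D) caption/event embeddings
                      temperature: float = 0.07) -> torch.Tensor:
    frames = F.normalize(frames, dim=-1)
    text = F.normalize(text, dim=-1)
    # (B, T, B): similarity of every frame in clip i to every caption j.
    sim = torch.einsum("btd,jd->btj", frames, text) / temperature
    # Clip-level logits via max over frames: a caption matches a clip if it
    # matches at least one frame; the per-frame scores in `sim` double as
    # temporal detections for sound event detection.
    return sim.max(dim=1).values              # (B, B)


B, T, D = 8, 50, 512
logits = frame_text_logits(torch.randn(B, T, D), torch.randn(B, D))
targets = torch.arange(B)
loss = 0.5 * (F.cross_entropy(logits, targets) +
              F.cross_entropy(logits.t(), targets))
```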