We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.
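The dynamic source-audio weighting described above could be sketched as follows. This is a minimal illustration, not the paper's actual mechanism: the function names (`adaptive_source_weight`, `blend`) and the specific heuristic (using cosine similarity between source and target video features as a proxy for edit complexity) are hypothetical assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def adaptive_source_weight(src_video_feat, tgt_video_feat):
    """Hypothetical heuristic: the more similar the source and target
    videos (i.e., the lighter the edit), the more the source audio
    should be preserved. Maps similarity in [-1, 1] to a weight in [0, 1]."""
    sim = cosine_similarity(src_video_feat, tgt_video_feat)
    return max(0.0, min(1.0, (sim + 1.0) / 2.0))

def blend(source_audio, generated_audio, w):
    """Linearly mix source and generated audio features with weight w
    on the source (w=1 keeps the original audio entirely)."""
    return [w * s + (1.0 - w) * g for s, g in zip(source_audio, generated_audio)]
```

For an unedited region the features match, the weight saturates at 1, and the source audio passes through unchanged; for a heavy edit the weight shrinks and the generated audio dominates. A real system would learn this gating rather than hard-code it.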