The applications of short-form user-generated video (UGV), such as Snapchat and YouTube short videos, have boomed recently, raising many multimodal machine learning tasks. Among them, learning the correspondence between audio and visual information from videos is a challenging one. Most previous work on audio-visual correspondence (AVC) learning only investigated constrained videos or simple settings, which may not fit the applications of UGV. In this paper, we proposed new principles for AVC and introduced a new framework that sets sight on the themes of videos to facilitate AVC learning. We also released the KWAI-AD-AudVis corpus, which contains 85,432 short advertisement videos (around 913 hours) made by users. We evaluated our proposed approach on this corpus, and it outperformed the baseline by a 23.15% absolute difference.