Towards Generating Diverse Audio Captions via Adversarial Training

Automated audio captioning is a cross-modal translation task that describes the content of audio clips in natural language sentences. The task has attracted increasing attention, and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe the clip from different aspects, using distinct words and grammar. We believe that an audio captioning system should be able to generate diverse captions, both for a fixed audio clip and across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are trained jointly: the caption generator can be any standard encoder-decoder captioning model, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model generates captions with greater diversity than state-of-the-art methods.
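
The notion of caption diversity can be made concrete with a simple lexical measure. The sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) over a set of captions. The metric choice, the function name `distinct_n`, and the example captions are illustrative assumptions; they are not taken from the evaluation reported for the Clotho experiments.

```python
def distinct_n(captions, n=1):
    """Return the ratio of unique n-grams to total n-grams in a caption set.

    Illustrative helper (an assumption, not the paper's metric): values near
    1.0 mean few repeated n-grams, values near 0.0 mean heavy repetition.
    """
    ngrams = []
    for caption in captions:
        tokens = caption.lower().split()
        # Collect every contiguous n-gram in this caption.
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical captions for the same audio clip.
generic_captions = [
    "a dog is barking",
    "a dog is barking",
    "a dog is barking",
]
varied_captions = [
    "a dog is barking",
    "birds chirp while a dog barks nearby",
    "loud barking echoes outside",
]

if __name__ == "__main__":
    print("generic distinct-1:", distinct_n(generic_captions, n=1))
    print("varied  distinct-1:", distinct_n(varied_captions, n=1))
```

A deterministic system that repeats one caption scores low on this ratio, whereas a set of captions that focus on different sound events and use different wording scores higher.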
