Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence. Autoregressive models, such as WaveNet, model local structure but at the expense of global latent structure and with slow iterative sampling, while Generative Adversarial Networks (GANs) have global latent conditioning and efficient parallel sampling, but struggle to generate locally-coherent audio waveforms. Herein, we demonstrate that GANs can in fact generate high-fidelity and locally-coherent audio by modeling log magnitudes and instantaneous frequencies with sufficient frequency resolution in the spectral domain. Through extensive empirical investigations on the NSynth dataset, we demonstrate that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.
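As a rough illustration of the spectral representation the abstract refers to (not the authors' code; function name and STFT parameters are assumptions for the sketch), log magnitudes and instantaneous frequencies can be derived from a waveform by taking an STFT, log-compressing the magnitude, and taking frame-to-frame differences of the unwrapped phase:

```python
import numpy as np
from scipy.signal import stft

def log_mag_and_if(audio, fs=16000, nperseg=1024, noverlap=768):
    """Hypothetical sketch: return log magnitudes and instantaneous
    frequencies (phase derivatives) of `audio` from its STFT."""
    _, _, Z = stft(audio, fs=fs, nperseg=nperseg, noverlap=noverlap)
    log_mag = np.log(np.abs(Z) + 1e-6)       # log-compress dynamic range
    phase = np.unwrap(np.angle(Z), axis=1)   # unwrap phase along time frames
    inst_freq = np.diff(phase, axis=1)       # frame-to-frame phase differences
    return log_mag, inst_freq

# Example: one second of a 440 Hz sine at 16 kHz.
t = np.arange(16000) / 16000.0
audio = np.sin(2 * np.pi * 440 * t)
log_mag, inst_freq = log_mag_and_if(audio)
```

For a steady tone, the instantaneous frequency in the dominant bin is nearly constant across frames, which is the kind of local coherence the spectral parameterization makes easy for a GAN to model.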