Music performance synthesis aims to synthesize a musical score into a natural performance. In this paper, we borrow recent advances in text-to-speech synthesis and present the Deep Performer -- a novel system for score-to-audio music performance synthesis. Unlike speech, music often contains polyphony and long notes. Hence, we propose two new techniques for handling polyphonic inputs and providing fine-grained conditioning in a transformer encoder-decoder model. To train our proposed system, we present a new violin dataset consisting of paired recordings and scores, along with estimated alignments between them. We show that our proposed model can synthesize music with clear polyphony and harmonic structures. In a listening test, we achieve competitive quality against the baseline model, a conditional generative audio model, in terms of pitch accuracy, timbre and noise level. Moreover, our proposed model significantly outperforms the baseline on an existing piano dataset in overall quality.