Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the ability of deep models to transfer knowledge from language to music, by fine-tuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while a model that is not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, and evaluating drum grooves, for which there is little precedent in the literature, is even more so. Hence, we propose a tailored structural evaluation method and analyze drum grooves generated by GPT3 against those played by human professionals, exposing the strengths and weaknesses of generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
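To make the fine-tuning setup concrete, a minimal sketch of how a drum-groove MIDI file could be serialized into a text token sequence suitable for a text-pretrained language model is shown below. The token format, pitch map (`PITCH_NAMES`), and quantization grid (`GRID`) are illustrative assumptions and not necessarily the encoding used in this work.

```python
# Hypothetical encoding sketch: MIDI drum groove -> text tokens for LLM fine-tuning.
# Assumes the pretty_midi library and a local file "groove.mid".
import pretty_midi

# Subset of General MIDI percussion pitches mapped to readable tokens (assumed).
PITCH_NAMES = {36: "KICK", 38: "SNARE", 42: "HH_CLOSED", 46: "HH_OPEN", 49: "CRASH"}
GRID = 0.125  # quantization step in seconds (roughly a 16th note at 120 BPM; assumed)

def midi_to_tokens(path: str) -> str:
    """Flatten all drum notes into time-ordered tokens like 'T012 SNARE'."""
    midi = pretty_midi.PrettyMIDI(path)
    tokens = []
    for inst in midi.instruments:
        if not inst.is_drum:
            continue
        for note in inst.notes:
            step = int(round(note.start / GRID))           # quantize onset time to the grid
            name = PITCH_NAMES.get(note.pitch, f"P{note.pitch}")
            tokens.append((step, f"T{step:03d} {name}"))
    tokens.sort(key=lambda t: t[0])                         # chronological order
    return " ".join(tok for _, tok in tokens)

if __name__ == "__main__":
    # e.g. "T000 KICK T002 HH_CLOSED T004 SNARE ..."
    print(midi_to_tokens("groove.mid"))
```

Each resulting string can then serve as one training example for fine-tuning, so that the pre-trained model only has to adapt its sequence modeling to a new "vocabulary" of drum events rather than learn musical structure from scratch.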