The field of Automatic Music Generation has seen significant progress thanks to the advent of Deep Learning. However, most of these results have been produced by unconditional models, which lack the ability to interact with their users and therefore do not allow them to guide the generative process in meaningful and practical ways. Moreover, synthesizing music that remains coherent across longer timescales while still capturing the local aspects that make it sound ``realistic'' or ``human-like'' remains challenging. This is due both to the large computational requirements of working with long sequences of data and to the limitations imposed by the training schemes that are often employed. In this paper, we propose a generative model of symbolic music conditioned on data retrieved from human sentiment. The model is a Transformer-GAN trained with labels that correspond to different configurations of the valence and arousal dimensions, which quantitatively represent human affective states. We tackle both of the problems above by employing an efficient linear version of Attention and by using a Discriminator both to improve the overall quality of the generated music and to strengthen its ability to follow the conditioning signals.
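For concreteness, the sketch below illustrates the non-causal form of linear Attention, assuming the positive feature map $\phi(x) = \mathrm{elu}(x) + 1$ of Katharopoulos et al. (2020), a common choice for efficient linear Attention; the function name and tensor layout are illustrative, and an autoregressive model such as the one described above would use a causal variant.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention with the elu(x)+1 feature map.

    q, k, v: tensors of shape (batch, seq_len, heads, dim).
    Complexity is O(N) in seq_len, versus O(N^2) for softmax attention.
    """
    # Positive feature map phi(x) = elu(x) + 1 replaces the softmax kernel.
    q = torch.nn.functional.elu(q) + 1.0
    k = torch.nn.functional.elu(k) + 1.0
    # Aggregate keys and values once into a (batch, heads, dim, dim) summary
    # instead of materializing the (seq_len x seq_len) attention matrix.
    kv = torch.einsum('bshd,bshe->bhde', k, v)
    # Per-query normalizer: phi(q_i) dotted with the sum of phi(k_j) over j.
    z = 1.0 / (torch.einsum('bshd,bhd->bsh', q, k.sum(dim=1)) + eps)
    # Apply each query's features to the aggregated key-value summary.
    return torch.einsum('bshd,bhde,bsh->bshe', q, kv, z)
```

In the causal variant, the key-value summary and the normalizer are replaced by prefix sums over positions $j \le i$, which is what makes left-to-right generation run in linear time and constant memory per step.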