" 唱声合成 " 中保存小菜 (Pitch Preservation In Singing Voice Synthesis)

Because singing voice corpora are limited, existing singing voice synthesis (SVS) methods that use encoder-decoder neural networks to generate spectrograms directly can produce out-of-tune output at inference time. To mitigate this, this paper presents a novel acoustic model with an independent pitch encoder and phoneme encoder, which disentangles the phoneme and pitch information in the music score to make full use of the corpus. Specifically, following equal temperament theory, the pitch encoder is constrained by a pitch metric loss that maps the distances between adjacent input pitches to the corresponding frequency multiples between the encoder outputs. For the phoneme encoder, based on the analysis that the same phoneme sung at different pitches is pronounced similarly, the encoder is followed by an adversarially trained pitch classifier that forces identical phonemes with different pitches to map into the same phoneme feature space. In this way, the sparse phonemes and pitches of the original input spaces are transformed into more compact feature spaces, where identical elements cluster closely and cooperate to enhance synthesis quality. The outputs of the two encoders are then summed and passed to the decoder of the acoustic model. Experimental results indicate that the proposed approach captures the intrinsic structure between pitch inputs, yielding better pitch synthesis accuracy and superior singing synthesis performance compared with an advanced baseline system.
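
To make the equal temperament constraint concrete, the following is a minimal Python sketch of the relationship the pitch metric loss relies on: each semitone step multiplies frequency by 2^(1/12). The abstract does not give the exact form of the loss, so `pitch_metric_loss`, the use of MIDI note numbers as pitch inputs, the L2 norm as a scalar summary of each encoder output, and the squared log-ratio penalty are all assumptions made here for illustration, not the paper's formulation.

```python
import math


def midi_to_hz(note: int) -> float:
    """Equal temperament: A4 (MIDI note 69) is 440 Hz and each
    semitone multiplies the frequency by 2 ** (1 / 12)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def l2_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def pitch_metric_loss(pitches, embeddings):
    """Hypothetical pitch metric loss (assumed form, not from the paper).

    For each pair of adjacent pitches, the ratio of the embedding norms
    is pushed toward the equal-temperament frequency ratio
    2 ** (semitone_distance / 12). The comparison is done in log2 space,
    so the target is simply semitone_distance / 12.
    Assumes every embedding has a non-zero norm.
    """
    if len(pitches) != len(embeddings):
        raise ValueError("pitches and embeddings must have the same length")
    if len(pitches) < 2:
        raise ValueError("need at least two pitches to form an adjacent pair")

    terms = []
    for i in range(len(pitches) - 1):
        target = (pitches[i + 1] - pitches[i]) / 12.0
        actual = math.log2(l2_norm(embeddings[i + 1]) / l2_norm(embeddings[i]))
        terms.append((actual - target) ** 2)
    return sum(terms) / len(terms)


if __name__ == "__main__":
    notes = [60, 62, 64]  # C4, D4, E4 as MIDI note numbers
    # Embeddings whose norms already follow the frequency ratios.
    aligned = [[midi_to_hz(n), 0.0] for n in notes]
    # Embeddings that ignore pitch entirely.
    flat = [[1.0, 0.0] for _ in notes]
    print("aligned loss:", pitch_metric_loss(notes, aligned))
    print("flat loss:", pitch_metric_loss(notes, flat))
```

Under these assumptions, embeddings whose norms scale with frequency give a loss near zero, while embeddings that ignore pitch are penalized in proportion to the squared semitone distance. A real implementation would operate on batched tensors and could compare the encoder outputs in some other way than by their norms.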
