In this paper, we present crank, open-source software for developing nonparallel voice conversion (VC) systems. We previously released sprocket, an open-source VC toolkit based on the Gaussian mixture model, for the last VC Challenge. However, sprocket cannot be applied straightforwardly to an arbitrary speech corpus, because modeling its statistical conversion function requires parallel utterances from the source and target speakers. To address this issue, we developed new open-source VC software that lets users model the conversion function using only a nonparallel speech corpus. The software is built on a vector-quantized variational autoencoder (VQVAE). To make it easy to quickly examine the effectiveness of recent techniques in this research field, crank also supports several representative approaches to autoencoder-based VC, such as hierarchical architectures, cyclic architectures, generative adversarial networks, speaker adversarial training, and neural vocoders. Moreover, crank can automatically compute objective measures such as mel-cepstrum distortion and a pseudo mean opinion score predicted by MOSNet. In this paper, we describe the representative functions implemented in crank and briefly compare them through objective evaluations.
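To make the mel-cepstrum distortion (MCD) measure concrete, the sketch below computes the frame-averaged MCD in dB, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), between a reference and a converted mel-cepstrum sequence. It is a minimal illustration, not crank's own implementation. It assumes the two sequences are already time-aligned (for example, by dynamic time warping) and excludes the 0th (energy) coefficient, which is a common but not universal convention. The function name `mel_cepstral_distortion` is hypothetical.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    ref: Sequence[Sequence[float]], conv: Sequence[Sequence[float]]
) -> float:
    """Frame-averaged MCD in dB between two time-aligned mel-cepstrum sequences.

    Hypothetical helper: assumes both sequences are already aligned frame by
    frame (e.g. via DTW) and skips the 0th (energy) coefficient of each frame.
    """
    if not ref or len(ref) != len(conv):
        raise ValueError("sequences must be non-empty and time-aligned")

    # Constant factor (10 / ln 10) * sqrt(2) converts the Euclidean distance to dB.
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for r, c in zip(ref, conv):
        if len(r) != len(c):
            raise ValueError("frames must have the same dimension")
        # Sum of squared differences over coefficients 1..D (0th excluded).
        sq = sum((a - b) ** 2 for a, b in zip(r[1:], c[1:]))
        total += scale * math.sqrt(sq)

    # Average the per-frame distortion over all aligned frames.
    return total / len(ref)


if __name__ == "__main__":
    reference = [[0.0, 1.0, 0.5], [0.0, 0.8, 0.2]]
    converted = [[3.0, 0.9, 0.4], [1.0, 0.7, 0.3]]
    print(f"MCD: {mel_cepstral_distortion(reference, converted):.3f} dB")
```

Averaging per-frame distances, rather than computing a single distance over the whole utterance, keeps the measure independent of utterance length. An evaluation pipeline would typically add the alignment step before calling such a function.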