In this paper, we propose a novel text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions in accordance with contextual sentiments as well as speech rhythm and pauses. Specifically, our framework consists of a speaker-independent stage and a speaker-specific stage. In the speaker-independent stage, we design three parallel networks to generate animation parameters of the mouth, upper face, and head from text, respectively. In the speaker-specific stage, we present a 3D face model guided attention network to synthesize videos tailored to different individuals. It takes the animation parameters as input and exploits an attention mask to manipulate facial expression changes for the target individual. Furthermore, to better establish authentic correspondences between visual motions (i.e., facial expression changes and head movements) and audio, we leverage a high-accuracy motion capture dataset instead of relying on long videos of specific individuals. With these visual-audio correspondences in hand, we can effectively train our network in an end-to-end fashion. Extensive qualitative and quantitative experiments demonstrate that our algorithm produces high-quality, photo-realistic talking-head videos with diverse facial expressions and head motions that follow speech rhythms, and that it outperforms the state-of-the-art.
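Below is a minimal, illustrative sketch (not the authors' released code) of the two-stage pipeline described above, written in PyTorch. All names and dimensions are assumptions for exposition: text features of size 256; mouth, upper-face, and head parameter sizes of 64, 64, and 6; and a single-channel attention mask that blends a generated color map into a reference frame, standing in for the 3D face model guided attention network.

```python
# Hypothetical sketch of the two-stage framework; dimensions are assumed.
import torch
import torch.nn as nn

class ParamBranch(nn.Module):
    """One of the three parallel speaker-independent networks: maps a
    sequence of text features to per-frame animation parameters."""
    def __init__(self, text_dim=256, param_dim=64):
        super().__init__()
        self.rnn = nn.GRU(text_dim, 128, batch_first=True)
        self.head = nn.Linear(128, param_dim)

    def forward(self, text_feats):           # (B, T, text_dim)
        h, _ = self.rnn(text_feats)
        return self.head(h)                  # (B, T, param_dim)

class SpeakerSpecificRenderer(nn.Module):
    """Speaker-specific stage: predicts a color map and an attention mask
    from the animation parameters plus a reference frame, then blends them."""
    def __init__(self, param_dim=64 + 64 + 6):
        super().__init__()
        self.to_plane = nn.Linear(param_dim, 16 * 16)
        self.decode = nn.Sequential(
            nn.Conv2d(3 + 1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 4, 3, padding=1),   # 3 color + 1 attention channel
        )

    def forward(self, params, ref_frame):    # params: (B, P), ref: (B, 3, 16, 16)
        plane = self.to_plane(params).view(-1, 1, 16, 16)
        out = self.decode(torch.cat([ref_frame, plane], dim=1))
        color, attn = out[:, :3], torch.sigmoid(out[:, 3:4])
        # The attention mask decides where the frame changes (e.g., mouth,
        # brows) and where the reference pixels are kept unchanged.
        return attn * color + (1 - attn) * ref_frame

# Usage: three parallel branches, concatenated per frame, then rendered.
text = torch.randn(1, 50, 256)                            # 50 frames of text features
branches = [ParamBranch(param_dim=d) for d in (64, 64, 6)]  # mouth, upper face, head
params = torch.cat([b(text) for b in branches], dim=-1)   # (1, 50, 134)
frame = SpeakerSpecificRenderer()(params[:, 0], torch.randn(1, 3, 16, 16))
```

Keeping the three branches separate mirrors the abstract's design: mouth motion is tightly coupled to phonetic content, while upper-face expressions and head pose follow sentiment and rhythm on different timescales, so decoupled networks can specialize per region.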