Abstract:
Given a facial image of an individual and accompanying audio, talking face generation aims to synthesize portrait videos of that individual conditioned on the given audio. Existing methods focus on generating talking face videos conditioned on audio together with a portrait image or a driving video, yet they struggle to produce realistic head movements and facial expressions that align with the audio content. To tackle these issues, we introduce a novel framework for synthesizing expressive talking faces solely from textual input, where the facial characteristics or the name of the subject is passed as input along with the audio content to be spoken. Our work uses Facial Action Units (FAUs) to explicitly model the facial characteristics of the subject, along with other implicit parameters responsible for talking face synthesis. The result is an expressive talking face that explicitly models lip synchronization with the audio, head motion, and facial expressions, yielding a photo-realistic emotional talking face.