2024-07-11
Holistic framework. Instead of generating video frames directly, holistic facial dynamics and head motion are generated in a latent space, conditioned on audio and other signals. Given these motion latent codes, a face decoder produces the video frames, additionally taking appearance and identity features extracted from the input image.
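A minimal sketch of this decoding step, under loose assumptions (all module names, dimensions, and the toy synthesis head are illustrative, not from the original work): a face decoder fuses one motion latent code with appearance and identity features extracted once from the single input image, producing one frame per motion latent.

```python
import torch
import torch.nn as nn

class FaceDecoder(nn.Module):
    """Hypothetical decoder: motion latent + appearance/identity features -> frame."""
    def __init__(self, motion_dim=256, app_dim=512, id_dim=512, img_channels=3):
        super().__init__()
        # Fuse motion, appearance, and identity features into one conditioning vector.
        self.fuse = nn.Linear(motion_dim + app_dim + id_dim, 1024)
        # Toy upsampling head standing in for the real image-synthesis network.
        self.to_image = nn.Sequential(
            nn.Linear(1024, 64 * 8 * 8),
            nn.Unflatten(1, (64, 8, 8)),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, img_channels, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, motion_latent, appearance_feat, identity_feat):
        h = self.fuse(torch.cat([motion_latent, appearance_feat, identity_feat], dim=-1))
        return self.to_image(h)

# Features from the input image stay fixed across the clip; only the motion latent
# changes over time, so each generated latent yields one video frame.
decoder = FaceDecoder()
frame = decoder(torch.randn(1, 256), torch.randn(1, 512), torch.randn(1, 512))
print(frame.shape)  # torch.Size([1, 3, 64, 64])
```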
A facial latent space is constructed and the facial encoder and decoder are trained.
We design and train a facial latent learning framework on real face videos to obtain expressive and disentangled latent features. We then train a diffusion transformer to model the motion distribution and, at test time, generate motion latent codes conditioned on audio and other signals.
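An illustrative sketch of this second stage, assuming a plain noise-prediction formulation (the architecture, audio feature shape, and sampling loop below are hypothetical stand-ins, not the authors' implementation): a transformer denoiser operates on a sequence of motion latents, conditioned on per-frame audio features and the diffusion step.

```python
import torch
import torch.nn as nn

class MotionDiffusionTransformer(nn.Module):
    """Hypothetical denoiser over a sequence of motion latents, audio-conditioned."""
    def __init__(self, motion_dim=256, audio_dim=128, d_model=512, layers=6, heads=8):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim + audio_dim, d_model)
        self.time_emb = nn.Embedding(1000, d_model)  # diffusion-step embedding
        enc_layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.out_proj = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio_feat, t):
        # noisy_motion: (B, T, motion_dim); audio_feat: (B, T, audio_dim); t: (B,)
        h = self.in_proj(torch.cat([noisy_motion, audio_feat], dim=-1))
        h = h + self.time_emb(t)[:, None, :]
        return self.out_proj(self.backbone(h))

@torch.no_grad()
def sample_motion(model, audio_feat, steps=50, motion_dim=256):
    """Toy reverse-diffusion loop: start from noise, iteratively denoise the sequence."""
    B, T, _ = audio_feat.shape
    x = torch.randn(B, T, motion_dim)
    for step in reversed(range(steps)):
        t = torch.full((B,), step, dtype=torch.long)
        eps = model(x, audio_feat, t)
        x = x - eps / steps  # crude update standing in for a real noise schedule
    return x

model = MotionDiffusionTransformer()
motion_latents = sample_motion(model, torch.randn(2, 100, 128))  # 100 frames of audio features
print(motion_latents.shape)  # torch.Size([2, 100, 256])
```

The sampled motion latents would then be passed, frame by frame, to the face decoder sketched earlier to render the final video.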