Technology Sharing

【Paper】VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

2024-07-11


Holistic framework. Instead of generating video frames directly, the model generates holistic facial dynamics and head motion in a latent space, conditioned on audio and other signals. Given these motion latent codes, a face decoder produces the video frames, taking as additional input appearance and identity features extracted from the source image.
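The two-stage pipeline can be sketched as follows. This is a toy illustration of the data flow only, not the paper's implementation: the function names, feature dimensions, and the random stand-in for the learned motion generator are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_appearance(source_image):
    # Toy stand-in for the appearance/identity extractor:
    # collapse the image to a global feature vector.
    return source_image.mean(axis=(0, 2))  # shape (H,)

def generate_motion_latents(audio_features, latent_dim=64):
    # Toy stand-in for the learned motion generator:
    # one holistic motion code per audio frame.
    n_frames = audio_features.shape[0]
    return rng.normal(size=(n_frames, latent_dim))

def decode_frames(appearance, motion_latents, height=64, width=64):
    # Toy face decoder: renders one frame per motion code,
    # conditioned on the appearance features of the source image.
    frames = []
    for z in motion_latents:
        # A real decoder is a neural renderer; here we just mix the inputs.
        frame = np.outer(np.tanh(z[:height]), np.tanh(appearance[:width]))
        frames.append(frame)
    return np.stack(frames)

# Driving a single source image with 25 frames of audio features.
source = rng.random((64, 64, 3))
audio = rng.random((25, 128))
video = decode_frames(extract_appearance(source), generate_motion_latents(audio))
print(video.shape)  # (25, 64, 64)
```

The key point the sketch mirrors is that only the motion codes vary over time; appearance and identity come from the single input image and are held fixed across all frames.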

A facial latent space is constructed and the facial encoder and decoder are trained.
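One way to see what "expressive and separable" asks of this latent space: the encoder should factor a frame into an identity part and a dynamics part that can be recombined freely. The sketch below is a hypothetical illustration of that property with toy functions, not the paper's encoder or decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_encoder(frame):
    # Hypothetical split of a flattened projection: first half as
    # identity/appearance, second half as per-frame dynamics
    # (expression, head pose).
    feat = frame.reshape(-1)[:128]
    return feat[:64], feat[64:]

def toy_decoder(identity, dynamics):
    # Recombines the two factors into a frame-sized array.
    return np.outer(np.tanh(identity), np.tanh(dynamics))

# Disentanglement check (toy): swapping in the dynamics of another frame
# should change the reconstruction while identity is held fixed.
f1, f2 = rng.random((16, 16)), rng.random((16, 16))
id1, dyn1 = toy_encoder(f1)
_, dyn2 = toy_encoder(f2)
recon_same = toy_decoder(id1, dyn1)
recon_swap = toy_decoder(id1, dyn2)
print(recon_same.shape, np.allclose(recon_same, recon_swap))  # (64, 64) False
```

In the real framework the split is learned from videos rather than hard-coded, but the cross-combination idea is the same: motion codes from one source can drive the appearance of another.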

We design and train a facial latent learning framework with expressive and disentangled features on real face videos. We then train a diffusion transformer to model the motion distribution, generating motion latent codes at test time conditioned on audio and other signals.
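Generating motion codes from a diffusion model amounts to iteratively denoising a noise sequence under the audio condition. Below is a minimal DDPM-style ancestral-sampling sketch; the noise schedule, the toy denoiser, and all dimensions are assumptions standing in for the trained diffusion transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard linear DDPM noise schedule (assumed; not from the paper).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(x_t, t, audio_cond):
    # Hypothetical stand-in for the diffusion transformer: predicts the
    # noise in x_t. A real model would attend over the frame axis and
    # the audio condition, and take the timestep t as input.
    return np.tanh(x_t + 0.1 * audio_cond)

def sample_motion_latents(audio_cond):
    # Ancestral sampling: start from pure noise, denoise step by step,
    # conditioned on the per-frame audio features.
    x = rng.normal(size=audio_cond.shape)
    for t in reversed(range(T)):
        eps = toy_denoiser(x, t, audio_cond)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            x += np.sqrt(betas[t]) * rng.normal(size=x.shape)
    return x

audio_cond = rng.random((25, 64))  # 25 audio frames, 64-dim features each
motion = sample_motion_latents(audio_cond)
print(motion.shape)  # (25, 64)
```

Because sampling operates on compact motion codes rather than pixels, the denoising loop is cheap, which is what makes real-time generation plausible once the codes are handed to the face decoder.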