GaussianSpeech: Audio-Driven Gaussian Avatars
1Technical University of Munich
2Max Planck Institute for Intelligent Systems
3Technical University of Darmstadt
Given an input speech signal, GaussianSpeech synthesizes photorealistic, 3D-consistent talking human head avatars. Our method generates realistic, high-quality animations, including mouth interiors (e.g., teeth), wrinkles, and specularities in the eyes. We show an example application in which ChatGPT drives the avatar's conversation.
Abstract
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photorealistic, personalized 3D human head avatars from spoken audio. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple the speech signal with 3D Gaussian splatting to create realistic, temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle-based and perceptual losses to synthesize facial details, including wrinkles that occur with different expressions. To enable sequence modeling of 3D Gaussian splats with audio, we devise an audio-conditioned transformer model capable of extracting lip and expression features directly from audio input. Due to the absence of high-quality datasets of talking humans in correspondence with audio, we captured a new large-scale multi-view dataset of audio-visual sequences of talking humans with native English accents and diverse facial geometry. GaussianSpeech consistently achieves state-of-the-art performance with visually natural motion at real-time rendering rates, while encompassing diverse facial expressions and styles.
Video
Results
Method Overview
Person-specific 3D Avatar: We compute 3D face tracking and bind 3D Gaussians to the triangles of the tracked FLAME mesh. We apply volume-based pruning to prevent the optimization from generating an excessive number of Gaussians, and subdivide mesh triangles in the mouth region. We train a color MLP \(\theta_\textrm{color}\) to synthesize expression- and view-dependent color, and apply wrinkle regularization and perceptual losses to improve photorealism. A minimal sketch of this representation is shown below.
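The following PyTorch sketch illustrates the two ideas above: placing Gaussian centers on mesh triangles so they move with the tracked FLAME mesh, and an MLP that predicts color from a per-Gaussian latent, the expression code, and the view direction. Function and class names (`bind_gaussians_to_triangles`, `ColorMLP`) and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def bind_gaussians_to_triangles(vertices, faces, n_per_tri=4):
    """Place Gaussian centers on mesh triangles via random barycentric coordinates.

    vertices: (V, 3) tracked FLAME vertices, faces: (F, 3) triangle indices.
    Returns (F * n_per_tri, 3) centers that deform together with the mesh.
    """
    tri = vertices[faces]                              # (F, 3, 3) triangle corner positions
    bary = torch.rand(faces.shape[0], n_per_tri, 3)
    bary = bary / bary.sum(dim=-1, keepdim=True)       # normalize to barycentric weights
    centers = torch.einsum('fnk,fkd->fnd', bary, tri)  # (F, n_per_tri, 3)
    return centers.reshape(-1, 3)


class ColorMLP(nn.Module):
    """Predicts per-Gaussian RGB from a Gaussian latent, FLAME expression code,
    and view direction, so appearance (wrinkles, eye specularities) can change
    with expression and viewpoint."""

    def __init__(self, latent_dim=32, expr_dim=100, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + expr_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),         # RGB in [0, 1]
        )

    def forward(self, z, expr, view_dir):
        # z: (N, latent_dim) per-Gaussian latents, expr: (N, expr_dim), view_dir: (N, 3)
        return self.net(torch.cat([z, expr, view_dir], dim=-1))
```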
From the given speech signal, GaussianSpeech uses a Wav2Vec 2.0 encoder to extract generic audio features and maps them to personalized lip feature embeddings \(\boldsymbol{c}^{1:T}\) with a Lip Transformer Encoder and wrinkle features \(\boldsymbol{w}^{1:T}\) with a Wrinkle Transformer Encoder. Next, the Expression Encoder synthesizes FLAME expressions \(\boldsymbol{e}^{1:T}\), which are projected via an Expression2Latent MLP and concatenated with \(\boldsymbol{c}^{1:T}\) as input to the motion decoder. The motion decoder employs a multi-head transformer decoder consisting of multi-head self-attention, cross-attention, and feed-forward layers. The concatenated lip-expression features are fused into the decoder via the cross-attention layers with alignment mask \(\mathcal{M}\). The decoder then predicts FLAME vertex offsets \(\{ \boldsymbol{V}_\textrm{offset} \}^{1:T}\), which are added to the template mesh \(\boldsymbol{T}\) to generate the vertex animation in canonical space. During training, these animated vertices are fed to our optimized 3DGS avatar, and the color MLP \(\boldsymbol{\theta}_\textrm{color}\) and Gaussian latents \(\boldsymbol{z}\) are further refined via re-rendering losses. A simplified sketch of this audio-to-motion stage follows.
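The sketch below, again a rough assumption rather than the actual architecture, shows the overall data flow: per-frame Wav2Vec 2.0 features are mapped to lip and expression embeddings, fused, and consumed by a transformer decoder via cross-attention (with an optional alignment mask) to predict per-frame vertex offsets on top of the template mesh. The wrinkle branch is omitted for brevity; all module names, layer counts, and dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class AudioToMotion(nn.Module):
    """Illustrative audio-to-motion model: audio features -> vertex offsets."""

    def __init__(self, audio_dim=768, feat_dim=128, expr_dim=100, n_verts=5023):
        super().__init__()
        # Stand-ins for the lip and expression encoders described above.
        enc_layer = nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True)
        self.audio_proj = nn.Linear(audio_dim, feat_dim)
        self.lip_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.expr_encoder = nn.Linear(feat_dim, expr_dim)       # predicts e^{1:T}
        self.expr2latent = nn.Linear(expr_dim, feat_dim)        # Expression2Latent MLP
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        # Motion decoder: self-attention over motion queries, cross-attention
        # to the fused lip-expression features.
        dec_layer = nn.TransformerDecoderLayer(feat_dim, nhead=4, batch_first=True)
        self.motion_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.query_embed = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.to_offsets = nn.Linear(feat_dim, n_verts * 3)

    def forward(self, audio_feats, template_verts, align_mask=None):
        # audio_feats: (B, T, audio_dim) per-frame Wav2Vec 2.0 features
        # template_verts: (n_verts, 3) FLAME template mesh T
        a = self.audio_proj(audio_feats)                        # (B, T, feat_dim)
        lip = self.lip_encoder(a)                               # c^{1:T}
        expr = self.expr_encoder(lip)                           # e^{1:T}
        cond = self.fuse(torch.cat([lip, self.expr2latent(expr)], dim=-1))
        B, T, _ = cond.shape
        queries = self.query_embed.expand(B, T, -1)             # one motion query per frame
        h = self.motion_decoder(queries, cond, memory_mask=align_mask)
        offsets = self.to_offsets(h).view(B, T, -1, 3)          # V_offset^{1:T}
        verts = template_verts[None, None] + offsets            # animation in canonical space
        return verts, expr
```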
Dataset
We collected a new multi-view dataset with 16 cameras, capturing 6 native English-speaking participants at 30 FPS and 3208x2200 resolution, with \(\sim3.5\) hours of recordings overall, an order of magnitude larger than existing datasets. Our dataset consists of more than 400 short and long sequences per person. We will make the dataset and the corresponding 3D face trackings publicly available for research purposes. Below we show some sequences from our dataset (unmute to listen to the audio).
BibTeX
If you find this work useful for your research, please consider citing:
@misc{aneja2024gaussianspeech,
  title={GaussianSpeech: Audio-Driven Gaussian Avatars},
  author={Shivangi Aneja and Artem Sevastopolsky and Tobias Kirschstein and Justus Thies and Angela Dai and Matthias Nießner},
  year={2024},
  eprint={2411.18675},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.18675},
}