ScaffoldAvatar: High-Fidelity Gaussian Avatars with Patch Expressions

1 Technical University of Munich
2 DisneyResearch|Studios, Switzerland

ScaffoldAvatar synthesizes ultra-high-fidelity, multi-view-consistent, photorealistic avatars. Our method generates realistic, high-quality animations that capture freckles and other fine facial details.

Abstract

Generating high-fidelity, real-time animated sequences of photorealistic 3D head avatars is important for many graphics applications, including immersive telepresence and movies. The problem is particularly challenging when rendering digital avatar close-ups that show a character's facial microfeatures and expressions. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple locally-defined facial expressions with 3D Gaussian splatting to create ultra-high-fidelity, expressive, and photorealistic 3D head avatars. In contrast to previous works that operate on a global expression space, we condition our avatar's dynamics on patch-based local expression features and synthesize 3D Gaussians at a patch level. In particular, we leverage a patch-based geometric 3D face model to extract patch expressions and learn how to translate these into local dynamic skin appearance and motion by coupling the patches with anchor points of Scaffold-GS, a recent hierarchical scene representation. These anchors are then used to synthesize 3D Gaussians on-the-fly, conditioned on patch expressions and viewing direction. We employ color-based densification and progressive training to obtain high-quality results and faster convergence on high-resolution 3K training images. By leveraging patch-level expressions, ScaffoldAvatar consistently achieves state-of-the-art performance with visually natural motion, while encompassing diverse facial expressions and styles in real time.
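
To make the anchor-based synthesis concrete, below is a minimal PyTorch sketch of the idea described above: each Scaffold-GS anchor carries a learned feature and spawns a small set of Gaussians whose attributes are decoded by lightweight MLPs conditioned on a patch-level expression code and the viewing direction. All names, dimensions, and the fixed number of Gaussians per anchor are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed structure, not the authors' code): decode per-anchor
# Gaussian attributes from the anchor feature, a patch expression code, and the
# viewing direction, as in Scaffold-GS-style on-the-fly synthesis.
import torch
import torch.nn as nn

class AnchorGaussianDecoder(nn.Module):
    def __init__(self, feat_dim=32, expr_dim=16, k=10):
        super().__init__()
        self.k = k
        in_dim = feat_dim + expr_dim + 3  # anchor feature + patch expression + view direction
        def head(out_dim):
            return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
        self.offset_head  = head(3 * k)   # positions of the k Gaussians around the anchor
        self.opacity_head = head(k)
        self.scale_head   = head(3 * k)
        self.rot_head     = head(4 * k)   # quaternions
        self.color_head   = head(3 * k)

    def forward(self, anchor_pos, anchor_feat, patch_expr, view_dir):
        # anchor_pos: (N, 3), anchor_feat: (N, feat_dim),
        # patch_expr: (N, expr_dim) expression code of the patch owning each anchor,
        # view_dir: (N, 3) normalized camera-to-anchor direction.
        x = torch.cat([anchor_feat, patch_expr, view_dir], dim=-1)
        offsets = self.offset_head(x).view(-1, self.k, 3)
        mu      = anchor_pos.unsqueeze(1) + offsets                    # Gaussian centers
        alpha   = torch.sigmoid(self.opacity_head(x)).view(-1, self.k, 1)
        scale   = torch.exp(self.scale_head(x)).view(-1, self.k, 3)
        quat    = nn.functional.normalize(self.rot_head(x).view(-1, self.k, 4), dim=-1)
        color   = torch.sigmoid(self.color_head(x)).view(-1, self.k, 3)
        return mu, alpha, scale, quat, color
```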

Video





Method Overview

(a) Given a sequence of multi-view images, we first run a 3D Face Tracker to obtain tracked meshes with consistent topology. (b) Next, we define a patch layout and compute patch centers, orientations, and positions in world space, given by a TBNP matrix \( \mathcal{T}_p \). This gives a per-patch coordinate frame, which we combine with blendweights \( \boldsymbol{\beta}_p \) (from the PBS model, Sec. 4.2). (c) Finally, Scaffold-GS anchors \( \mathbf{A}_p \) are attached to the patches. Each anchor's attributes (position \( \boldsymbol{\mu} \), scale \( \boldsymbol{s} \), opacity \( \alpha \), and feature \( \mathbf{f} \)) are optimized together with the global MLP \( \mathbf{G} \), per-patch expression MLPs \( \{ \mathbf{P}_0, \ldots, \mathbf{P}_{P-1} \} \), and scaffold MLPs \( \{ \mathcal{F}_{\boldsymbol{\mu}}, \mathcal{F}_{\alpha}, \mathcal{F}_{\mathbf{q}}, \mathcal{F}_{\mathbf{s}}, \mathcal{F}_{\mathbf{c}}^{1:P} \} \) to decode Gaussian features. Gaussian primitives are predicted in the local coordinate frame of each patch and deformed to global space with the tracked mesh, resulting in a high-quality, re-animatable 3DGS-based avatar.
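
As a rough illustration of steps (b) and (c), the following sketch assembles a per-patch local-to-world frame from the tracked mesh (in the spirit of the TBNP matrix \( \mathcal{T}_p \)) and uses it to deform Gaussian centers predicted in the patch's local frame into world space. Function names, tensor shapes, and the exact frame convention are assumptions made for illustration.

```python
# Illustrative sketch (assumptions, not the paper's implementation): build a
# per-patch frame from tangent, bitangent, normal, and patch center, then map
# patch-local Gaussian centers into world space for the current tracked frame.
import torch

def patch_frame(tangent, bitangent, normal, center):
    """Assemble a 4x4 local-to-world transform per patch.
    tangent, bitangent, normal, center: (P, 3) tensors from the tracked mesh."""
    P = tangent.shape[0]
    T = torch.eye(4).repeat(P, 1, 1)
    T[:, :3, 0] = tangent
    T[:, :3, 1] = bitangent
    T[:, :3, 2] = normal
    T[:, :3, 3] = center
    return T  # (P, 4, 4), analogous to the TBNP matrix T_p

def local_to_world(mu_local, patch_idx, T_p):
    """Map Gaussian centers from each patch's local frame to world space.
    mu_local: (N, 3) local positions; patch_idx: (N,) owning patch per Gaussian."""
    T = T_p[patch_idx]                                    # (N, 4, 4)
    mu_h = torch.cat([mu_local, torch.ones_like(mu_local[:, :1])], dim=-1)
    return torch.einsum('nij,nj->ni', T, mu_h)[:, :3]     # (N, 3) world positions
```

Because the Gaussians live in patch-local coordinates, re-animating the avatar only requires recomputing the per-patch frames from the tracked mesh at each frame; the learned local attributes carry over unchanged.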



Baseline Comparisons


We compare against state-of-the-art baselines and show results for Novel View Synthesis, Self-Reenactment and Cross-Reenactment.

Avatar FlyThrough


We show a flythrough of our avatar with a fixed neutral expression (left) and with dynamic expressions (right).





BibTeX

If you find this work useful for your research, please consider citing: