AnyTalker: Scaling Multi-person Talking Video Generation with Interactivity Refinement
Technical Report · Code
We propose AnyTalker, an audio-driven framework for generating multi-person talking videos. It features a flexible multi-stream structure that scales the number of identities while ensuring seamless inter-identity interactions.
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high cost of collecting diverse multi-person data and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework built on a flexible multi-stream processing architecture. Specifically, we extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity–audio pairs, enabling the number of drivable identities to scale arbitrarily. Moreover, whereas training multi-person generative models typically demands massive multi-person data, our proposed training pipeline relies solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves excellent lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data cost and generation fidelity.
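The iterative identity–audio pairing described above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions, not the released implementation: we assume each identity contributes a spatial token mask and an audio feature stream, and that the layer loops over (mask, audio) pairs so video tokens belonging to one identity attend only to that identity's audio. All function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # scaled dot-product attention: q is (Tq, d), k and v are (Tk, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def identity_aware_attention(video_tokens, id_masks, audio_streams):
    """Hypothetical sketch of identity-aware attention: iterate over
    (identity mask, audio stream) pairs, so adding another identity
    just adds another pair -- the layer itself is unchanged."""
    out = video_tokens.copy()
    for mask, audio in zip(id_masks, audio_streams):
        q = video_tokens[mask]                   # this identity's region tokens
        out[mask] += cross_attention(q, audio, audio)  # residual audio injection
    return out
```

Tokens outside every identity mask (e.g. background) pass through untouched, which is one plausible way a single layer could drive an arbitrary number of speakers without interference between their audio streams.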
In this work, we introduce AnyTalker, an audio-driven framework for generating multi-person talking videos. It presents a novel multi-stream structure, the Audio-Face Cross Attention Layer, that enables identity scaling while guaranteeing seamless cross-identity interactions.
Some materials and video sources are derived from real videos. The generated content is for academic use only; commercial use is not permitted.