Instilling an Active Mind in Avatars via Cognitive Simulation
Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Weifeng Chen, Xing Wang, Yuan Zhang, Mingyuan Gao
This paper introduces a novel framework that uses a Multimodal Large Language Model for semantic guidance and a Multimodal Diffusion Transformer (DiT) for multimodal fusion to generate expressive, context-aware video avatars, demonstrating competitive performance.
Abstract
Current video avatar models can generate fluid animations but struggle to capture a character's authentic essence, primarily synchronizing motion with low-level audio cues instead of understanding higher-level semantics like emotion or intent. To bridge this gap, we propose a novel framework for generating character animations that are not only physically plausible but also semantically rich and expressive. Our model is built on two technical innovations. First, we employ Multimodal Large Language Models to generate a structured textual representation from the input conditions, providing high-level semantic guidance for creating contextually and emotionally resonant actions. Second, to ensure robust fusion of multimodal signals, we introduce a specialized Multimodal Diffusion Transformer architecture featuring a novel Pseudo Last Frame design. This allows our model to accurately interpret the joint semantics of audio, images, and text, generating motions that are deeply coherent with the overall context. Comprehensive experiments validate the superiority of our method, which achieves compelling results in lip-sync accuracy, video quality, motion naturalness, and semantic consistency. The approach also shows strong generalization to challenging scenarios, including multi-person and non-human subjects. Our video results are available at https://omnihuman-lab.github.io/v1_5/ .
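The two-stage design described in the abstract (an MLLM that turns raw conditions into structured semantic guidance, followed by an MMDiT that fuses that guidance with audio and a reference image) can be sketched at the data-flow level as below. This is a minimal illustrative sketch: every name (`SemanticPlan`, `plan_semantics`, `render_with_mmdit`) and every plan field is an assumption for exposition, not the paper's actual schema or API, and the stand-in functions only mimic the interfaces of the real MLLM and diffusion model.

```python
from dataclasses import dataclass

@dataclass
class SemanticPlan:
    """Structured textual guidance; illustrative fields, not the paper's schema."""
    emotion: str
    intent: str
    action_description: str

def plan_semantics(audio_summary: str, image_caption: str) -> SemanticPlan:
    """Stage 1 stand-in: in the paper an MLLM maps the input conditions to a
    structured textual plan. A trivial keyword rule plays that role here."""
    emotion = "joyful" if "laugh" in audio_summary else "neutral"
    return SemanticPlan(
        emotion=emotion,
        intent="address the viewer",
        action_description=f"{emotion} speaking gesture",
    )

def render_with_mmdit(plan: SemanticPlan, audio: list, ref_image: list) -> dict:
    """Stage 2 stand-in: the MMDiT would denoise video latents conditioned
    jointly on the audio, the reference image, and the plan text. Here we
    just echo the fused conditioning to make the data flow explicit."""
    return {
        "text_cond": plan.action_description,
        "audio_frames": len(audio),
        "ref_pixels": len(ref_image),
    }

plan = plan_semantics("speech with a laugh", "portrait of a person")
out = render_with_mmdit(plan, audio=[0.0] * 16, ref_image=[0] * 64)
print(out["text_cond"])  # → joyful speaking gesture
```

The point of the split is that the generator never has to infer emotion or intent from raw audio on its own; it receives them as an explicit text condition produced by the planning stage.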
Avatar generation framework using MLLM semantic planning and specialized MMDiT for coherent character animations aligned with multimodal context.
- Novel paradigm simulating deliberative System 2 thinking for video avatar generation beyond reactive behavior
- MLLM-based agents generating structured semantic guidance for character animations
- Specialized Multimodal Diffusion Transformer with Pseudo Last Frame design for multimodal signal fusion
- multimodal large language models
- diffusion transformers
- semantic planning
Limitations (from the paper)
- Minor artifacts at the synthesis level and occasional imperfections in motion reasoning
- Model scaling and stronger dataset quality needed to improve fundamental generation quality
Future directions (from the paper)
- Continue scaling with larger, higher-quality datasets
- Investigate more sophisticated integration strategies with end-to-end joint training of the DiT and LLM
Author keywords
- Video Generation
- Human Animation
- Avatar
- Multimedia
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice impact protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.