ICLR 2026 Orals

Instilling an Active Mind in Avatars via Cognitive Simulation

Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Weifeng Chen, Xing Wang, Yuan Zhang, Mingyuan Gao

LLMs & Reasoning · Sat, Apr 25 · 10:42 AM–10:52 AM · 201 A/B · Avg rating: 7.00 (range 6–8)
Author-provided TL;DR

This paper introduces a novel framework that uses a Large Language Model (LLM) for semantic guidance and a Multimodal Diffusion Transformer (DiT) for multimodal fusion to generate expressive, context-aware video avatars, demonstrating competitive performance.

Abstract

Current video avatar models can generate fluid animations but struggle to capture a character's authentic essence, primarily synchronizing motion with low-level audio cues instead of understanding higher-level semantics like emotion or intent. To bridge this gap, we propose a novel framework for generating character animations that are not only physically plausible but also semantically rich and expressive. Our model is built on two technical innovations. First, we employ Multimodal Large Language Models to generate a structured textual representation from input conditions, providing high-level semantic guidance for creating contextually and emotionally resonant actions. Second, to ensure robust fusion of multimodal signals, we introduce a specialized Multimodal Diffusion Transformer architecture featuring a novel Pseudo Last Frame design. This allows our model to accurately interpret the joint semantics of audio, images, and text, generating motions that are deeply coherent with the overall context. Comprehensive experiments validate the superiority of our method, which achieves compelling results in lip-sync accuracy, video quality, motion naturalness, and semantic consistency. The approach also shows strong generalization to challenging scenarios, including multi-person and non-human subjects. Our video results are available at https://omnihuman-lab.github.io/v1_5/.
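To make the described pipeline more concrete, below is a minimal, hypothetical sketch of the two-stage design the abstract outlines: an MLLM step that turns input conditions into structured textual guidance, followed by a Multimodal DiT block that fuses audio, text, and reference-image tokens together with a learned "pseudo last frame" token. All names (`SemanticPlan`, `plan_with_mllm`, `MultimodalDiT`) and the exact fusion mechanism are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage pipeline; not the authors' code.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class SemanticPlan:
    """Structured textual guidance from the MLLM planner (hypothetical schema)."""
    emotion: str
    intent: str
    action_description: str


def plan_with_mllm(audio_transcript: str, image_caption: str) -> SemanticPlan:
    """Stand-in for the MLLM call that turns raw conditions into a structured plan.
    A real system would prompt a multimodal LLM; here we return a fixed example."""
    return SemanticPlan(
        emotion="joyful",
        intent="greeting the viewer",
        action_description="smiles, raises right hand and waves",
    )


class MultimodalDiT(nn.Module):
    """Toy Multimodal DiT block: fuses audio, text, and image tokens,
    plus a learned 'pseudo last frame' token appended to the conditioning
    sequence (a guess at how such an anchor token could be wired in)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.pseudo_last_frame = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_latents, audio_tokens, text_tokens, image_tokens):
        # Concatenate all conditioning streams with the pseudo-last-frame token,
        # so the noisy video latents can attend to the joint multimodal context.
        b = video_latents.size(0)
        context = torch.cat(
            [audio_tokens, text_tokens, image_tokens,
             self.pseudo_last_frame.expand(b, -1, -1)],
            dim=1,
        )
        tokens = torch.cat([video_latents, context], dim=1)
        q = self.norm(tokens)
        tokens = tokens + self.attn(q, q, q)[0]
        tokens = tokens + self.mlp(self.norm(tokens))
        # Return only the (denoised) video latent positions.
        return tokens[:, : video_latents.size(1)]


# Usage: plan first, encode the plan as text tokens, then denoise video latents.
if __name__ == "__main__":
    plan = plan_with_mllm("hello there!", "a person facing the camera")
    block = MultimodalDiT(dim=256)
    video = torch.randn(1, 16, 256)   # noisy video latents
    audio = torch.randn(1, 8, 256)    # audio features
    text = torch.randn(1, 6, 256)     # encoded SemanticPlan text
    image = torch.randn(1, 4, 256)    # reference image tokens
    out = block(video, audio, text, image)
    print(plan.action_description, out.shape)  # torch.Size([1, 16, 256])
```

In this toy version the Pseudo Last Frame is simply one extra learned token in the conditioning sequence; the paper names the design but does not spell out its mechanics here, so treat this purely as an illustration of joint multimodal conditioning.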

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Avatar generation framework using MLLM semantic planning and specialized MMDiT for coherent character animations aligned with multimodal context.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Novel paradigm simulating deliberative System 2 thinking for video avatar generation beyond reactive behavior
  • MLLM-based agents generating structured semantic guidance for character animations
  • Specialized Multimodal Diffusion Transformer with Pseudo Last Frame design for multimodal signal fusion
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • multimodal large language models
  • diffusion transformers
  • semantic planning
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Minor synthesis-level artifacts and occasional imperfections in motion reasoning
  • Model scaling and stronger dataset quality needed to improve fundamental generation quality
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Continue scaling with larger, higher-quality datasets
  • Investigate more sophisticated integration strategies with end-to-end joint training of DiT and LLM

Author keywords

  • Video Generation
  • Human Animation
  • Avatar
  • Multimedia
