ICLR 2026 Orals

Instilling an Active Mind in Avatars via Cognitive Simulation

Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Weifeng Chen, Xing Wang, Yuan Zhang, Mingyuan Gao

LLMs & Reasoning · Sat, Apr 25 · 10:42 AM–10:52 AM · 201 A/B · Avg rating: 7.00 (range 6–8)
Author-provided TL;DR

This paper introduces a novel framework that uses a Large Language Model (LLM) for semantic guidance and a Multimodal Diffusion Transformer (DiT) for multimodal fusion to generate expressive, context-aware video avatars, demonstrating competitive performance.

Abstract

Current video avatar models can generate fluid animations but struggle to capture a character's authentic essence, primarily synchronizing motion with low-level audio cues instead of understanding higher-level semantics like emotion or intent. To bridge this gap, we propose a novel framework for generating character animations that are not only physically plausible but also semantically rich and expressive. Our model is built on two technical innovations. First, we employ Multimodal Large Language Models to generate a structured textual representation from input conditions, providing high-level semantic guidance for creating contextually and emotionally resonant actions. Second, to ensure robust fusion of multimodal signals, we introduce a specialized Multimodal Diffusion Transformer architecture featuring a novel Pseudo Last Frame design. This allows our model to accurately interpret the joint semantics of audio, images, and text, generating motions that are deeply coherent with the overall context. Comprehensive experiments validate the superiority of our method, which achieves compelling results in lip-sync accuracy, video quality, motion naturalness, and semantic consistency. The approach also shows strong generalization to challenging scenarios, including multi-person and non-human subjects. Our video results are available at https://omnihuman-lab.github.io/v1_5/.
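To make the described pipeline more concrete, below is a minimal, hypothetical sketch of the two-stage design the abstract outlines: an MLLM step that turns input conditions into structured textual guidance, followed by a Multimodal DiT block that fuses audio, text, and reference-image tokens together with a learned "pseudo last frame" token. All names (`SemanticPlan`, `plan_with_mllm`, `MultimodalDiT`) and the exact fusion mechanism are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage pipeline; not the authors' code.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class SemanticPlan:
    """Structured textual guidance from the MLLM planner (hypothetical schema)."""
    emotion: str
    intent: str
    action_description: str


def plan_with_mllm(audio_transcript: str, image_caption: str) -> SemanticPlan:
    """Stand-in for the MLLM call that turns raw conditions into a structured plan.
    A real system would prompt a multimodal LLM; here we return a fixed example."""
    return SemanticPlan(
        emotion="joyful",
        intent="greeting the viewer",
        action_description="smiles, raises right hand and waves",
    )


class MultimodalDiT(nn.Module):
    """Toy Multimodal DiT block: fuses audio, text, and image tokens,
    plus a learned 'pseudo last frame' token appended to the conditioning
    sequence (a guess at how such an anchor token could be wired in)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.pseudo_last_frame = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_latents, audio_tokens, text_tokens, image_tokens):
        # Concatenate all conditioning streams with the pseudo-last-frame token,
        # so the noisy video latents can attend to the joint multimodal context.
        b = video_latents.size(0)
        context = torch.cat(
            [audio_tokens, text_tokens, image_tokens,
             self.pseudo_last_frame.expand(b, -1, -1)],
            dim=1,
        )
        tokens = torch.cat([video_latents, context], dim=1)
        q = self.norm(tokens)
        tokens = tokens + self.attn(q, q, q)[0]
        tokens = tokens + self.mlp(self.norm(tokens))
        # Return only the (denoised) video latent positions.
        return tokens[:, : video_latents.size(1)]


# Usage: plan first, encode the plan as text tokens, then denoise video latents.
if __name__ == "__main__":
    plan = plan_with_mllm("hello there!", "a person facing the camera")
    block = MultimodalDiT(dim=256)
    video = torch.randn(1, 16, 256)   # noisy video latents
    audio = torch.randn(1, 8, 256)    # audio features
    text = torch.randn(1, 6, 256)     # encoded SemanticPlan text
    image = torch.randn(1, 4, 256)    # reference image tokens
    out = block(video, audio, text, image)
    print(plan.action_description, out.shape)  # torch.Size([1, 16, 256])
```

In this toy version the Pseudo Last Frame is simply one extra learned token in the conditioning sequence; the paper names the design but does not spell out its mechanics here, so treat this purely as an illustration of joint multimodal conditioning.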

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Avatar generation framework using MLLM semantic planning and specialized MMDiT for coherent character animations aligned with multimodal context.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Novel paradigm simulating deliberative System 2 thinking for video avatar generation beyond reactive behavior
  • MLLM-based agents generating structured semantic guidance for character animations
  • Specialized Multimodal Diffusion Transformer with Pseudo Last Frame design for multimodal signal fusion
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • multimodal large language models
  • diffusion transformers
  • semantic planning
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Minor synthesis-level artifacts and occasional imperfections in motion reasoning
  • Model scaling and stronger dataset quality needed to improve fundamental generation quality
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Continue scaling with larger, higher-quality datasets
  • Investigate more sophisticated integration strategies with end-to-end joint training of DiT and LLM

Author keywords

  • Video Generation
  • Human Animation
  • Avatar
  • Multimedia
