EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

Dingdong WANG, Shujie LIU, Tianhua Zhang, Youjun Chen, Jinyu Li, Helen M. Meng

LLMs & Reasoning · Thu, Apr 23 · 3:27 PM–3:37 PM · Amphitheater · Avg rating: 6.50 (6–8)

Abstract

Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), similar to conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions, while leaving the LLMs’ expressive and reasoning capabilities underutilized. In this work, we take the first step to reformulate SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, whereas prosodic cues constitute fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base, and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Different from standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces a reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates the overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art evaluation models both in emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning.
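The reward shaping described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration: the linear warm-up schedule, the trust rule (comparing mean reasoning scores of correct and incorrect responses within a group), the additive combination of rewards, and all function names are hypothetical and are not the authors' implementation. Reasoning scores are assumed to be values in [0, 1] produced by some external reward model. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows standard GRPO.

```python
import math
from statistics import mean, pstdev


def progressive_weight(step, warmup_steps, max_weight=1.0):
    """Assumed schedule: ramp the reasoning-reward weight linearly from 0 to max_weight."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)


def trust_weight(outcomes, reasoning_scores):
    """Assumed trust rule: full trust when correct answers receive reasoning scores
    at least as high as incorrect ones on average; otherwise decay with the gap."""
    correct = [r for o, r in zip(outcomes, reasoning_scores) if o == 1.0]
    wrong = [r for o, r in zip(outcomes, reasoning_scores) if o == 0.0]
    if not correct or not wrong:
        # No contrast inside the group, so alignment cannot be measured.
        return 1.0
    gap = mean(correct) - mean(wrong)
    return 1.0 if gap >= 0 else math.exp(gap)


def group_advantages(predictions, gold, reasoning_scores, step, warmup_steps):
    """Group-relative advantages for one prompt's sampled responses."""
    # Rule-based outcome reward: 1 for the correct emotion label, 0 otherwise.
    outcomes = [1.0 if p == gold else 0.0 for p in predictions]
    # Reasoning reward enters progressively and is scaled by the trust weight.
    w = progressive_weight(step, warmup_steps) * trust_weight(outcomes, reasoning_scores)
    rewards = [o + w * r for o, r in zip(outcomes, reasoning_scores)]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # Identical rewards carry no learning signal for this group.
        return [0.0] * len(rewards)
    return [(x - mu) / sigma for x in rewards]


if __name__ == "__main__":
    preds = ["happy", "sad", "happy", "angry"]
    scores = [0.8, 0.3, 0.6, 0.2]  # hypothetical reward-model outputs
    print(group_advantages(preds, "happy", scores, step=500, warmup_steps=1000))
```

In this sketch, early training steps reduce to outcome-only GRPO because the reasoning weight starts at zero, and a group whose reasoning scores contradict the outcomes contributes a down-weighted reasoning reward.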

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

EmotionThinker reformulates speech emotion recognition as deep reasoning with prosody enhancement and specialized reinforcement learning.

Contributions · Auto-generated by claude-haiku-4-5-20251001

First work reformulating speech emotion recognition as a deep reasoning problem through reinforcement learning
Constructs EmotionCoT-35K dataset with chain-of-thought annotations and detailed captions for emotional reasoning
Develops prosody-enhanced foundation model EmotionThinker-Base addressing weak prosody perception in speech LLMs
Introduces GRPO-PTR combining outcome rewards with progressive reasoning rewards dynamically adjusted by trustworthiness

Methods used · Auto-generated by claude-haiku-4-5-20251001

Reinforcement learning
Chain-of-thought reasoning
Prosody enhancement
Reward models

Datasets used · Auto-generated by claude-haiku-4-5-20251001

EmotionCoT-35K

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

Speech Emotion Recognition
Speech LLMs
Speech Processing
Reinforcement Learning

