TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
Christoph Minixhofer, Ondrej Klejch, Peter Bell
With TTSDS2, we introduce a metric and benchmark for TTS, covering 14 languages, which consistently correlates with human judgements.
Abstract
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works, while objective metrics are frequently used but rarely validated against subjective ones. Both kinds of metric are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one of 16 compared metrics to achieve a Spearman correlation above 0.50 with every subjective score in every domain evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: a dataset with over 11,000 subjective opinion score ratings; a pipeline for recreating a multilingual test dataset to avoid data leakage; and a benchmark for TTS in 14 languages.
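The validation described above reduces to a rank-correlation check: for each domain, system-level metric scores are compared against mean subjective ratings. A minimal pure-Python sketch of Spearman correlation (the scores below are hypothetical examples, not figures from the paper):

```python
def rank(values):
    """Return 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical system-level scores: objective metric vs. mean opinion score
metric_scores = [72.1, 88.4, 65.3, 91.0, 79.5]
mos_ratings = [3.4, 4.2, 3.1, 4.5, 3.9]
print(spearman(metric_scores, mos_ratings))  # perfectly monotonic -> 1.0
```

In practice `scipy.stats.spearmanr` does the same computation; the benchmark's threshold of 0.50 corresponds to a moderate monotonic agreement between metric rankings and listener rankings.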
The TTSDS2 metric correlates robustly with human judgments across diverse speech domains, maintaining a Spearman correlation above 0.5.
- TTSDS2 maintains a Spearman correlation above 0.5 with human judgments across all tested domains
- Dataset with 11,000+ subjective opinion score ratings for synthetic speech evaluation
- Benchmark covering 14 languages with automated pipeline preventing data contamination
- Objective TTS metrics
- Spearman correlation analysis
- TTS evaluation datasets in 14 languages
Limitations
- Uses CPU-bound Wasserstein distance computations, limiting efficiency compared to alternatives
- Never surpasses 0.8 Spearman correlation, indicating listening tests contain inherently noisy components
- Does not capture the context utterances were spoken in, nor include long-form samples beyond 30 seconds
Future work
- Explore compute-efficient alternatives such as Maximum Mean Discrepancy
- Address failure cases such as unfaithful transcript reproduction
- Include long-form samples and contextual information
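The efficiency point above concerns how distributional scores are computed: TTSDS-style metrics compare distributions of speech features from synthetic and real audio. For one-dimensional features the Wasserstein-1 distance has a closed form over sorted samples, while Maximum Mean Discrepancy (MMD) with a Gaussian kernel is the alternative named above. A hedged sketch with illustrative random feature values (not from the paper):

```python
import math
import random

def wasserstein_1d(xs, ys):
    # W1 between two equal-size empirical 1-D distributions:
    # mean absolute difference of order statistics
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def mmd_gaussian(xs, ys, sigma=1.0):
    # Squared Maximum Mean Discrepancy with a Gaussian RBF kernel,
    # a kernel-based alternative distance between sample sets
    k = lambda a, b: math.exp(-((a - b) ** 2) / (2 * sigma ** 2))
    kxx = sum(k(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(k(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(k(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

random.seed(0)
real = [random.gauss(0.0, 1.0) for _ in range(200)]   # stand-in for real-speech features
synth = [random.gauss(0.5, 1.0) for _ in range(200)]  # shifted stand-in for synthetic features
print(wasserstein_1d(real, synth), mmd_gaussian(real, synth))
```

Both distances are zero for identical samples and grow as the distributions diverge; the trade-off motivating the future-work item is that MMD admits unbiased minibatch estimators, whereas exact Wasserstein distances require sorting or optimal-transport solves.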
Author keywords
- speech synthesis
- distributional analysis
- objective evaluation
Related orals
On the Wasserstein Geodesic Principal Component Analysis of probability measures
Geodesic PCA for probability distributions using Wasserstein geometry with neural network parametrization for continuous distributions.
TabStruct: Measuring Structural Fidelity of Tabular Data
The TabStruct benchmark evaluates tabular data generators on structural fidelity alongside conventional dimensions, using a global utility metric without requiring ground-truth causal structures.
Monocular Normal Estimation via Shading Sequence Estimation
RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.
World-In-World: World Models in a Closed-Loop World
Introduces closed-loop benchmark evaluating generative world models on embodied task performance rather than visual quality.
EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
Introduces EditBench benchmark for real-world LLM code editing with 545 problems from actual developer usage.