MOS-Naturalness Scale

The MOS-Naturalness score estimated via NISQA-TTS (Non-Intrusive Speech Quality Assessment - Text-To-Speech) represents the perceived naturalness and human-likeness of synthetic speech. It follows the ITU-T P.85 absolute category rating (ACR) protocol, where listeners judge samples on a 1–5 scale:

Score – Label	Description
5 – Completely natural	Indistinguishable from real human speech; no audible synthetic artifacts.
4 – Mostly natural	Minor synthetic cues perceptible, but still convincingly human.
3 – Fairly natural	Noticeably synthetic yet pleasant and intelligible.
2 – Unnatural	Clearly synthetic or robotic; distracting artifacts or poor prosody.
1 – Completely unnatural	Strongly artificial, distorted, or mechanical; unpleasant to hear.
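
Because predicted scores are continuous while the table is defined on whole-number categories, it can help to map a score back to its nearest label. The following is a minimal sketch, not part of NISQA-TTS: the function name `nearest_descriptor` is hypothetical, and rounding to the nearest category is an assumption made purely for illustration.

```python
# Labels copied from the table above, keyed by integer category.
DESCRIPTORS = {
    5: "Completely natural",
    4: "Mostly natural",
    3: "Fairly natural",
    2: "Unnatural",
    1: "Completely unnatural",
}


def nearest_descriptor(score: float) -> str:
    """Return the label of the category closest to a continuous MOS value.

    Hypothetical helper: rounding to the nearest category is an
    illustrative assumption, not a rule defined by the scale itself.
    """
    # Clamp to the theoretical 1-5 range before rounding.
    clamped = min(5.0, max(1.0, score))
    # Round half up; the built-in round() would send halves to the even integer.
    category = int(clamped + 0.5)
    return DESCRIPTORS[category]


print(nearest_descriptor(4.3))  # Mostly natural
```

The label is only a reading aid; the continuous score carries more information than the category it falls closest to.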

Practical Range and Expected Scores

Although the theoretical scale runs from 1 to 5, in real listening tests the effective range is narrower:

Human reference recordings typically score ≈ 4.5 – 4.8, rarely a perfect 5.0.
Modern neural TTS systems (e.g., Tacotron 2, VITS, VALL-E) achieve ≈ 3.9 – 4.4.
Mid-quality synthetic voices cluster around 2.5 – 3.5.
Clearly robotic or degraded speech falls below 2.0.

NISQA-TTS reproduces this empirical distribution: even for flawless natural recordings, predicted values seldom exceed 4.6 – 4.7, while highly natural TTS outputs occupy the 3.5 – 4.3 range.
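
The listening-test bands listed above can be turned into a small lookup for reading a score at a glance. This is a rough sketch under stated assumptions: the function name `practical_band` is hypothetical, the band edges are taken literally from the list, and scores that fall in the gaps between listed bands are reported as such rather than assigned to a neighbor.

```python
# Bands taken literally from the list above: (low, high, label).
BANDS = [
    (4.5, 4.8, "human reference recording"),
    (3.9, 4.4, "modern neural TTS"),
    (2.5, 3.5, "mid-quality synthetic voice"),
]


def practical_band(score: float) -> str:
    """Name the documented band a MOS value falls into, if any.

    Hypothetical helper: the source leaves gaps between bands
    (for example 3.5 to 3.9), so those scores are not forced into a band.
    """
    if score < 2.0:
        return "clearly robotic or degraded"
    for low, high, label in BANDS:
        if low <= score <= high:
            return label
    return "between documented bands"


for value in (4.6, 4.1, 3.7, 1.5):
    print(value, "->", practical_band(value))
```

Treating the gaps explicitly avoids inventing thresholds the text does not state; a real evaluation pipeline would need to choose its own cut-offs.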

Why the Ceiling Is Lower Than 5.0

This compression of the scale is not a modeling flaw but a human-rating phenomenon known as central tendency bias. When people use bounded scales (like the 1–5 MOS), they tend to avoid the extreme options and hesitate to label anything “perfect” or “terrible.” Ratings therefore cluster toward the middle, and because a MOS is the mean of individual ratings, a few listeners choosing 4 instead of 5 is enough to pull the average below 5.0. Consequently, a MOS around 4.5 already corresponds to speech that listeners perceive as fully natural.
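
A short worked example makes the ceiling concrete. A mean opinion score is the arithmetic mean of individual listener ratings, so it reaches 5.0 only if every listener picks the top category. The ratings below are invented for illustration and do not come from any real listening test.

```python
from statistics import mean

# Hypothetical ratings from ten listeners for one flawless human recording.
# Four listeners hold back from the top category, as central tendency predicts.
ratings = [5, 5, 5, 5, 5, 5, 4, 4, 4, 4]

mos = mean(ratings)
print(mos)  # 4.6
```

Even with a clear majority rating the sample as completely natural, the mean lands at 4.6, inside the range reported for human reference recordings.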

Summary

The MOS-Naturalness scale in NISQA-TTS quantifies how natural or human-like speech sounds on a 1–5 scale aligned with ITU-T P.85. In practice, due to listener behavior and dataset calibration, predicted scores span approximately 1.0 – 4.7: human recordings, which score 4.5 – 4.8 in listening tests, are seldom predicted above 4.6 – 4.7, and a score around 4.0 indicates very natural synthetic speech. Scores should therefore be interpreted relatively: not as absolute percentages of perfection, but as perceptually grounded indicators of naturalness.
