MOS-Naturalness Scale
The MOS-Naturalness score estimated by NISQA-TTS (Non-Intrusive Speech Quality Assessment - Text-To-Speech) is a predicted mean opinion score (MOS) for the perceived naturalness and human-likeness of synthetic speech. It follows the ITU-T P.85 absolute category rating (ACR) protocol, in which listeners judge samples on a 1–5 scale:
| Score | Label | Description |
|---|---|---|
| 5 | Completely natural | Indistinguishable from real human speech; no audible synthetic artifacts. |
| 4 | Mostly natural | Minor synthetic cues perceptible, but still convincingly human. |
| 3 | Fairly natural | Noticeably synthetic yet pleasant and intelligible. |
| 2 | Unnatural | Clearly synthetic or robotic; distracting artifacts or poor prosody. |
| 1 | Completely unnatural | Strongly artificial, distorted, or mechanical; unpleasant to hear. |
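A predicted score is usually fractional, so it can help to map it back to the nearest category in the table. The following is a minimal sketch, not part of NISQA-TTS: `naturalness_label` is a hypothetical helper, and rounding half up to the nearest category is an assumption made for illustration.

```python
import math

# Category labels copied from the table above.
LABELS = {
    5: "Completely natural",
    4: "Mostly natural",
    3: "Fairly natural",
    2: "Unnatural",
    1: "Completely unnatural",
}


def naturalness_label(score: float) -> str:
    """Return the label of the category nearest to a MOS value (hypothetical helper)."""
    if not 1.0 <= score <= 5.0:
        raise ValueError("MOS must lie in the range 1-5")
    # Assumption: round half up, so that 4.5 maps to category 5.
    nearest = min(5, math.floor(score + 0.5))
    return LABELS[nearest]


print(naturalness_label(4.3))  # Mostly natural
```

The label is only a reading aid; the fractional score carries more information and should be kept alongside it.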
Practical Range and Expected Scores
Although the theoretical scale runs from 1 to 5, in real listening tests the effective range is narrower:
- Human reference recordings typically score ≈ 4.5 – 4.8, rarely a perfect 5.0.
- Modern neural TTS systems (e.g., Tacotron 2, VITS, VALL-E) achieve ≈ 3.9 – 4.4.
- Mid-quality synthetic voices cluster around 2.5 – 3.5.
- Clearly robotic or degraded speech falls below 2.0.
NISQA-TTS reproduces this empirical distribution: even for flawless natural recordings, predicted values seldom exceed 4.6 – 4.7, while highly natural TTS outputs occupy the 3.5 – 4.3 range.
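The listening-test bands listed above can be turned into a simple lookup. This is a rough sketch under stated assumptions: `practical_band` is a hypothetical helper, the band edges are taken verbatim from the list, and scores that fall between two listed bands are reported as such rather than assigned to either side.

```python
# Listening-test bands quoted in the list above: (low, high, description).
BANDS = [
    (4.5, 4.8, "human reference recording"),
    (3.9, 4.4, "modern neural TTS"),
    (2.5, 3.5, "mid-quality synthetic voice"),
]


def practical_band(mos: float) -> str:
    """Name the listed band a MOS value falls into (hypothetical helper)."""
    if mos < 2.0:
        return "clearly robotic or degraded speech"
    for low, high, name in BANDS:
        if low <= mos <= high:
            return name
    # The source gives no band for these values, so none is invented here.
    return "between the listed bands"


print(practical_band(4.1))  # modern neural TTS
print(practical_band(3.7))  # between the listed bands
```

Because the bands are typical values rather than strict thresholds, the result is best treated as a rough orientation, not a classification.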
Why the Ceiling Is Lower Than 5.0
This compression of the scale is not a modeling flaw but a human-rating phenomenon known as central tendency bias. When people use bounded scales (like the 1–5 MOS), they tend to avoid the extreme options and hesitate to label anything “perfect” or “terrible.” Because a MOS is the mean of all listeners' ratings, it reaches 5.0 only if every listener chooses 5. The result is a clustering of ratings toward the middle and an empirical ceiling below 5.0. Consequently, MOS values around 4.5 already correspond to speech that listeners perceive as fully natural and human.
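A small arithmetic example makes the ceiling effect concrete. The ratings below are invented purely for illustration and do not come from any listening test; they assume ten listeners hear a flawless human recording and four of them hold back from the top category.

```python
from statistics import mean

# Hypothetical ratings from ten listeners for one flawless human recording.
# Six listeners choose 5; four avoid the extreme and choose 4.
ratings = [5, 5, 4, 5, 4, 5, 5, 4, 5, 4]

# The MOS is the arithmetic mean of the individual ratings.
mos = mean(ratings)
print(mos)  # 4.6
```

Even though no listener heard a defect, the mean lands at 4.6, which is why a score in this region is read as human-level naturalness.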
Summary
The MOS-Naturalness scale in NISQA-TTS quantifies how natural or human-like speech sounds on a 1–5 scale aligned with ITU-T P.85. In practice, due to listener behavior and dataset calibration, the usable range is approximately 1.0 – 4.7: human recordings score about 4.5 – 4.8 in listening tests (NISQA-TTS predictions for them seldom exceed 4.6 – 4.7), and scores around 4.0 indicate very natural synthetic speech. Scores should therefore be interpreted relatively: not as absolute percentages of perfection, but as perceptually grounded indicators of naturalness.
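One way to read a score relatively is to compare it with a human reference recording scored under the same conditions, instead of with the theoretical maximum of 5.0. The sketch below assumes both values are already available; `naturalness_gap` is a hypothetical helper, and the example numbers are the typical values quoted in this section, not measurements.

```python
def naturalness_gap(system_mos: float, reference_mos: float) -> float:
    """Gap between a human reference and a TTS system, in MOS points."""
    # Rounded to two decimals to hide floating-point noise.
    return round(reference_mos - system_mos, 2)


# A TTS system at 4.0 against a human reference at 4.6.
print(naturalness_gap(system_mos=4.0, reference_mos=4.6))  # 0.6
```

Measured against the reference, the system is 0.6 points short; measured against 5.0 it would appear a full point short, which overstates the difference.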
