MOS-Naturalness Scale

The MOS-Naturalness score estimated via NISQA-TTS (Non-Intrusive Speech Quality Assessment - Text-To-Speech) represents the perceived naturalness and human-likeness of synthetic speech. It follows the ITU-T P.85 absolute category rating (ACR) protocol, where listeners judge samples on a 1–5 scale:

Score – Label: Descriptor
5 – Completely natural: Indistinguishable from real human speech; no audible synthetic artifacts.
4 – Mostly natural: Minor synthetic cues perceptible, but still convincingly human.
3 – Fairly natural: Noticeably synthetic yet pleasant and intelligible.
2 – Unnatural: Clearly synthetic or robotic; distracting artifacts or poor prosody.
1 – Completely unnatural: Strongly artificial, distorted, or mechanical; unpleasant to hear.
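
Predicted scores are usually fractional, so it can help to map them back to the nearest label on the scale. The following is a minimal sketch, not part of NISQA-TTS itself: the function name `nearest_label` and the round-half-up rule are assumptions made for illustration, while the labels are copied from the scale above.

```python
# Labels copied from the 1-5 scale above.
MOS_LABELS = {
    5: "Completely natural",
    4: "Mostly natural",
    3: "Fairly natural",
    2: "Unnatural",
    1: "Completely unnatural",
}


def nearest_label(score: float) -> str:
    """Return the label of the scale point closest to a fractional MOS.

    Assumption: halves round up (4.5 -> 5). Python's built-in round()
    uses banker's rounding, so it is avoided here.
    """
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"MOS must lie in [1, 5], got {score}")
    return MOS_LABELS[int(score + 0.5)]


if __name__ == "__main__":
    for s in (4.6, 4.1, 3.2, 1.8):
        print(s, "->", nearest_label(s))
```

The label is only a coarse reading aid; the fractional score carries more information and should be the value that is reported.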

Practical Range and Expected Scores

Although the theoretical scale runs from 1 to 5, in real listening tests the effective range is narrower:

  • Human reference recordings typically score ≈ 4.5 – 4.8, rarely a perfect 5.0.
  • Modern neural TTS systems (e.g., Tacotron 2, VITS, VALL-E) achieve ≈ 3.9 – 4.4.
  • Mid-quality synthetic voices cluster around 2.5 – 3.5.
  • Clearly robotic or degraded speech falls below 2.0.

NISQA-TTS reproduces this empirical distribution: even for flawless natural recordings, predicted values seldom exceed 4.6 – 4.7, while highly natural TTS outputs occupy the 3.5 – 4.3 range.
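
The listening-test ranges listed above can be turned into a simple lookup for sanity-checking a score. This is a hedged sketch: the name `practical_band` is hypothetical, the ranges are the listening-test figures quoted in the bullets, and scores that fall in the gaps between those ranges are deliberately reported as `None` rather than assigned to a band the text does not define. NISQA-TTS predictions for TTS output sit slightly lower (about 3.5 – 4.3), so the bands should be read as rough guides.

```python
from typing import Optional

# (lower, upper, description), taken from the listening-test bullets above.
PRACTICAL_BANDS = [
    (4.5, 4.8, "human reference recording"),
    (3.9, 4.4, "modern neural TTS"),
    (2.5, 3.5, "mid-quality synthetic voice"),
]


def practical_band(score: float) -> Optional[str]:
    """Return the practical band a MOS falls into, or None if it lies in a gap."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"MOS must lie in [1, 5], got {score}")
    if score < 2.0:
        # "Clearly robotic or degraded speech falls below 2.0."
        return "clearly robotic or degraded speech"
    for low, high, name in PRACTICAL_BANDS:
        if low <= score <= high:
            return name
    return None  # between the quoted ranges, or above 4.8


if __name__ == "__main__":
    for s in (4.6, 4.1, 3.7, 3.0, 1.5):
        print(s, "->", practical_band(s))
```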

Why the Ceiling Is Lower Than 5.0

This compression of the scale is not a modeling flaw but a human-rating phenomenon known as central tendency bias. When people use bounded scales (like the 1–5 MOS), they tend to avoid the extreme options, hesitating to label anything “perfect” or “terrible.” Because a MOS is the mean of many individual ratings, even a few listeners choosing 4 instead of 5 pull the average down. The result is a clustering of ratings toward the middle and an empirical ceiling below 5.0. Consequently, MOS values around 4.5 already correspond to perfectly natural human speech in subjective perception.
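
A short arithmetic sketch makes the ceiling concrete. The panel below is hypothetical (invented for illustration, not measured data): ten listeners hear a genuine human recording, six rate it 5 and four hold back and rate it 4. The helper name `mean_opinion_score` is likewise an assumption.

```python
def mean_opinion_score(ratings):
    """Average a list of integer ACR ratings (1-5) into a MOS."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)


# Hypothetical panel: 6 listeners choose 5, 4 listeners avoid the extreme.
ratings = [5] * 6 + [4] * 4
mos = mean_opinion_score(ratings)
print(f"MOS = {mos:.1f}")  # MOS = 4.6
```

A MOS of exactly 5.0 would require every listener to pick the top category, which is why it is rarely observed even for real speech.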

Summary

The MOS-Naturalness scale in NISQA-TTS quantifies how natural or human-like speech sounds on a 1–5 scale aligned with ITU-T P.85. In practice, due to listener behavior and dataset calibration, predicted scores span approximately 1.0 – 4.7. Human recordings sit at the top of that range (4.5 – 4.8 in listening tests), and a score around 4.0 indicates very natural synthetic speech. Scores should therefore be interpreted relatively: not as absolute percentages of perfection, but as perceptually grounded indicators of naturalness.

Copyright © 2022-2023 Altered. All rights reserved.