STTS

Scaling Compute-Optimal Text-to-Speech Models


Anonymous Authors

Abstract. We introduce STTS, a text-to-speech (TTS) framework that unifies high-fidelity neural audio compression with in-context language modeling at 44.1 kHz. Building on a single-codebook BigCodec modified with Finite Scalar Quantization (FSQ) and multi-scale adversarial losses, our system compresses speech to 0.8--1.2 kbps while preserving fine prosody and timbre. A decoder-only Transformer then generates codec tokens from text and a short reference utterance, enabling real-time speaker adaptation with no extra fine-tuning. We train models from 40M to 70B parameters on datasets of 1K--100K hours and systematically study how model size, data scale, and bitrate affect intelligibility and speaker similarity. Experiments reveal a 1 kbps “sweet spot” that balances fidelity with manageable sequence lengths and identify predictable scaling laws analogous to text-only LLMs, showing how performance improves under increasing compute. STTS thus provides a compute-optimal strategy for large-scale, high-quality TTS at commercial sampling rates.

Overview

We compress raw 44.1 kHz audio using a single-codebook BigCodec with finite scalar quantization, trained under multi-scale adversarial objectives for low-bit-rate speech reconstruction. The compressed tokens serve as the “audio vocabulary” alongside text tokens for a Llama 3 decoder, which generates output speech tokens autoregressively. This unified model allows in-context speaker adaptation (via reference audio tokens) and text-to-speech generation.

Zero-Shot TTS Samples

Text Prompt STTS
two thousand two hundred twenty two happily happy two hundred and twenty-two
As an aficionado of Scandinavian design, Ole Gunnarsson appreciated the principle of "hygge," evident in his Danish home
Throughout her distinguished diplomatic career spanning five decades and numerous international crises, Ambassador Chen had developed a reputation for finding common ground between opposing factions through careful listening, cultural sensitivity, and an unwavering commitment to humanitarian principle
As the sun dipped below the horizon, casting a golden glow over the ocean, Emily, who had spent her life dreaming of distant shores, stood on the deck of the ship, feeling a mixture of anticipation and nostalgia as her adventure began.
With an ample supply of joie de vivre, Mary danced through the streets of Nice, stopping only to enjoy a nice cafe with a warm croissant.
Can you believe it's been twenty years since we graduated? Sometimes it feels like yesterday we were cramming for finals and planning post-graduation road trips, and other times it seems like several lifetimes ago. I wonder how many of our classmates actually ended up pursuing the careers they thought they would. I certainly never imagined I'd be teaching environmental science in a rural community college, but honestly, I wouldn't change a thing.

Comparitive Analysis

- Speaker Sim↑ WER↓
STTS 0.408 0.055
OpenVoice 0.259 0.003
CosyVoice2 0.514 0.011
VoiceCraft 0.451 0.012
CosyVoice 0.464 0.011


Text Prompt Ground Truth STTS OpenVoice CosyVoice2
The difference in the rainbow depends considerably upon the size of the drops, and the width of the colored band increases as the size of the drops increases.
Promises were not kept.
It is the same in Sweden.
When a man looks for something beyond his reach, his friends say he's looking for the pot of gold at the end of the rainbow.
If the red of the second bow falls upon the green of the first, the result is to give a bow with an abnormally wide yellow band, since red and green light when mixed form yellow.