Group-wise Granularity-ordered Codec for Stable and Efficient Speech Generation
Abstract. Current speech language models require their core component, the speech codec, to discretize continuous speech signals into tokens that not only capture high-level cues for autoregressive modeling but also preserve sufficient acoustic details for perceptual quality. To address this need, we propose Gogo, a group-wise granularity-ordered codec that quantizes each group of frames into tokens arranged from coarse to fine, where coarse tokens encode high-level abstractions and fine tokens progressively recover low-level details. Building on the granularity-ordering property of Gogo, we introduce GogoSpeech, a two-stage speech language model that performs speech generation by first constructing a coarse speech backbone at an extremely low token rate and then enriching the backbone with fine-grained acoustic details. Considering the inherently non-uniform information distribution in speech signals, we further design a Group Relative Policy Optimization (GRPO)-trained token allocator that adaptively allocates token budgets to groups based on group-wise complexity. Experimental results demonstrate that Gogo delivers state-of-the-art reconstruction performance across most metrics at a token rate of 47 tokens per second (tps). Moreover, evaluations on zero-shot text-to-speech tasks show that GogoSpeech enables efficient generation by adaptively reducing the average token rate, and attains state-of-the-art results in long-form speech generation.
System Overview
Figure 1. System overview. Only one group is plotted for simplicity. The shading of a token reflects the granularity of the information it encodes.
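The relationship between per-group token budgets and the average token rate can be illustrated with a small sketch. It is not the Gogo implementation: the linear complexity-to-budget rule is a hypothetical stand-in for the GRPO-trained allocator, and the function names, the budget range, and the group duration of 0.25 s are all assumptions chosen for illustration. The only property taken from the description above is that tokens within a group are ordered from coarse to fine, so a smaller budget keeps a coarse prefix of the group.

```python
from typing import List, Sequence


def allocate_budgets(complexities: Sequence[float],
                     min_tokens: int, max_tokens: int) -> List[int]:
    """Hypothetical heuristic allocator (NOT the learned GRPO policy).

    Maps a per-group complexity score in [0, 1] linearly onto a token
    budget in [min_tokens, max_tokens].
    """
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("require 0 < min_tokens <= max_tokens")
    budgets = []
    for c in complexities:
        c = min(1.0, max(0.0, c))  # clamp the score to [0, 1]
        budgets.append(min_tokens + round(c * (max_tokens - min_tokens)))
    return budgets


def truncate_group(tokens: Sequence[int], budget: int) -> List[int]:
    """Keep the first `budget` tokens of one group.

    Because tokens are assumed to be ordered coarse to fine, the prefix
    retains the coarsest information and drops the finest details.
    """
    if not 0 <= budget <= len(tokens):
        raise ValueError("budget must lie between 0 and the group length")
    return list(tokens[:budget])


def average_token_rate(budgets: Sequence[int], group_duration_s: float) -> float:
    """Average tokens per second over groups of equal duration."""
    if not budgets or group_duration_s <= 0:
        raise ValueError("need at least one group and a positive duration")
    return sum(budgets) / (len(budgets) * group_duration_s)


# Illustrative values only: three groups of 0.25 s with assumed complexity scores.
budgets = allocate_budgets([0.0, 0.5, 1.0], min_tokens=2, max_tokens=8)
print(budgets)
print(average_token_rate(budgets, group_duration_s=0.25))
```

In this toy setting, simple groups receive fewer tokens than complex ones, so the average rate falls below the rate obtained when every group receives the maximum budget.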
Codec Comparison
GT
Gogo (47tps)
SpeechTokenizer (50tps)
MagiCodec (50tps)
X-codec2 (50tps)
TAAE (50tps)
DualCodec (50tps)
Mimi (50tps)
Mimi (75tps)
WavTokenizer (75tps)
EnCodec (75tps)
DAC (75tps)
SNAC (82tps)
Mimi (150tps)
SpeechTokenizer (150tps)
EnCodec (150tps)
DAC (150tps)
EnCodec (600tps)
DAC (600tps)
Zero-Shot TTS Comparison
Columns: Target Text, Prompt, Ground Truth, GogoSpeech, CosyVoice 2, XTTS-v2, Llasa-8B, FireRedTTS-1S, F5-TTS, VoiceCraft. The target text of each row follows.
I like the avenger of the new episode, he really is a badass.
A rich farm is rare in this sandy waste.
It is also used as an initial ingredient in homeopathic remedies.
Get the trust fund to the bank early.
NASA plans to launch the rocket tomorrow.
Long-form Generation
Columns: Target Text, Prompt, Ground Truth, GogoSpeech, CosyVoice 2, XTTS-v2, Llasa-8B, FireRedTTS-1S, F5-TTS, VoiceCraft. The target text of each row follows.
Before guns were invented, armies had to throw bullets at each other and if a bullet touched you, you had to sit out until the next war. And the girl pointed to the south, indicating that it was there the strange man lived.
Kanwal is said to mean "snakes indeed" in a local Aboriginal language. In some species, females are also capable of stridulation.
Later, he expelled the Jews of Strassburg after a community debate. His sentence was suspended when he joined the commandos.
Before it was stopped, Vancouver's Hogan's Alley neighbourhood was largely demolished. He relocated to Bentonville, Arkansas, where he worked for Wal-Mart as a fitness trainer.
Another option is to get a professional certification from a national association. Today, the old observatory is no longer used for research.