Gogo

Group-wise Granularity-ordered Codec for Stable and Efficient Speech Generation

Abstract. Current speech language models require their core component, the speech codec, to discretize continuous speech signals into tokens that not only capture high-level cues for autoregressive modeling but also preserve sufficient acoustic detail for perceptual quality. To address this need, we propose Gogo, a group-wise granularity-ordered codec that quantizes each group of frames into tokens arranged from coarse to fine, where coarse tokens encode high-level abstractions and fine tokens progressively recover low-level details. Building on the granularity-ordering property of Gogo, we introduce GogoSpeech, a two-stage speech language model that performs speech generation by first constructing a coarse speech backbone at an extremely low token rate and then enriching the backbone with fine-grained acoustic details. Considering the inherently non-uniform information distribution in speech signals, we further design a Group Relative Policy Optimization (GRPO)-trained token allocator that adaptively allocates token budgets to groups based on group-wise complexity. Experimental results demonstrate that Gogo delivers state-of-the-art reconstruction performance across most metrics at a token rate of 47 tokens per second. Moreover, evaluations on zero-shot text-to-speech tasks show that GogoSpeech enables efficient generation by adaptively reducing the average token rate, and attains state-of-the-art results in long-form speech generation.
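To make the complexity-aware budgeting idea concrete, the toy sketch below allocates a fixed token budget across frame groups: every group first receives the same number of coarse tokens, and the remaining budget is spent as fine tokens on the most complex groups with diminishing returns. This is only an illustrative greedy heuristic under assumed inputs (per-group complexity scores, a base/max token count per group); the paper's actual allocator is learned with GRPO, and none of the names or numbers here come from the paper.

```python
import numpy as np

def allocate_tokens(complexity, base_tokens, max_tokens, budget):
    """Greedy per-group token allocation (illustrative only, not the
    GRPO-trained allocator from the paper).

    Every group gets `base_tokens` coarse tokens; the remaining budget is
    spent one fine token at a time on the currently most complex group,
    discounting a group's score after each token it receives.
    """
    n = len(complexity)
    alloc = np.full(n, base_tokens)
    remaining = budget - base_tokens * n
    score = np.asarray(complexity, dtype=float)
    while remaining > 0:
        g = int(np.argmax(score))
        if alloc[g] >= max_tokens:       # group is full: exclude it
            score[g] = -np.inf
            if np.isinf(score).all():    # every group capped, stop early
                break
            continue
        alloc[g] += 1
        score[g] *= 0.5                  # diminishing returns per fine token
        remaining -= 1
    return alloc

# example: 4 groups, 2 coarse tokens each, at most 8 per group, budget of 14
print(allocate_tokens([0.9, 0.1, 0.5, 0.3], 2, 8, 14))  # -> [5 2 4 3]
```

The more complex groups (0.9 and 0.5) absorb most of the extra fine tokens while the simple group keeps only its coarse baseline, mirroring how an adaptive allocator can lower the average token rate without starving hard regions.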

Note. If you have trouble playing the audio, please try downloading it and playing it locally.

System Overview

Figure 1. System overview. Only one group is plotted for simplicity. The shading of a token reflects the granularity of the information it encodes.

Codec Comparison

Systems compared (token rate in parentheses): GT (ground truth), Gogo (47 tps), SpeechTokenizer (50 tps), MagiCodec (50 tps), X-codec2 (50 tps), TAAE (50 tps), DualCodec (50 tps), Mimi (50 tps), Mimi (75 tps), WavTokenizer (75 tps), EnCodec (75 tps), DAC (75 tps), SNAC (82 tps), Mimi (150 tps), SpeechTokenizer (150 tps), EnCodec (150 tps), DAC (150 tps), EnCodec (600 tps), DAC (600 tps).

Zero-Shot TTS Comparison

Each row provides the target text, the speaker prompt, the ground truth, and audio from GogoSpeech, CosyVoice 2, XTTS-v2, Llasa-8B, FireRedTTS-1S, F5-TTS, and VoiceCraft. Target texts:
I like the avenger of the new episode, he really is a badass.
A rich farm is rare in this sandy waste.
It is also used as an initial ingredient in homeopathic remedies.
Get the trust fund to the bank early.
NASA plans to launch the rocket tomorrow.


Long-form Generation

Each row provides the target text, the speaker prompt, the ground truth, and audio from GogoSpeech, CosyVoice 2, XTTS-v2, Llasa-8B, FireRedTTS-1S, F5-TTS, and VoiceCraft. Target texts:
Before guns were invented, armies had to throw bullets at each other and if a bullet touched you, you had to sit out until the next war. And the girl pointed to the south, indicating that it was there the strange man lived.
Kanwal is said to mean "snakes indeed" in a local Aboriginal language. In some species, females are also capable of stridulation.
Later, he expelled the Jews of Strassburg after a community debate. His sentence was suspended when he joined the commandos.
Before it was stopped, Vancouver's Hogan's Alley neighbourhood was largely demolished. He relocated to Bentonville, Arkansas, where he worked for Wal-Mart as a fitness trainer.
Another option is to get a professional certification from a national association. Today, the old observatory is no longer used for research.

Effectiveness of Token Allocator

GogoSpeech samples, token rate without -> with the Token Allocator: Sample 1 (47 tps -> 41 tps), Sample 2 (47 tps -> 35 tps), Sample 3 (47 tps -> 33 tps), Sample 4 (47 tps -> 42 tps), Sample 5 (47 tps -> 34 tps). For each sample, audio is provided w/o and w/ the Token Allocator.


Long-form Generation

GogoSpeech samples, token rate without -> with the Token Allocator: Sample 1 (47 tps -> 30 tps), Sample 2 (47 tps -> 35 tps), Sample 3 (47 tps -> 33 tps), Sample 4 (47 tps -> 38 tps), Sample 5 (47 tps -> 40 tps). For each sample, audio is provided w/o and w/ the Token Allocator.