Abstract

Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability modeling, limiting their ability to exploit the non-uniform usage and temporal dependencies of learned speech latents. In this paper, we benchmark neural speech compression from a rate--distortion perspective and further investigate entropy-constrained coding for low-bitrate speech compression. We first formulate a unified learning-based speech coding pipeline and provide a benchmark-style analysis of recent neural speech codecs, showing that explicit probability modeling remains underexplored in learned speech compression. We then propose ECC, an Entropy-Constrained Codec that combines scalar quantization with a learned entropy model. ECC integrates hyperprior-based side information, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling to estimate latent likelihoods for rate estimation during training and arithmetic coding during inference. To further improve low-bitrate efficiency, ECC introduces entropy skip, which omits highly predictable residual symbols using decoder-available scale estimates without transmitting additional skip masks. Extensive experiments show that ECC achieves a favorable low-bitrate rate--distortion trade-off over conventional and neural codec baselines, reducing BD-rate by 39.9\% on ViSQOL and 76.3\% on PESQ on average over two widely-used test sets. Ablation and diagnostic studies further validate the effectiveness of entropy modeling.

Keywords: Speech Compression, Neural Speech Codec, Rate--Distortion Optimization, Entropy-Constrained Coding, Entropy Model

Overview

Teaser overview of the benchmark and proposed entropy-constrained neural speech codec.
Teaser overview of the benchmark and proposed entropy-constrained neural speech codec for content-aware variable-rate speech compression.

Conventional speech codecs rely on hand-designed transform, quantization, and entropy-coding tools, while recent neural codecs learn nonlinear speech representations from data. ECC bridges these two paradigms by integrating learned entropy modeling into neural transform coding, so that the latent representation is optimized for both reconstruction quality and statistical compressibility.

Figures

Taxonomy of recent learning-based speech compression methods.
Taxonomy of recent learning-based speech compression methods, organized by input and output domain, encoder-decoder backbone, quantization and entropy coding, and training objectives.

We organize recent learning-based speech codecs along four axes: input/output domain, encoder-decoder backbone, quantization and entropy coding, and training objectives. This taxonomy highlights how different codec designs represent speech signals, model temporal dependencies, construct discrete symbols, and handle coding costs.

Overview of the proposed entropy-constrained codec for speech compression.
Overview of the proposed entropy-constrained codec (ECC). The transform operates in the STFT domain and the entropy model combines hyperprior features, channel-wise context modeling, and latent residual prediction.

ECC jointly learns the analysis-synthesis transform, the latent probability model, and the reconstruction objective. The waveform is first converted into an STFT representation, then encoded into scalar latents whose probabilities are estimated from hyperprior features and decoded context for rate estimation during training and arithmetic coding during inference.

Channel-wise entropy model.
Channel-wise entropy model with hyperprior side information, sequential latent-slice context prediction, and LRP-based decoded latent refinement.

The channel-wise entropy model decodes hyper-latent side information into mean and scale features, then processes latent slices sequentially. For each slice, Gaussian parameters are predicted from the hyperprior features and previously decoded slices, while latent residual prediction refines the decoded slice before synthesis and subsequent context prediction.

Rate-distortion performance on LibriTTS.
Rate-distortion performance on LibriTTS (test-all).

On LibriTTS test-all, ECC consistently occupies a favorable low-bitrate region across perceptual quality, intelligibility, recognition, and speaker-preservation metrics. This demonstrates the effectiveness of entropy-constrained scalar-latent coding on the in-domain evaluation set.

Rate-distortion performance on VCTK.
Rate-distortion performance on VCTK.

On VCTK, ECC preserves a similar low-bitrate advantage under speaker and recording-condition shifts, indicating that the learned representation and entropy model generalize beyond the LibriTTS training corpus.

MUSHRA subjective test results in the low-bitrate regime.
MUSHRA subjective test results in the low-bitrate regime.

The MUSHRA listening test further confirms the low-bitrate perceptual quality of ECC. ECC obtains high subjective scores on LibriTTS test-clean, LibriTTS test-other, and VCTK, remaining close to the strongest competing neural codec while operating at a lower bitrate.

Audio Samples

Comparison of proposed method (ECC) with other methods across four datasets.

LibriTTS (Clean)

LibriTTS (Other)

VCTK

AIShell-3