Benchmarking Neural Speech Compression from a Rate-Distortion Perspective

Abstract

Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability modeling, limiting their ability to exploit the non-uniform usage and temporal dependencies of learned speech latents. In this paper, we benchmark neural speech compression from a rate-distortion perspective and further investigate entropy-constrained coding for low-bitrate speech compression. We first formulate a unified learning-based speech coding pipeline and provide a benchmark-style analysis of recent neural speech codecs, showing that explicit probability modeling remains underexplored in learned speech compression. We then propose ECC, an Entropy-Constrained Codec that combines scalar quantization with a learned entropy model. ECC integrates hyperprior-based side information, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling to estimate latent likelihoods, which are used for rate estimation during training and for arithmetic coding during inference. To further improve low-bitrate efficiency, ECC introduces entropy skip, which omits highly predictable residual symbols using decoder-available scale estimates without transmitting additional skip masks. Extensive experiments show that ECC achieves a favorable low-bitrate rate-distortion trade-off over conventional and neural codec baselines, reducing BD-rate by 39.9% on ViSQOL and 76.3% on PESQ on average over two widely used test sets. Ablation and diagnostic studies further validate the effectiveness of entropy modeling.
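The interplay of scalar quantization, likelihood-based rate estimation, and entropy skip described above can be illustrated with a minimal sketch. It is not the ECC implementation: the Gaussian likelihood, the function names, and the `skip_threshold` value are assumptions made for illustration, and the means and scales stand in for whatever the learned entropy model would predict.

```python
import math

def gaussian_cdf(x, scale):
    # CDF of a zero-mean Gaussian with the given standard deviation.
    return 0.5 * (1.0 + math.erf(x / (scale * math.sqrt(2.0))))

def symbol_bits(q, scale, floor=1e-9):
    # Estimated bits for integer residual q: the Gaussian mass on
    # [q - 0.5, q + 0.5], floored to avoid log(0).
    p = gaussian_cdf(q + 0.5, scale) - gaussian_cdf(q - 0.5, scale)
    return -math.log2(max(p, floor))

def estimate_rate(latents, means, scales, skip_threshold=0.1):
    # Hypothetical helper: returns (reconstruction, estimated bits).
    # skip_threshold is an assumed hyperparameter, not a value from the paper.
    recon, bits = [], 0.0
    for y, mu, sigma in zip(latents, means, scales):
        if sigma < skip_threshold:
            # Entropy skip: the decoder has the same sigma, so it can make
            # the same decision and use the predicted mean. No mask is sent.
            recon.append(mu)
            continue
        q = round(y - mu)            # scalar quantization of the residual
        bits += symbol_bits(q, sigma)
        recon.append(mu + q)
    return recon, bits

recon, bits = estimate_rate(
    latents=[0.02, 1.7, -0.4],
    means=[0.0, 1.0, 0.0],
    scales=[0.05, 1.0, 0.5],
)
print(recon, round(bits, 3))
```

In this toy input the first latent has a small predicted scale and is skipped at zero cost, while the other two are quantized and charged according to their estimated likelihoods. During training, a differentiable version of the same bit estimate would serve as the rate term; at inference, the same probabilities would drive an arithmetic coder.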

Keywords: Speech Compression, Neural Speech Codec, Rate-Distortion Optimization, Entropy-Constrained Coding, Entropy Model

Overview

Figures

Taxonomy of recent learning-based speech compression methods, organized by input and output domain, encoder-decoder backbone, quantization and entropy coding, and training objectives.

Overview of the proposed entropy-constrained codec (ECC). The transform operates in the STFT domain and the entropy model combines hyperprior features, channel-wise context modeling, and latent residual prediction.

Channel-wise entropy model with hyperprior side information, sequential latent-slice context prediction, and decoded-latent refinement based on latent residual prediction (LRP).

Rate-distortion performance on LibriTTS (test-all).

MUSHRA subjective test results in the low-bitrate regime.

Audio Samples

Comparison of the proposed method (ECC) with other methods across four datasets.

LibriTTS (Clean)

LibriTTS (Other)

VCTK

AIShell-3