SLM-SS: Speech Language Model for Generative Speech Separation


Tianhua Li1, Chenda Li1,2,*, Wei Wang1, Xin Zhou1, Xihui Chen1, Jianqing Gao3, Yanmin Qian1,2,*

1Auditory Cognition and Computational Acoustics Lab
MoE Key Lab of Artificial Intelligence, AI Institute
School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
2VUI Labs
3AI Research Institute, iFLYTEK Company Limited, Hefei, Anhui, China
* denotes corresponding authors

Abstract. Speech separation (SS) has advanced significantly with neural network-based methods, which show strong performance on signal-level metrics. However, these methods often struggle to preserve speech intelligibility in the separated signals, which can degrade downstream tasks such as speech recognition. In this work, we propose SLM-SS, a novel approach that applies speech language models to SS, aiming to enhance the intelligibility and coherence of the separated signals. We frame SS as discrete multi-codebook sequence generation, using an encoder-decoder model to map quantized speech mixtures to target tokens. In addition to the autoregressive modeling strategy, we introduce a non-autoregressive model to improve decoding efficiency for the residual tokens. Experimental results on the LibriMix dataset demonstrate that our approach preserves speech intelligibility significantly better than existing approaches, leading to improved linguistic consistency across a variety of downstream tasks.
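The two-stage token-generation pipeline described in the abstract can be sketched as follows. This is a minimal, hypothetical PyTorch illustration under our own assumptions (the class name, layer sizes, greedy decoding, shared token vocabulary, and BOS convention are ours, not the paper's implementation): an autoregressive encoder-decoder maps the mixture's codec tokens to the target speaker's first-codebook tokens, and a non-autoregressive model then predicts the remaining residual-codebook tokens in parallel; a neural codec decoder such as Encodec would reconstruct the waveform from the stacked tokens.

import torch
import torch.nn as nn

class SLMSeparator(nn.Module):
    """Hypothetical sketch of discrete multi-codebook generation for SS.

    Stage 1 (AR): an encoder-decoder Transformer maps the mixture's
    first-codebook tokens to the target speaker's first-codebook tokens,
    one token at a time.
    Stage 2 (NAR): a non-autoregressive Transformer predicts all remaining
    residual-codebook tokens in parallel, conditioned on the codebooks
    generated so far.
    """

    def __init__(self, vocab_size=1024, n_codebooks=8, d_model=512):
        super().__init__()
        self.n_codebooks = n_codebooks
        self.ar = nn.Transformer(d_model=d_model, batch_first=True)
        self.nar = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        # One embedding table per codebook (mixture and target share it).
        self.embed = nn.ModuleList(
            nn.Embedding(vocab_size, d_model) for _ in range(n_codebooks)
        )
        self.ar_head = nn.Linear(d_model, vocab_size)
        self.nar_heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(1, n_codebooks)
        )

    @torch.no_grad()
    def separate(self, mix_tokens):
        """mix_tokens: (B, n_codebooks, T) discrete codec tokens of the mixture."""
        B, _, T = mix_tokens.shape
        memory = self.ar.encoder(self.embed[0](mix_tokens[:, 0]))

        # Stage 1: greedy autoregressive decoding of the first codebook
        # (token id 0 doubles as BOS in this sketch).
        out = torch.zeros(B, 1, dtype=torch.long, device=mix_tokens.device)
        for _ in range(T):
            h = self.ar.decoder(self.embed[0](out), memory)
            nxt = self.ar_head(h[:, -1:]).argmax(-1)
            out = torch.cat([out, nxt], dim=1)
        codebooks = [out[:, 1:]]  # drop the BOS position

        # Stage 2: one parallel pass per residual codebook, conditioned on
        # the sum of embeddings of all codebooks generated so far.
        for k in range(1, self.n_codebooks):
            cond = sum(self.embed[j](codebooks[j]) for j in range(k))
            h = self.nar(cond + memory)
            codebooks.append(self.nar_heads[k - 1](h).argmax(-1))

        return torch.stack(codebooks, dim=1)  # (B, n_codebooks, T) -> codec decoder

Decoding the residual codebooks non-autoregressively keeps the sequential cost at T steps for the first codebook plus one parallel pass per remaining codebook, rather than T autoregressive steps per codebook, which is the decoding-efficiency gain the abstract refers to.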

Audio demos from different models

Models compared: Groundtruth, Encodec-32, Encodec-8, BSRNN, Sepformer, SLM-SS (ours)