SLM-SS: Speech Language Model for Generative Speech Separation


Tianhua Li1, Chenda Li1,2,*, Wei Wang1, Xin Zhou1, Xihui Chen1, Jianqing Gao3, Yanmin Qian1,2,*

1Auditory Cognition and Computational Acoustics Lab
MoE Key Lab of Artificial Intelligence, AI Institute
School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
2VUI Labs
3AI Research Institute, iFLYTEK Company Limited, Hefei, Anhui, China
* denotes corresponding authors

Abstract. Speech separation (SS) has advanced significantly with neural network-based methods, which show strong performance on signal-level metrics. However, these methods often struggle to preserve speech intelligibility in the separated signals, which can degrade downstream tasks such as speech recognition. In this work, we propose SLM-SS, a novel approach that applies speech language models to SS, aiming to enhance the intelligibility and coherence of the separated signals. We frame SS as discrete multi-codebook sequence generation, using an encoder-decoder model to map quantized speech mixtures to target tokens. In addition to the autoregressive modeling strategy, we introduce a non-autoregressive model to improve decoding efficiency for the residual tokens. Experimental results on the LibriMix dataset demonstrate that our approach preserves speech intelligibility significantly better than existing approaches, leading to improved linguistic consistency across a variety of downstream tasks.
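The two-stage token-generation pipeline described in the abstract can be sketched as follows. This is a minimal, hypothetical PyTorch illustration under our own assumptions (the class name, layer sizes, greedy decoding, shared token vocabulary, and BOS convention are ours, not the paper's implementation): an autoregressive encoder-decoder maps the mixture's codec tokens to the target speaker's first-codebook tokens, and a non-autoregressive model then predicts the remaining residual-codebook tokens in parallel; a neural codec decoder such as Encodec would reconstruct the waveform from the stacked tokens.

import torch
import torch.nn as nn

class SLMSeparator(nn.Module):
    """Hypothetical sketch of discrete multi-codebook generation for SS.

    Stage 1 (AR): an encoder-decoder Transformer maps the mixture's
    first-codebook tokens to the target speaker's first-codebook tokens,
    one token at a time.
    Stage 2 (NAR): a non-autoregressive Transformer predicts all remaining
    residual-codebook tokens in parallel, conditioned on the codebooks
    generated so far.
    """

    def __init__(self, vocab_size=1024, n_codebooks=8, d_model=512):
        super().__init__()
        self.n_codebooks = n_codebooks
        self.ar = nn.Transformer(d_model=d_model, batch_first=True)
        self.nar = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        # One embedding table per codebook (mixture and target share it).
        self.embed = nn.ModuleList(
            nn.Embedding(vocab_size, d_model) for _ in range(n_codebooks)
        )
        self.ar_head = nn.Linear(d_model, vocab_size)
        self.nar_heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(1, n_codebooks)
        )

    @torch.no_grad()
    def separate(self, mix_tokens):
        """mix_tokens: (B, n_codebooks, T) discrete codec tokens of the mixture."""
        B, _, T = mix_tokens.shape
        memory = self.ar.encoder(self.embed[0](mix_tokens[:, 0]))

        # Stage 1: greedy autoregressive decoding of the first codebook
        # (token id 0 doubles as BOS in this sketch).
        out = torch.zeros(B, 1, dtype=torch.long, device=mix_tokens.device)
        for _ in range(T):
            h = self.ar.decoder(self.embed[0](out), memory)
            nxt = self.ar_head(h[:, -1:]).argmax(-1)
            out = torch.cat([out, nxt], dim=1)
        codebooks = [out[:, 1:]]  # drop the BOS position

        # Stage 2: one parallel pass per residual codebook, conditioned on
        # the sum of embeddings of all codebooks generated so far.
        for k in range(1, self.n_codebooks):
            cond = sum(self.embed[j](codebooks[j]) for j in range(k))
            h = self.nar(cond + memory)
            codebooks.append(self.nar_heads[k - 1](h).argmax(-1))

        return torch.stack(codebooks, dim=1)  # (B, n_codebooks, T) -> codec decoder

Decoding the residual codebooks non-autoregressively keeps the sequential cost at T steps for the first codebook plus one parallel pass per remaining codebook, rather than T autoregressive steps per codebook, which is the decoding-efficiency gain the abstract refers to.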

Audio demos from different models

Models compared: Groundtruth, Encodec-32, Encodec-8, BSRNN, Sepformer, SLM-SS (ours)