VocEmb4SVS: Improving Singing Voice Separation with Vocal Embeddings

The demo page of VocEmb4SVS displayed by Chenyi Li.

Chenyi Li, Yi Li, Xuhao Du, Yaolong Ju, Shichao Hu, Zhiyong Wu

Abstract

Deep learning-based methods have shown promising performance on singing voice separation (SVS). Recently, embeddings related to lyrics and voice activities have proven effective in improving the performance of SVS tasks. However, embeddings related to singers have not been studied before. In this paper, we propose VocEmb4SVS, an SVS framework that uses vocal embeddings of the singer as auxiliary knowledge for SVS conditioning. First, a pre-trained separation network is employed to obtain pre-separated vocals from the mixed music signals. Second, a vocal encoder is trained to extract vocal embeddings from the pre-separated vocals. Finally, the vocal embeddings are integrated into the separation network to improve SVS performance. Experimental results show that our proposed method achieves state-of-the-art results on the MUSDB18 dataset with an SDR of 9.56 dB on vocals.

The model is described in our APSIPA 2022 paper.
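The three steps named in the abstract (pre-separation, vocal embedding extraction, conditioned separation) can be read as a simple data flow. The sketch below only illustrates that flow: every function in it is a hypothetical stand-in built from toy arithmetic, not one of the networks from the paper, and the embedding size and conditioning mechanism are assumptions made purely so the example runs.

```python
import numpy as np


def pre_separate(mixture):
    """Hypothetical stand-in for the pre-trained separator (rough vocal estimate)."""
    return 0.5 * mixture


def encode_vocals(vocals, dim=8):
    """Hypothetical stand-in for the vocal encoder: one fixed-size vector per clip."""
    mono = vocals.mean(axis=0)            # average the channels
    chunks = np.array_split(mono, dim)    # dim segments along time
    return np.array([np.sqrt(np.mean(c ** 2)) for c in chunks])  # RMS per segment


def separate_with_embedding(mixture, embedding):
    """Hypothetical stand-in for the conditioned separator.

    The embedding is used as auxiliary input; here it only sets a scalar gain.
    """
    gain = 1.0 / (1.0 + np.exp(-embedding.mean()))
    return gain * mixture


def vocemb_pipeline(mixture):
    """Run the three steps in order on a (channels, samples) array."""
    rough_vocals = pre_separate(mixture)               # step 1
    embedding = encode_vocals(rough_vocals)            # step 2
    vocals = separate_with_embedding(mixture, embedding)  # step 3
    return vocals, embedding
```

The point of the sketch is the ordering: the embedding is computed from the pre-separated vocals rather than from the raw mixture, and the final separation sees both the mixture and the embedding.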

Introduction of the demo

Since we aim to reduce both the distortion of vocals and the interference of accompaniments, our demos compare a separator P enhanced with VocEmb4SVS against the original separator P, where P is either ResUNetDecouple+ or HDemucs. In addition, we show vocal extraction results for music that contains sound effects. The demos below are grouped by the main improvement they illustrate. All examples are binaural music clips cut from the MUSDB18 dataset. "VocEmb4SVS (Separator P)" denotes our proposed method VocEmb4SVS applied to Separator P, and "Reference" denotes the ground-truth vocals of the mixed music, taken from the open MUSDB18 dataset.
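To make the comparison against the Reference concrete, a signal-to-distortion ratio can be computed between the ground-truth vocals and a separated estimate. The sketch below uses the plain energy-ratio definition, SDR = 10 log10(||s||^2 / ||s - s_hat||^2). This is a simplified assumption: evaluation toolkits for MUSDB18 use a BSS Eval style SDR, so this function is not necessarily the exact metric behind the figure reported in the abstract. The function name and the `eps` parameter are illustrative.

```python
import numpy as np


def simple_sdr(reference, estimate, eps=1e-10):
    """Plain energy-ratio SDR in dB (simplified; not the BSS Eval variant)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero and log of zero
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))
```

A higher value means the estimate is closer to the reference; both residual accompaniment and vocal distortion increase the error term and lower the score.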

For brevity, we only show the spectrograms and waveforms of the left channel. To see the complete binaural spectrograms and waveforms, please click on Expand binaural images.
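For readers who want to reproduce a left-channel view from their own stereo clips, a minimal NumPy sketch follows. It assumes the audio is already loaded as an array of shape (num_samples, 2) with the left channel in column 0, and it uses an arbitrary Hann window of 1024 samples with a hop of 256; these settings are illustrative and are not claimed to match the ones used for the images on this page.

```python
import numpy as np


def left_channel(stereo):
    """Return the left channel of a (num_samples, 2) array (column 0 assumed left)."""
    return stereo[:, 0]


def magnitude_spectrogram(signal, n_fft=1024, hop=256):
    """Magnitude STFT with a Hann window; requires len(signal) >= n_fft.

    Returns an array of shape (n_fft // 2 + 1, num_frames).
    """
    window = np.hanning(n_fft)
    num_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(num_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

The resulting matrix can be converted to decibels and displayed with any plotting library; the waveform view is simply the output of `left_channel` plotted against time.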

Reducing the distortion of separated vocals

Case 1

[Spectrogram and waveform panels, with an expandable binaural view: Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), Reference]

Case 2

[Spectrogram and waveform panels, with an expandable binaural view: Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), Reference]

Reducing the interference of accompaniments

Case 3

[Spectrogram and waveform panels, with an expandable binaural view: Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), Reference]

Case 4

[Spectrogram and waveform panels, with an expandable binaural view: Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), Reference]

Case 5

[Spectrogram and waveform panels, with an expandable binaural view: Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), Reference]

Case 6

[Spectrogram and waveform panels, with an expandable binaural view: Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), Reference]

Extract clean vocals from scenes with sound effects

In this part, we list music clips that contain sound effects. The results show that the separation systems can extract clean vocals while suppressing the sound effects.