VocEmb4SVS: Improving Singing Voice Separation with Vocal Embeddings
Chenyi Li, Yi Li, Xuhao Du, Yaolong Ju, Shichao Hu, Zhiyong Wu
Abstract
Deep learning-based methods have shown promising performance on singing voice separation (SVS). Recently, embeddings related to lyrics and voice activity have proven effective in improving SVS performance. However, embeddings related to singers have not yet been studied. In this paper, we propose VocEmb4SVS, an SVS framework that uses vocal embeddings of the singer as auxiliary knowledge for SVS conditioning. First, a pre-trained separation network is employed to obtain pre-separated vocals from the mixed music signals. Second, a vocal encoder is trained to extract vocal embeddings from the pre-separated vocals. Finally, the vocal embeddings are integrated into the separation network to improve SVS performance. Experimental results show that our proposed method achieves state-of-the-art results on the MUSDB18 dataset, with an SDR of 9.56 dB on vocals.
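For readers unfamiliar with the SDR figure quoted in the abstract, the following is a minimal sketch of the basic signal-to-distortion ratio, 10 log10(||s||^2 / ||s - s_hat||^2), in NumPy. It is an illustration only: the page does not state which SDR variant was used, published MUSDB18 numbers are commonly computed with framewise BSS Eval tooling that this sketch does not reproduce, and the function name `sdr_db` and the toy signals are assumptions introduced here.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-10):
    """Basic SDR in dB: energy of the reference over energy of the error.

    Hypothetical helper for illustration; not the evaluation code behind
    the 9.56 dB figure.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero and log of zero.
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))


# Toy example: a 440 Hz "vocal" plus a residual at one tenth of its amplitude.
t = np.linspace(0.0, 1.0, 44100, endpoint=False)
reference = np.sin(2.0 * np.pi * 440.0 * t)
estimate = reference + 0.1 * np.sin(2.0 * np.pi * 880.0 * t)
toy_sdr = sdr_db(reference, estimate)
print(f"toy SDR: {toy_sdr:.2f} dB")
```

In this toy case the error has one hundredth of the reference energy, so the ratio is about 20 dB. Higher values mean the estimate is closer to the reference vocals.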
Since we aim to reduce the distortion of vocals and the interference of accompaniments, our demos compare a separator P with VocEmb4SVS applied against the original separator P. In addition, we show the vocal extraction results when the music contains sound effects. The demos of vocals separated by the different methods are listed below, grouped by the main improvement they illustrate. All clips are binaural music excerpts cut from the MUSDB18 dataset. "VocEmb4SVS (Separator P)" denotes our proposed method VocEmb4SVS applied to Separator P, and "Reference" denotes the ground-truth vocals of the mixed music, taken from the open MUSDB18 dataset.
For brevity, we show only the spectrograms and waveforms of the left channel. To see the complete binaural spectrograms and waveforms, click Expand binaural images.
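As an illustration of what a left-channel spectrogram involves, here is a minimal NumPy sketch that takes a two-channel clip, keeps the left channel, and computes a magnitude short-time Fourier transform. The page does not say how its spectrograms were produced, so everything here is an assumption: the function name `left_channel_spectrogram`, the convention that column 0 is the left channel, the Hann window, and the `n_fft` and `hop` values.

```python
import numpy as np


def left_channel_spectrogram(stereo, n_fft=2048, hop=512):
    """Magnitude STFT of the left channel.

    stereo: array of shape (num_samples, 2); column 0 is assumed to be left.
    Returns an array of shape (n_fft // 2 + 1, num_frames).
    """
    left = np.asarray(stereo, dtype=np.float64)[:, 0]
    if len(left) < n_fft:
        raise ValueError("clip is shorter than one analysis frame")
    window = np.hanning(n_fft)
    num_frames = 1 + (len(left) - n_fft) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [left[i * hop : i * hop + n_fft] * window for i in range(num_frames)]
    )
    # rfft keeps only the non-negative frequency bins of each frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T


# Toy stereo clip: 440 Hz on the left channel, 880 Hz on the right.
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
stereo = np.stack(
    [np.sin(2.0 * np.pi * 440.0 * t), np.sin(2.0 * np.pi * 880.0 * t)], axis=1
)
spec = left_channel_spectrogram(stereo)
peak_bin = int(np.argmax(spec.mean(axis=1)))
print(spec.shape, peak_bin * sample_rate / 2048)
```

The peak bin lands near 440 Hz rather than 880 Hz, which confirms that only the left channel was analysed.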
Reducing the distortion of separated vocals
Case 1
Mixture
ResUNetDecouple+
VocEmb4SVS (ResUNetDecouple+)
HDemucs
VocEmb4SVS (HDemucs)
Reference
Expand binaural images (spectrograms and waveforms for Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), and Reference)
Case 2
Mixture
ResUNetDecouple+
VocEmb4SVS (ResUNetDecouple+)
HDemucs
VocEmb4SVS (HDemucs)
Reference
Expand binaural images (spectrograms and waveforms for Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), and Reference)
Reducing the interference of accompaniments
Case 3
Mixture
ResUNetDecouple+
VocEmb4SVS (ResUNetDecouple+)
HDemucs
VocEmb4SVS (HDemucs)
Reference
Expand binaural images (spectrograms and waveforms for Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), and Reference)
Case 4
Mixture
ResUNetDecouple+
VocEmb4SVS (ResUNetDecouple+)
HDemucs
VocEmb4SVS (HDemucs)
Reference
Expand binaural images (spectrograms and waveforms for Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), and Reference)
Case 5
Mixture
ResUNetDecouple+
VocEmb4SVS (ResUNetDecouple+)
HDemucs
VocEmb4SVS (HDemucs)
Reference
Expand binaural images (spectrograms and waveforms for Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), and Reference)
Case 6
Mixture
ResUNetDecouple+
VocEmb4SVS (ResUNetDecouple+)
HDemucs
VocEmb4SVS (HDemucs)
Reference
Expand binaural images (spectrograms and waveforms for Mixture, ResUNetDecouple+, VocEmb4SVS (ResUNetDecouple+), HDemucs, VocEmb4SVS (HDemucs), and Reference)
Extracting clean vocals from music with sound effects
This part lists music clips that contain sound effects. The results show that the separation systems can extract clean vocals while suppressing the sound effects.