VocEmb4SVS: Improving Singing Voice Separation with Vocal Embeddings

Demo page for VocEmb4SVS, presented by Chenyi Li.


Chenyi Li, Yi Li, Xuhao Du, Yaolong Ju, Shichao Hu, Zhiyong Wu

Abstract

Deep learning-based methods have shown promising performance on singing voice separation (SVS). Recently, embeddings related to lyrics and voice activity have proven effective for improving SVS performance. However, embeddings related to singers have not yet been studied. In this paper, we propose VocEmb4SVS, an SVS framework that uses vocal embeddings of the singer as auxiliary knowledge for SVS conditioning. First, a pre-trained separation network is employed to obtain pre-separated vocals from the mixed music signals. Second, a vocal encoder is trained to extract vocal embeddings from the pre-separated vocals. Finally, the vocal embeddings are integrated into the separation network to improve SVS performance. Experimental results show that our proposed method achieves state-of-the-art results on the MUSDB18 dataset, with a signal-to-distortion ratio (SDR) of 9.56 dB on vocals.

The model is described in our APSIPA 2022 paper.
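
The abstract reports vocal quality as SDR in dB. As a rough illustration of what that number measures, the sketch below computes the basic signal-to-distortion ratio, 10 * log10(||s||^2 / ||s - s_hat||^2), between a reference and an estimate. This is only a minimal sketch: the function name `sdr_db`, the `eps` constant, and the toy sine signal are assumptions made for illustration. Published MUSDB18 scores are usually produced by dedicated evaluation toolkits that aggregate over frames and tracks, and the exact protocol behind the 9.56 dB figure is not stated on this page.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Basic signal-to-distortion ratio in dB for two equal-length mono signals.

    Computes 10 * log10(||s||^2 / ||s - s_hat||^2). The eps term (an assumption
    of this sketch) avoids division by zero and log of zero on silent input.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy example (not MUSDB18 data): a 440 Hz tone and an estimate that is
# 10% too quiet, so the error is a scaled copy of the reference.
sample_rate = 8000
reference = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
estimate = [0.9 * r for r in reference]
print(round(sdr_db(reference, estimate), 2))
```

A higher value means the estimate is closer to the reference. In the demos below, "Reference" plays the role of `reference` and each separated vocal plays the role of `estimate`.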


Introduction to the demo

VocEmb4SVS aims to reduce both the distortion of separated vocals and the interference from accompaniments, so the demos compare a separator P with VocEmb4SVS applied against the original separator P (here, ResUNetDecouple+ or HDemucs). We also show vocal extraction on music that contains sound effects. The separated vocals are grouped below by the main improvement each case illustrates. All clips are binaural (two-channel) excerpts from the MUSDB18 dataset. "VocEmb4SVS (Separator P)" denotes our proposed method applied to Separator P, and "Reference" denotes the ground-truth vocals of the mixture, taken from MUSDB18.

For brevity, only the left-channel spectrograms and waveforms are shown by default. To see the spectrograms and waveforms of both channels, click "Expand binaural images".
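
For readers who want to produce a comparable left-channel view from their own two-channel clips, the sketch below computes a log-magnitude spectrogram of channel 0 with a plain short-time Fourier transform in NumPy. It is an illustration under stated assumptions, not the code used to render the images on this page: the function name `left_channel_spectrogram_db`, the window length, hop size, Hann window, and the synthetic test tone are all hypothetical choices.

```python
import numpy as np


def left_channel_spectrogram_db(stereo, n_fft=1024, hop=256, eps=1e-10):
    """Log-magnitude STFT (in dB) of channel 0 of a (num_samples, 2) array.

    Returns an array of shape (n_fft // 2 + 1, num_frames).
    n_fft, hop, and eps are illustrative defaults, not values from the paper.
    """
    left = np.asarray(stereo, dtype=np.float64)[:, 0]
    if left.shape[0] < n_fft:
        # Zero-pad very short clips so at least one frame exists.
        left = np.pad(left, (0, n_fft - left.shape[0]))
    window = np.hanning(n_fft)
    num_frames = 1 + (left.shape[0] - n_fft) // hop
    frames = np.stack(
        [left[i * hop : i * hop + n_fft] * window for i in range(num_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(magnitude + eps).T


# Synthetic two-channel clip: a 1 kHz tone on the left, silence on the right.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
stereo = np.stack([np.sin(2 * np.pi * 1000 * t), np.zeros_like(t)], axis=1)

spec = left_channel_spectrogram_db(stereo)
peak_bin = int(spec.mean(axis=1).argmax())
print(spec.shape, peak_bin * sample_rate / 1024)
```

Passing column 1 instead of column 0 would give the other channel, which corresponds to the additional images shown in the expanded view.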

Reducing the distortion of separated vocals

Case 1

Mixture mix08s1 mix08wv1
ResUNetDecouple+ res08s1 res08wv1
VocEmb4SVS (ResUNetDecouple+) pre08s1 pre08wv1
HDemucs hdemucs08s1 hdemucs08wv1
VocEmb4SVS (HDemucs) hu08s1 hu08wv1
Reference clean08s1 clean08wv1
Expand binaural images
Mixture mix08s2 mix08wv2
ResUNetDecouple+ res08s2 res08wv2
VocEmb4SVS (ResUNetDecouple+) pre08s2 pre08wv2
HDemucs hdemucs08s2 hdemucs08wv2
VocEmb4SVS (HDemucs) hu08s2 hu08wv2
Reference clean08s2 clean08wv2

Mixture ResUNetDecouple+ VocEmb4SVS (ResUNetDecouple+)
Reference HDemucs VocEmb4SVS (HDemucs)

Case 2

Mixture mix03s1 mix03wv1
ResUNetDecouple+ res03s1 res03wv1
VocEmb4SVS (ResUNetDecouple+) pre03s1 pre03wv1
HDemucs hdemucs03s1 hdemucs03wv1
VocEmb4SVS (HDemucs) hu03s1 hu03wv1
Reference clean03s1 clean03wv1
Expand binaural images
Mixture mix03s2 mix03wv2
ResUNetDecouple+ res03s2 res03wv2
VocEmb4SVS (ResUNetDecouple+) pre03s2 pre03wv2
HDemucs hdemucs03s2 hdemucs03wv2
VocEmb4SVS (HDemucs) hu03s2 hu03wv2
Reference clean03s2 clean03wv2

Mixture ResUNetDecouple+ VocEmb4SVS (ResUNetDecouple+)
Reference HDemucs VocEmb4SVS (HDemucs)

Reducing the interference of accompaniments

Case 3

Mixture mix02s1 mix02wv1
ResUNetDecouple+ res02s1 res02wv1
VocEmb4SVS (ResUNetDecouple+) pre02s1 pre02wv1
HDemucs hdemucs02s1 hdemucs02wv1
VocEmb4SVS (HDemucs) hu02s1 hu02wv1
Reference clean02s1 clean02wv1
Expand binaural images
Mixture mix02s2 mix02wv2
ResUNetDecouple+ res02s2 res02wv2
VocEmb4SVS (ResUNetDecouple+) pre02s2 pre02wv2
HDemucs hdemucs02s2 hdemucs02wv2
VocEmb4SVS (HDemucs) hu02s2 hu02wv2
Reference clean02s2 clean02wv2

Mixture ResUNetDecouple+ VocEmb4SVS (ResUNetDecouple+)
Reference HDemucs VocEmb4SVS (HDemucs)

Case 4

Mixture mix06s1 mix06wv1
ResUNetDecouple+ res06s1 res06wv1
VocEmb4SVS (ResUNetDecouple+) pre06s1 pre06wv1
HDemucs hdemucs06s1 hdemucs06wv1
VocEmb4SVS (HDemucs) hu06s1 hu06wv1
Reference clean06s1 clean06wv1
Expand binaural images
Mixture mix06s2 mix06wv2
ResUNetDecouple+ res06s2 res06wv2
VocEmb4SVS (ResUNetDecouple+) pre06s2 pre06wv2
HDemucs hdemucs06s2 hdemucs06wv2
VocEmb4SVS (HDemucs) hu06s2 hu06wv2
Reference clean06s2 clean06wv2

Mixture ResUNetDecouple+ VocEmb4SVS (ResUNetDecouple+)
Reference HDemucs VocEmb4SVS (HDemucs)

Case 5

Mixture mix07s1 mix07wv1
ResUNetDecouple+ res07s1 res07wv1
VocEmb4SVS (ResUNetDecouple+) pre07s1 pre07wv1
HDemucs hdemucs07s1 hdemucs07wv1
VocEmb4SVS (HDemucs) hu07s1 hu07wv1
Reference clean07s1 clean07wv1
Expand binaural images
Mixture mix07s2 mix07wv2
ResUNetDecouple+ res07s2 res07wv2
VocEmb4SVS (ResUNetDecouple+) pre07s2 pre07wv2
HDemucs hdemucs07s2 hdemucs07wv2
VocEmb4SVS (HDemucs) hu07s2 hu07wv2
Reference clean07s2 clean07wv2

Mixture ResUNetDecouple+ VocEmb4SVS (ResUNetDecouple+)
Reference HDemucs VocEmb4SVS (HDemucs)

Case 6

Mixture mix01s1 mix01wv1
ResUNetDecouple+ res01s1 res01wv1
VocEmb4SVS (ResUNetDecouple+) pre01s1 pre01wv1
HDemucs hdemucs01s1 hdemucs01wv1
VocEmb4SVS (HDemucs) hu01s1 hu01wv1
Reference clean01s1 clean01wv1
Expand binaural images
Mixture mix01s2 mix01wv2
ResUNetDecouple+ res01s2 res01wv2
VocEmb4SVS (ResUNetDecouple+) pre01s2 pre01wv2
HDemucs hdemucs01s2 hdemucs01wv2
VocEmb4SVS (HDemucs) hu01s2 hu01wv2
Reference clean01s2 clean01wv2

Mixture ResUNetDecouple+ VocEmb4SVS (ResUNetDecouple+)
Reference HDemucs VocEmb4SVS (HDemucs)

Extracting clean vocals from music with sound effects

This part lists music clips that contain sound effects. The results show that the separation systems can extract clean vocals while reducing the sound effects.

Case 7

Mixture mix11s1 mix11wv1
ResUNetDecouple+ res11s1 res11wv1
VocEmb4SVS (ResUNetDecouple+) pre11s1 pre11wv1
HDemucs hdemucs11s1 hdemucs11wv1
VocEmb4SVS (HDemucs) hu11s1 hu11wv1
Reference clean11s1 clean11wv1
Expand binaural images
Mixture mix11s2 mix11wv2
ResUNetDecouple+ res11s2 res11wv2
VocEmb4SVS (ResUNetDecouple+) pre11s2 pre11wv2
HDemucs hdemucs11s2 hdemucs11wv2
VocEmb4SVS (HDemucs) hu11s2 hu11wv2
Reference clean11s2 clean11wv2

Mixture ResUNetDecouple+ VocEmb4SVS (ResUNetDecouple+)
Reference HDemucs VocEmb4SVS (HDemucs)

Case 8

Mixture mix12s1 mix12wv1
ResUNetDecouple+ res12s1 res12wv1
VocEmb4SVS (ResUNetDecouple+) pre12s1 pre12wv1
HDemucs hdemucs12s1 hdemucs12wv1
VocEmb4SVS (HDemucs) hu12s1 hu12wv1
Reference clean12s1 clean12wv1
Expand binaural images
Mixture mix12s2 mix12wv2
ResUNetDecouple+ res12s2 res12wv2
VocEmb4SVS (ResUNetDecouple+) pre12s2 pre12wv2
HDemucs hdemucs12s2 hdemucs12wv2
VocEmb4SVS (HDemucs) hu12s2 hu12wv2
Reference clean12s2 clean12wv2

Mixture ResUNetDecouple+ VocEmb4SVS (ResUNetDecouple+)
Reference HDemucs VocEmb4SVS (HDemucs)

Case 9

Mixture mix05s1 mix05wv1
ResUNetDecouple+ res05s1 res05wv1
VocEmb4SVS (ResUNetDecouple+) pre05s1 pre05wv1
HDemucs hdemucs05s1 hdemucs05wv1
VocEmb4SVS (HDemucs) hu05s1 hu05wv1
Reference clean05s1 clean05wv1
Expand binaural images
Mixture mix05s2 mix05wv2
ResUNetDecouple+ res05s2 res05wv2
VocEmb4SVS (ResUNetDecouple+) pre05s2 pre05wv2
HDemucs hdemucs05s2 hdemucs05wv2
VocEmb4SVS (HDemucs) hu05s2 hu05wv2
Reference clean05s2 clean05wv2

Mixture ResUNetDecouple+ VocEmb4SVS (ResUNetDecouple+)
Reference HDemucs VocEmb4SVS (HDemucs)

Case 10

Mixture mix04s1 mix04wv1
ResUNetDecouple+ res04s1 res04wv1
VocEmb4SVS (ResUNetDecouple+) pre04s1 pre04wv1
HDemucs hdemucs04s1 hdemucs04wv1
VocEmb4SVS (HDemucs) hu04s1 hu04wv1
Reference clean04s1 clean04wv1
Expand binaural images
Mixture mix04s2 mix04wv2
ResUNetDecouple+ res04s2 res04wv2
VocEmb4SVS (ResUNetDecouple+) pre04s2 pre04wv2
HDemucs hdemucs04s2 hdemucs04wv2
VocEmb4SVS (HDemucs) hu04s2 hu04wv2
Reference clean04s2 clean04wv2

Mixture ResUNetDecouple+ VocEmb4SVS (ResUNetDecouple+)
Reference HDemucs VocEmb4SVS (HDemucs)