kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization
Introduction
Let's start with a real use case. Say we want to convert the following Spanish bass to the following Spanish soprano. Because both are parts of a choral ensemble, their recordings have high phoneme overlap, which makes a good test bed for zero-shot singing voice conversion (SVC).
src
ref
Here are the conversion results. None of the baselines or ablations was trained on Spanish audio.
knn-svc w/o {AS, OPT}
NeuCoSVC
knn-svc w/o {OPT}
knn-svc
knn-svc w/o {AS, OPT}: Sounds much duller than the others and has a constant ringing.
NeuCoSVC: Occasionally glitches (e.g. at 00:07). The soprano also sounds thinner, likely because the model was trained on a Mandarin corpus (OpenSinger).
knn-svc vs knn-svc w/o {OPT}: Trembling (e.g. around 00:12, 00:26, 00:33, and 00:51) decreases significantly in knn-svc. This is the main symptom that the concatenation smoothness optimization (OPT) is intended to treat.
Ablation and Model Comparisons
This section provides samples for the ablation and model comparison section of the paper (Table 1). "ls" refers to the LibriSpeech dataset, "os" to the OpenSinger dataset, and "nus48e" to the NUS-48E dataset.
src
ref (not the entire pool)
knn-vc
knn-svc w/o {AS, OPT}
NeuCoSVC
knn-svc w/o {OPT}
knn-svc
ls → ls
--
ls → ls
--
ls → ls
--
os → os
--
os → os
--
os → nus48e
--
os → nus48e
--
Duration Study
This section provides examples for the duration study, which is not in the paper due to the length limit. The durations listed are those of the reference audio.
src
ref (not the entire pool)
5s
10s
30s
60s
90s
full (~600s)
NeuCoSVC os → os
knn-svc os → os
NeuCoSVC os → os
knn-svc os → os
NeuCoSVC os → nus
knn-svc os → nus
NeuCoSVC os → nus
knn-svc os → nus
Bonus
In this section, we try knn-svc on various real use cases.
src (English Male Speaker)
ref (English Female Speaker)
knn-vc converted
knn-svc converted
src (French Male Singer)
ref (English Male Singer, with guitar)
knn-svc converted
src (Japanese Male Singer, with vocal extraction)