kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization



Introduction

Let's start with a real use case. Suppose we want to convert the Spanish bass below (src) to the Spanish soprano below (ref). Both are parts of a choral ensemble, so their phoneme overlap is high, which makes for a good test environment for zero-shot singing voice conversion (SVC).

src
ref

Here are the conversion results. None of our baselines or ablations was trained on Spanish audio.

knn-svc w/o {AS, OPT}
NeuCoSVC
knn-svc w/o {OPT}
knn-svc


Ablation and Model Comparisons

In this section, we provide samples for the ablation and model comparison section of the paper (Table 1). "ls" refers to the LibriSpeech dataset, "os" to the OpenSinger dataset, and "nus48e" to the NUS-48E dataset.

src | ref (not the entire pool) | knn-vc | knn-svc w/o {AS, OPT} | NeuCoSVC | knn-svc w/o {OPT} | knn-svc
ls → ls--
ls → ls--
ls → ls--
os → os--
os → os--
os → nus48e--
os → nus48e--


Duration Study

In this section, we provide examples for the duration study, which was left out of the paper due to the length limit. Each duration in the table is the length of the reference audio.
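As a rough illustration of what a duration-limited reference could look like, the sketch below trims a pool of reference clips to a fixed time budget. The function name `truncate_pool`, its parameters, and the in-order trimming strategy are all assumptions made for illustration; this page does not say how the references were actually shortened.

```python
def truncate_pool(clips, sample_rate, max_seconds):
    """Hypothetical helper: keep clips in order until the pool holds
    max_seconds of audio, trimming the last clip if it overshoots.

    clips: list of sample sequences (e.g. lists of floats), one per clip.
    sample_rate: samples per second shared by all clips.
    max_seconds: total duration budget for the reference pool.
    """
    budget = int(round(max_seconds * sample_rate))  # remaining samples allowed
    pool = []
    for clip in clips:
        if budget <= 0:
            break
        kept = clip[:budget]  # whole clip, or only the part that still fits
        pool.append(kept)
        budget -= len(kept)
    return pool


# Toy example: three 3-second clips at 10 samples per second.
toy_clips = [[0.0] * 30, [0.0] * 30, [0.0] * 30]
five_second_pool = truncate_pool(toy_clips, sample_rate=10, max_seconds=5)
full_pool = truncate_pool(toy_clips, sample_rate=10, max_seconds=600)
```

With a budget larger than the available audio, as in the "full" column, the pool is returned unchanged.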

src | ref (not the entire pool) | 5s | 10s | 30s | 60s | 90s | full (~600s)
NeuCoSVC os → os
knn-svc os → os
NeuCoSVC os → os
knn-svc os → os
NeuCoSVC os → nus
knn-svc os → nus
NeuCoSVC os → nus
knn-svc os → nus


Bonus

In this section, we try knn-svc on several real use cases.

src (English Male Speaker)
ref (English Female Speaker)
knn-vc converted
knn-svc converted

src (French Male Singer)
ref (English Male Singer (with guitar))
knn-svc converted

src (Japanese Male Singer (with vocal extraction))
ref (Japanese Female Singer (with vocal extraction))
knn-svc converted