DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System

Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Chunlei Zhang, Yusong Wu, Xiang Xie, Zijin Li, Dong Yu
Tencent AI Lab

Abstract

Singing voice conversion converts the timbre of a source singing voice to that of a target speaker while keeping the sung content unchanged. However, singing data for a target speaker is much harder to collect than normal speech data. In this paper, we introduce a singing voice conversion algorithm capable of generating high-quality singing in the target speaker's voice using only his/her normal speech data. First, we integrate the training and conversion processes for speech and singing into one framework by unifying the features used in standard speech synthesis and singing synthesis systems. In this way, normal speech data can also contribute to singing voice conversion training, making the system more robust, especially when the singing database is small. Moreover, to achieve one-shot singing voice conversion, a speaker embedding module is developed using both speech and singing data; it provides target speaker identity information during conversion. Experiments indicate that the proposed singing voice conversion system can convert source singing into high-quality singing in the target speaker's voice with only 20 seconds of the target speaker's enrollment speech data.


Figure: system architecture


Look-Up Table (LUT) based speaker embedding & D-vector based speaker embedding


* Note: All samples are in Mandarin Chinese.
* There are 6 in-set singers (3 male and 3 female), shown here.
* The “Reference Voice” is the target singer’s singing, provided for the timbre similarity test.


Singer | Reference Voice | LUT Sample | D-vector Sample
Female Singer1
Female Singer2
Female Singer3
Male Singer1
Male Singer2
Male Singer3
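
The two embedding types compared above differ in where the speaker vector comes from. The following is a minimal Python sketch of that difference, not the authors' implementation: a look-up table stores one vector per speaker ID seen in training, while a d-vector is computed from audio, so it can also be obtained for a speaker who was never in the training set. The names `LookupTableEmbedding`, `toy_frame_encoder`, `d_vector`, and `EMBED_DIM` are hypothetical, the toy encoder stands in for a trained speaker encoder network, and averaging frame-level outputs followed by L2 normalization is assumed here as a common d-vector recipe rather than taken from this page.

```python
import math
import random

EMBED_DIM = 4  # hypothetical embedding size, kept small for illustration


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class LookupTableEmbedding:
    """One vector per speaker ID known at training time (in-set speakers only)."""

    def __init__(self, speaker_ids, dim=EMBED_DIM, seed=0):
        rng = random.Random(seed)
        # In a real system these vectors would be learned; here they are random.
        self.table = {
            sid: [rng.gauss(0.0, 1.0) for _ in range(dim)] for sid in speaker_ids
        }

    def embed(self, speaker_id):
        # Raises KeyError for a speaker that has no entry in the table.
        return self.table[speaker_id]


def toy_frame_encoder(frame):
    """Hypothetical stand-in for a trained speaker encoder network.

    It simply keeps the first EMBED_DIM values of an acoustic frame.
    """
    return list(frame[:EMBED_DIM])


def d_vector(frames, encoder):
    """Average the per-frame encoder outputs, then L2-normalize the mean."""
    outputs = [encoder(frame) for frame in frames]
    dim = len(outputs[0])
    mean = [sum(out[i] for out in outputs) / len(outputs) for i in range(dim)]
    return l2_normalize(mean)
```

Under these assumptions, calling `embed` with a speaker ID outside the table fails, whereas `d_vector` only needs frames from a short enrollment recording. This is consistent with the out-of-set test being reported for the D-vector system only.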


Out-of-set test of D-vector based speaker embedding


* Note: All samples are in Mandarin Chinese.
* There are 4 out-of-set speakers (2 male and 2 female), shown here.
* The “Register Voice” is the target speaker’s enrollment speech, provided for the similarity test.


Speaker | Register Voice | D-vector Sample
Female Speaker1
Female Speaker2
Male Speaker1
Male Speaker2


Training with speech corpus


* Note: All samples are in Mandarin Chinese.
* There are 6 in-set speakers (3 male and 3 female), shown here.
* The “Reference Voice” is the target speaker’s speech, provided for the timbre similarity test.
* “Speech Only” means training with speech data only, while “Speech & Singing” means training with speech data plus other singers’ singing data.


Speaker | Reference Voice | Speech Only | Speech & Singing
Male Speaker1
Male Speaker2
Male Speaker3
Female Speaker1
Female Speaker2
Female Speaker3