Voice Conversion Sound Samples

VTLN-Based Voice Conversion

Vocal tract length normalization (VTLN) is a speaker normalization technique that is well studied in speech recognition. It aims to compensate for speaker-dependent vocal tract lengths by warping the frequency axis of the magnitude spectrum. Recently, I applied this technique to voice conversion, where it serves to generate several voices from one source voice by changing warping factors and prosody-related parameters. In a publication at Interspeech'05, I investigated the application of VTLN-based voice conversion to a speech synthesizer that has to generate a few clearly distinguishable voices (as required in computer games, radio plays, and the like). I observed that voice characteristics such as gender or age can be manipulated using this technique.
In the following example, a female source voice is changed by setting the warping factor a of a piecewise linear warping function and the fundamental frequency ratio r. The unmodified source voice corresponds to the default values a=1.0 and r=1.00.

                           r
        0.50  0.63  0.79  1.00  1.26  1.59  2.00
    0.7   x     x     x     x     x     x     x
    0.8   x     x     x     x     x     x     x
    0.9   x     x     x     x     x     x     x
a   1.0   x     x     x     x     x     x     x
    1.1   x     x     x     x     x     x     x
    1.2   x     x     x     x     x     x     x
    1.3   x     x     x     x     x     x     x
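
To make the two parameters more concrete, here is a minimal sketch of a piecewise linear warping function in Python with NumPy. It is not the code behind the samples above. The breakpoint at 7/8 of the frequency band, the warping direction, and all function names are assumptions chosen for illustration. The listed r values are consistent with steps of one third of an octave, which the sketch reproduces.

```python
import numpy as np

def piecewise_linear_warp(omega, alpha, omega0=None):
    """Warp normalized frequencies omega in [0, pi] with warping factor alpha.

    Below the breakpoint omega0 the frequency axis is scaled by alpha;
    above it, a second linear segment leads to (pi, pi) so that the
    band limits are preserved.
    """
    omega = np.asarray(omega, dtype=float)
    if omega0 is None:
        # Assumed breakpoint: 7/8 of the band, pulled in for alpha > 1
        # so that alpha * omega0 stays below pi.
        omega0 = 7.0 / 8.0 * np.pi * min(1.0, 1.0 / alpha)
    lower = alpha * omega
    slope = (np.pi - alpha * omega0) / (np.pi - omega0)
    upper = alpha * omega0 + slope * (omega - omega0)
    return np.where(omega <= omega0, lower, upper)

def warp_magnitude_spectrum(mag, alpha):
    """Resample a magnitude spectrum (bins from 0 to pi) along the warped axis.

    Assumed convention: the source value at omega appears at the warped
    frequency in the output, which is then read on the uniform grid.
    """
    mag = np.asarray(mag, dtype=float)
    omega = np.linspace(0.0, np.pi, len(mag))
    warped = piecewise_linear_warp(omega, alpha)
    return np.interp(omega, warped, mag)

# Fundamental frequency ratios in third-octave steps: 2**(k/3), k = -3..3
F0_RATIOS = 2.0 ** (np.arange(-3, 4) / 3.0)

if __name__ == "__main__":
    print(np.round(F0_RATIOS, 2))
    grid = np.linspace(0.0, np.pi, 5)
    print(piecewise_linear_warp(grid, 1.2))
```

With a=1.0 the warp is the identity, which matches the unmodified source voice in the table. Whether a value above 1.0 stretches or compresses the spectrum depends on the direction convention, which is assumed here.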



Text-Independent Cross-Language Voice Conversion

So far, most voice conversion training algorithms require parallel training data from the source and target speaker, i.e., both speakers' utterances are based on the same text (text-dependent training). Often, this requirement is inconvenient or cannot be met. For instance, when source and target speaker use different languages (cross-language voice conversion), parallel training data can only be produced if at least one of the speakers speaks both the source and the target language.
In a paper published at ICASSP'06, I presented a mapping algorithm based on unit selection (a technique well known from speech synthesis) that uses non-parallel training data and can be applied to cross-language voice conversion.
In the following example, an English female and an English male source voice (f1 and m1) are converted to sound like a given Spanish female or male target voice (f2 and m2, respectively) using the technique described above.

                  f2        m2
                  f2a       m2a
                  f2b       m2b
                  f2c       m2c
        f1a       f1a-f2    f1a-m2
f1      f1b       f1b-f2    f1b-m2
        f1c       f1c-f2    f1c-m2

        m1a       m1a-f2    m1a-m2
m1      m1b       m1b-f2    m1b-m2
        m1c       m1c-f2    m1c-m2
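
The idea of a unit-selection mapping on non-parallel data can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm from the paper: each source frame is assigned a target frame by minimizing a weighted sum of a target cost (distance between source and target features) and a concatenation cost (distance between consecutive selected target frames) with dynamic programming. The Euclidean distances, the weight w, and the function name select_units are hypothetical choices.

```python
import numpy as np

def select_units(source, target, w=0.5):
    """Pick one target frame index for every source frame.

    source: array of shape (M, D), feature vectors of the source speaker
    target: array of shape (N, D), feature vectors of the target speaker
    w:      weight of the target cost; (1 - w) weights the concatenation cost
    Returns an integer array of length M with indices into target.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    # Target cost: distance between each source frame and each target frame.
    tc = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=2)
    # Concatenation cost: distance between any two target frames.
    cc = np.linalg.norm(target[:, None, :] - target[None, :, :], axis=2)
    num_src, num_tgt = tc.shape

    cost = w * tc[0]
    back = np.zeros((num_src, num_tgt), dtype=int)
    for m in range(1, num_src):
        # total[i, j]: best cost so far when moving from unit i to unit j
        total = cost[:, None] + (1.0 - w) * cc
        back[m] = np.argmin(total, axis=0)
        cost = total[back[m], np.arange(num_tgt)] + w * tc[m]

    # Trace the cheapest path back from the last frame.
    path = np.empty(num_src, dtype=int)
    path[-1] = int(np.argmin(cost))
    for m in range(num_src - 1, 0, -1):
        path[m - 1] = back[m, path[m]]
    return path

if __name__ == "__main__":
    src = np.array([[0.0], [10.0], [0.0]])
    tgt = np.array([[0.0], [1.0], [10.0]])
    print(select_units(src, tgt, w=1.0))
```

With w=1.0 the selection reduces to a nearest-neighbour search per frame, and with w=0.0 only continuity among the selected target frames matters. The selected frame pairs could then serve as aligned training data for a conventional conversion function.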



Middle High German Voice-Converted

As part of a project on German medieval culture, the multimedia group at RWTH Aachen is building a feature in which visitors speak an arbitrary utterance in their mother tongue into a microphone and, after a few seconds, hear their own voice reciting a Middle High German poem. The technology is based on a combination of the aforementioned approaches. An example is given here:

          m2f    f2m
target    f      m
source    m      f
s2t       m2f    f2m


(c) 2006-10-24 by david@suendermann.com