Voice Conversion Sound Samples
VTLN-Based Voice Conversion
Vocal tract length normalization (VTLN) is a speaker normalization technique well studied in speech recognition. It compensates for
speaker-dependent vocal tract lengths by warping the frequency axis of the magnitude spectrum. Recently, I applied this technique to
voice conversion, where it is used to generate several voices from one source voice by changing the warping factor and prosody-related parameters.
In a publication at Interspeech'05, I investigated the application of
VTLN-based voice conversion to a speech synthesizer that is to generate
several clearly distinguishable voices (as required in computer games, radio plays, and the like). We observed that voice characteristics
such as gender or age can be manipulated using this technique.
In the following example, a female source voice is changed by setting the warping factor a of a piecewise
linear warping function and the fundamental frequency ratio r. The unmodified source voice corresponds to the default values
a=1.0 and r=1.00.
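To make the role of the warping factor a more concrete, here is a minimal sketch of a piecewise linear warping function applied to a magnitude spectrum. It is an illustration under assumptions, not necessarily the exact formulation of the Interspeech'05 paper: the breakpoint omega0 (set here to 7/8 of pi), the function names, and the convention that a > 1 moves spectral features toward higher frequencies are all choices made for this example.

```python
import numpy as np


def piecewise_linear_warp(omega, a, omega0=0.875 * np.pi):
    """Map normalized frequencies in [0, pi] to warped frequencies in [0, pi].

    Below the breakpoint omega0 (an assumed value), frequencies are scaled by a.
    Above it, a second linear segment is chosen so that pi is mapped to pi.
    """
    omega = np.asarray(omega, dtype=float)
    if not 0.0 < a * omega0 < np.pi:
        raise ValueError("a * omega0 must lie strictly between 0 and pi")
    upper_slope = (np.pi - a * omega0) / (np.pi - omega0)
    return np.where(
        omega <= omega0,
        a * omega,
        a * omega0 + upper_slope * (omega - omega0),
    )


def warp_magnitude_spectrum(mag, a, omega0=0.875 * np.pi):
    """Warp the frequency axis of a magnitude spectrum sampled on [0, pi]."""
    mag = np.asarray(mag, dtype=float)
    omega = np.linspace(0.0, np.pi, len(mag))
    warped_positions = piecewise_linear_warp(omega, a, omega0)
    # The warp is monotonically increasing, so the warped spectrum can be
    # resampled onto the original frequency grid by linear interpolation.
    return np.interp(omega, warped_positions, mag)


# Example: a single spectral peak at bin 64 of 257 bins.
bins = np.arange(257)
spectrum = np.exp(-0.5 * ((bins - 64) / 3.0) ** 2)
unchanged = warp_magnitude_spectrum(spectrum, a=1.0)
shifted = warp_magnitude_spectrum(spectrum, a=1.1)
```

With a=1.0 the spectrum is left as it is, which matches the default setting of the source voice. In a complete system, the fundamental frequency contour would additionally be multiplied by the ratio r; that step is omitted here.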
Text-Independent Cross-Language Voice Conversion
So far, most voice conversion training algorithms require parallel training data from the source and target speaker, i.e., both
speakers' utterances are based on the same text (text-dependent training). Often, this requirement is inconvenient or impossible to meet. For
instance, when source and target speaker use different languages (cross-language voice conversion), parallel training data can only be
produced if at least one of the speakers speaks both the source and the target language.
In a paper published at ICASSP'06, I presented a mapping algorithm based on unit selection (a technique well known from
speech synthesis) that uses non-parallel training data and can be applied to cross-language voice conversion.
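The general idea of unit selection on non-parallel data can be sketched as follows. This is a simplified illustration, not necessarily the cost function or search used in the ICASSP'06 paper: it assumes that speech is represented as sequences of feature vectors, that both costs are squared Euclidean distances, and that a hypothetical weight alpha balances them. For each source frame, one frame from the target speaker's data is selected such that it is close to the source frame (target cost) and fits its selected predecessor (concatenation cost); the best sequence is found by dynamic programming.

```python
import numpy as np


def select_units(source, target_pool, alpha=0.5):
    """Return indices into target_pool, one per source frame.

    source:      array of shape (n_frames, dim), source speaker features
    target_pool: array of shape (n_units, dim), target speaker features
    alpha:       assumed weight; 1.0 uses only the target cost,
                 0.0 uses only the concatenation cost
    """
    source = np.asarray(source, dtype=float)
    pool = np.asarray(target_pool, dtype=float)
    n_frames, n_units = len(source), len(pool)

    # target_cost[k, j]: squared distance between source frame k and unit j
    target_cost = ((source[:, None, :] - pool[None, :, :]) ** 2).sum(axis=2)
    # concat_cost[i, j]: squared distance between consecutive units i and j
    concat_cost = ((pool[:, None, :] - pool[None, :, :]) ** 2).sum(axis=2)

    # Viterbi search over all unit sequences.
    best = alpha * target_cost[0]
    backptr = np.zeros((n_frames, n_units), dtype=int)
    for k in range(1, n_frames):
        total = best[:, None] + (1.0 - alpha) * concat_cost
        backptr[k] = total.argmin(axis=0)
        best = total.min(axis=0) + alpha * target_cost[k]

    # Trace the cheapest path back from the last frame.
    path = [int(best.argmin())]
    for k in range(n_frames - 1, 0, -1):
        path.append(int(backptr[k, path[-1]]))
    return path[::-1]


# Toy example with two source frames and three target units.
toy_source = [[0.0, 0.0], [1.0, 1.0]]
toy_pool = [[0.1, 0.0], [5.0, 5.0], [0.9, 1.0]]
nearest = select_units(toy_source, toy_pool, alpha=1.0)
```

Because the selection only compares feature vectors, it needs no shared text between the speakers, which is what makes this kind of approach usable when source and target speak different languages.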
In the following example, an English female and a male source voice (f1 and m1) are converted to
sound like a given Spanish female and male target voice (f2 and m2), respectively, using the technique described above.
Middle High German Voice-Converted
In the framework of a project on German medieval culture,
the multimedia group at RWTH Aachen is building a feature in which visitors
speak an arbitrary utterance in their mother tongue into a microphone and, after a few seconds, hear their own voice speaking a Middle
High German poem. The technology is based on a combination of the aforementioned approaches. An example is given here:
(c) 2006-10-24 by david@suendermann.com