Voice Conversion Sound Samples

VTLN-Based Voice Conversion

Vocal tract length normalization (VTLN) is a speaker normalization technique that is well studied in speech recognition. It compensates for speaker-dependent vocal tract lengths by warping the frequency axis of the magnitude spectrum. Recently, I applied this technique to voice conversion, where it generates several voices from one source voice by varying the warping factor and prosody-related parameters. In a publication at Interspeech'05, I investigated the application of VTLN-based voice conversion to a speech synthesizer that has to generate a few clearly distinguishable voices (as required in computer games, radio plays, and the like). I observed that voice characteristics such as gender or age can be manipulated using this technique.
In the following example, a female source voice is modified by setting the warping factor a of a piece-wise linear warping function and the fundamental frequency ratio r. The source voice itself corresponds to the default values a=1.0 and r=1.00.

					r
		0.50	0.63	0.79	1.00	1.26	1.59	2.00
	0.7	x	x	x	x	x	x	x
	0.8	x	x	x	x	x	x	x
	0.9	x	x	x	x	x	x	x
a	1.0	x	x	x	x	x	x	x
	1.1	x	x	x	x	x	x	x
	1.2	x	x	x	x	x	x	x
	1.3	x	x	x	x	x	x	x
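
To make the two parameters in the table more concrete, here is a minimal Python sketch of a piece-wise linear warping function of the kind commonly used for VTLN. The function names, the breakpoint rule (`knee`), and the reading of the r values as third-of-an-octave steps are assumptions for illustration; they are not taken from the Interspeech'05 paper.

```python
import math


def warp_piecewise_linear(omega, a, knee=7.0 / 8.0):
    """Map a normalized frequency omega in [0, pi] to its warped value.

    Below the breakpoint the frequency axis is scaled by the warping
    factor a. Above it, a second linear segment returns the curve to pi,
    so the full band [0, pi] is mapped onto itself.
    The breakpoint rule (knee) is an assumption made for this sketch.
    """
    if not 0.0 <= omega <= math.pi:
        raise ValueError("omega must lie in [0, pi]")
    # Breakpoint: knee * pi for a <= 1; divided by a otherwise, so that
    # a * omega0 never exceeds pi.
    omega0 = knee * math.pi if a <= 1.0 else knee * math.pi / a
    if omega <= omega0:
        return a * omega
    slope = (math.pi - a * omega0) / (math.pi - omega0)
    return a * omega0 + slope * (omega - omega0)


def f0_ratios(steps_per_octave=3, octaves=1):
    """Fundamental frequency ratios spaced evenly on a log2 scale."""
    n = steps_per_octave * octaves
    return [round(2.0 ** (k / steps_per_octave), 2) for k in range(-n, n + 1)]


if __name__ == "__main__":
    # Warping curve for each value of a used in the table rows.
    for a in (0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3):
        samples = [warp_piecewise_linear(w * math.pi / 4, a) for w in range(5)]
        print(a, [round(s, 3) for s in samples])
    # Column headers of the table, regenerated from the log2 spacing.
    print(f0_ratios())
```

With a=1.0 the function is the identity, which corresponds to the unmodified source voice. The fundamental frequency ratio r is applied separately to the pitch contour and does not enter the warping function.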

Text-Independent Cross-Language Voice Conversion

So far, most voice conversion training algorithms require parallel training data from the source and target speaker, i.e., both speakers' utterances are based on the same text (text-dependent training). Often, this requirement is inconvenient or impossible to meet. For instance, when the source and target speakers use different languages (cross-language voice conversion), parallel training data can only be produced if at least one of the speakers speaks both the source and the target language.
In a paper published at ICASSP'06, I presented a mapping algorithm based on unit selection (a technique well known from speech synthesis) that uses non-parallel training data and can be applied to cross-language voice conversion.
In the following example, English female and male source voices (f1 and m1) are converted to sound like a given Spanish female or male target voice (f2 and m2) using the technique described above.

		f2	m2
		f2a	m2a
		f2b	m2b
		f2c	m2c
	f1a	f1a-f2	f1a-m2
f1	f1b	f1b-f2	f1b-m2
	f1c	f1c-f2	f1c-m2

	m1a	m1a-f2	m1a-m2
m1	m1b	m1b-f2	m1b-m2
	m1c	m1c-f2	m1c-m2
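
The idea of pairing non-parallel data by unit selection can be sketched as a dynamic-programming search: for every source frame, one frame is picked from a pool of target-speaker frames so that the sum of a target cost (distance to the source frame) and a concatenation cost (distance between consecutive selected frames) is minimal. The squared Euclidean distance, the weight `alpha`, and all names below are assumptions for illustration; the sketch is not meant to reproduce the exact formulation of the ICASSP'06 paper.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def select_units(source, pool, alpha=0.5):
    """Return one pool index per source frame (Viterbi search).

    The path minimizes
        alpha * target_cost + (1 - alpha) * concatenation_cost
    summed over the whole source sequence.
    """
    if not source or not pool:
        raise ValueError("source and pool must be non-empty")
    n = len(pool)
    # cost[j]: best accumulated cost of any path ending in pool[j].
    cost = [alpha * sq_dist(source[0], y) for y in pool]
    back = []  # back[t][j]: best predecessor of pool[j] at step t + 1
    for x in source[1:]:
        new_cost, pointers = [], []
        for y in pool:
            # Cheapest way to arrive at y from any previous unit.
            trans = [cost[i] + (1.0 - alpha) * sq_dist(pool[i], y)
                     for i in range(n)]
            i_best = min(range(n), key=trans.__getitem__)
            pointers.append(i_best)
            new_cost.append(trans[i_best] + alpha * sq_dist(x, y))
        cost = new_cost
        back.append(pointers)
    # Trace the best path back from the cheapest final unit.
    j = min(range(n), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path


if __name__ == "__main__":
    source_frames = [[0.0], [1.0], [2.0]]  # toy one-dimensional features
    target_pool = [[2.1], [0.1], [0.9]]
    print(select_units(source_frames, target_pool, alpha=1.0))
```

The selected pairs of source and target frames could then serve as aligned training data for a conventional conversion function. With `alpha=1.0` the search reduces to a nearest-neighbour lookup per frame; smaller values favour smoother sequences of target frames. The search is quadratic in the pool size, so a real system would need pruning.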

Middle High German Voice-Converted

As part of a project on German medieval culture, the multimedia group at RWTH Aachen is building a feature in which visitors speak an arbitrary utterance in their mother tongue into a microphone and, after a few seconds, hear their own voice reciting a Middle High German poem. The technology combines the approaches described above. An example is given here:

	m2f	f2m
target	f	m
source	m	f
s2t	m2f	f2m