-
Pitch Tracking
Pitch tracking (aka pitch or epoch marking) is the process of segmenting
speech signals into frames in sync with their underlying periodicity.
This segmentation is of utmost importance to multiple speech processing
tasks such as speech synthesis, voice conversion, and emotion
recognition. This project is to explore algorithms available as
open-source software and to look into designing new algorithms for this
task. The performance of the algorithms is to be compared on a
benchmark data set containing manual segmentation labels.
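As an illustration of one classical approach, the sketch below estimates
the fundamental frequency of a single frame via the autocorrelation
method; the function name, thresholds, and search range are hypothetical,
not taken from any of the open-source packages under study.

    # A minimal sketch of autocorrelation-based pitch estimation for one
    # frame; names and thresholds are illustrative, not from a toolbox.
    import numpy as np

    def estimate_pitch(frame, fs, fmin=50.0, fmax=400.0):
        """Return an F0 estimate in Hz for one frame, or None if unvoiced."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lag_min = int(fs / fmax)                  # shortest plausible period
        lag_max = min(int(fs / fmin), len(ac) - 1)
        if lag_max <= lag_min:
            return None
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        # Crude voicing decision: peak must be strong relative to energy.
        if ac[lag] < 0.3 * ac[0]:
            return None
        return fs / lag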
-
Speech Synthesis
Text-to-speech synthesis is the automatic conversion of written text to
audible speech. State-of-the-art speech synthesis techniques comprise
unit selection synthesis, hidden-Markov-model-based synthesis, and the
MBROLA algorithm. The interactive voice response system HALEF at the
Spoken Dialog Systems Research Center uses the open-source unit
selection synthesizer FreeTTS, whose quality, as informal listening
tests suggest, is not considered adequate. This project is to enhance
the speech quality produced by the current FreeTTS implementation.
Furthermore, it is to explore open-source alternatives to FreeTTS, e.g.,
by considering participants and results of the Blizzard Challenge, an
annual speech synthesis competition comparing the performance of dozens
of commercial and non-commercial synthesizers available in the field.
-
Statistical Language Modeling and Understanding in Spoken Dialog Systems
The interactive voice response system HALEF (Help Agent:
Language-Enabled and Free) developed at DHBW's Spoken Dialog Systems
Research Center is an open-source, distributed, and
industry-standard-compliant solution. It comprises tools developed at
DHBW as well as at other institutions including Carnegie Mellon
University, Sun Microsystems, and Darmstadt University of Technology.
HALEF is able to talk to callers via regular phone lines, smartphones,
or VoIP clients. To bring the system to the next level, HALEF is to be
equipped with statistical language models and understanding, a feature
which few commercial implementations support to date. This feature
enables the system to process a large variety of user inputs while
achieving high recognition performance. To this end, algorithms for
semantic classification and parsing need to be implemented and tested on
openly available data sets such as Carnegie Mellon University's Let's Go
Corpus.
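To give a flavor of semantic classification, here is a minimal sketch of
a bag-of-words intent classifier using scikit-learn; the utterances and
class labels are invented, standing in for transcribed and annotated
corpus data such as Let's Go.

    # A minimal sketch of statistical semantic classification: mapping
    # utterances to semantic classes. Training examples are invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    utterances = ["when is the next bus to downtown",
                  "i want to go to the airport",
                  "repeat that please",
                  "start over"]
    labels = ["schedule_query", "destination", "repeat", "restart"]

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    clf.fit(utterances, labels)
    print(clf.predict(["next bus to the airport please"]))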
-
Detecting Alcohol Intoxication in Speech
The Munich Alcohol Language Corpus (ALC) contains speech from persons in
both intoxicated and sober states. In order to classify whether a
person is intoxicated or not, several combinations of classifiers and
feature extraction approaches are to be examined, including acoustic and
textual features. Among other things, it is to be investigated how the
word error rate and confidence scores produced by a speech recognizer
applied to the input speech correlate with classification accuracy.
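As a minimal sketch of the latter analysis, one could measure the
point-biserial correlation between per-utterance recognizer confidence
and the binary intoxication label; all values below are invented
placeholders for results obtained on ALC.

    # Checking how recognizer confidence relates to the intoxication label.
    import numpy as np
    from scipy.stats import pointbiserialr

    # 1 = intoxicated, 0 = sober (hypothetical labels per utterance)
    intoxicated = np.array([1, 1, 0, 0, 1, 0, 1, 0])
    # Mean ASR confidence per utterance (hypothetical values)
    confidence = np.array([0.61, 0.55, 0.82, 0.90, 0.58, 0.79, 0.64, 0.88])

    r, p = pointbiserialr(intoxicated, confidence)
    print(f"point-biserial r = {r:.2f}, p = {p:.3f}")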
-
Voice Conversion: Development of an Open-Source Toolbox and Application
to Standard Databases
Voice conversion is the transformation of a source speaker's voice to
sound like that of a different speaker (the target speaker). An
open-source Octave toolbox recently created at the Spoken Dialog Systems
Research Center is to be enhanced by support for shimmer and jitter
synthesis, advanced prosodic matching, as well as linear
transformation-based conversion. To demonstrate the effectiveness of the
enhancements, the toolbox is to be tested on freely available standard
databases (from the Oregon Graduate Institute, Carnegie Mellon
University, or others).
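To illustrate linear transformation-based conversion, the sketch below
learns a least-squares mapping between time-aligned source and target
feature frames; the random matrices merely stand in for aligned MFCC
data, and since the toolbox itself is written in Octave this Python
version is only schematic.

    # Learn a matrix W mapping aligned source features to target features
    # by least squares. Feature matrices are random placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 13))                 # source frames (N x dim)
    Y = X @ rng.normal(size=(13, 13)) + 0.1 * rng.normal(size=(500, 13))

    # Append a bias column, then solve min_W ||Xb W - Y||^2.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)

    converted = Xb @ W                             # converted source features
    print("RMSE:", np.sqrt(np.mean((converted - Y) ** 2)))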
-
YouTube Closed Captions
Given the humongous amount of video data stored on YouTube, it is
becoming increasingly interesting to effectively search or mine video
contents. For example, it could be of interest to see which broadcast
news shows report on the Cannes film festival or which part of a lecture
covers the Fourier transform. To this end, an open-source speech
recognizer is to be adapted to process large quantities of video data,
providing transcriptions and time stamps which can then be exported to
YouTube as closed captions. The accuracy of the solution is to be
optimized for transcriptions of lectures held at DHBW. Furthermore, the
recognizer should be able to process multiple languages.
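The export step can be illustrated with a short sketch that writes
recognizer output as an SRT caption file, one of the formats YouTube
accepts for closed captions; the segments are invented examples.

    # Turn (start, end, text) segments into an SRT caption file.
    def srt_timestamp(t):
        ms = int(round(t * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    segments = [(0.0, 2.4, "Welcome to the lecture."),
                (2.4, 6.1, "Today we cover the Fourier transform.")]

    with open("captions.srt", "w", encoding="utf-8") as f:
        for i, (start, end, text) in enumerate(segments, 1):
            f.write(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
                    f"{text}\n\n")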
-
Reverb Challenge
One problem for automatic speech recognition (ASR) in real-world
applications is the difficult acoustic conditions of the room in which
the speaker is located. In particular, reverberation can lead to a
substantial reduction in ASR performance, which is why there has been
increasing research interest in reverberant speech signal processing
over the past few years. This year, the international Reverb Challenge
was organized to systematically compare state-of-the-art techniques used
in the field. This project is to benchmark the ASR technology used at
the Spoken Dialog Systems Research Center in conjunction with the
reverberation reduction algorithms provided by a partner company.
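To make the problem concrete, the sketch below simulates reverberant
speech by convolving a clean signal with a room impulse response (RIR);
the exponentially decaying noise used as the RIR is a crude stand-in for
a measured one, and the "speech" is a random placeholder.

    # Simulating reverberation: convolve clean speech with an RIR.
    import numpy as np
    from scipy.signal import fftconvolve

    fs = 16000
    rng = np.random.default_rng(1)
    speech = rng.normal(size=fs)                      # placeholder for speech
    t = np.arange(int(0.5 * fs)) / fs                 # 0.5 s impulse response
    rir = rng.normal(size=t.size) * np.exp(-t / 0.1)  # decaying reverb tail
    rir /= np.max(np.abs(rir))

    reverberant = fftconvolve(speech, rir)[:speech.size]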
-
Avatar
Avatars are virtual human characters used for instance on websites to
embody a person's alter ego. In addition to a graphical representation,
avatars can be equipped with natural language capabilities by way of a
spoken dialog system. This project is to provide an open-source avatar
running e.g. as a Java applet incorporating functionality provided by
DHBW's spoken dialog system Halef. In doing so, components from USC's
Virtual Humans Group as well as the newest W3C standards on multimodal
interaction are to be taken into account.
-
Speech Recognition in the Age of Cloud Computing and Ubiquitous Internet
At a time when smartphones and computation in the cloud belong to
everybody's daily vocabulary, speech recognition is witnessing an
astonishing revival. Voice search, voice operation, self-service
agents, and ubiquitous speech processing are hot topics in today's
human-machine interface landscape. But how has the explosion of
computational power, internet connection speed, and amount of available
training data affected the performance of speech recognizers?
This project is to compare a multitude of different speech recognizers
across several dimensions by running extensive recognition batch tests
based on hundreds of thousands of test utterances. Dimensions of
particular interest include
- recognition performance (word error rate; see the sketch below),
- recognition speed,
- footprint (memory, hard disk space),
- platform (desktop/server/smartphone/cloud), and
- license (commercial/free-of-charge/open-source).
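For the first dimension, word error rate is the edit (Levenshtein)
distance between reference and hypothesis word sequences, normalized by
the reference length, as in the following sketch:

    # Word error rate via dynamic programming over word sequences.
    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j]: edit distance between first i ref and j hyp words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("how may i help you today",
                          "how may help you to day"))   # 0.5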
-
Development of an Open-Source Voice Browser Prototype
A web browser communicates with a human user by means of keyboard,
mouse, camera, etc. (input) and screen, loudspeaker, etc. (output),
interpreting the contents of HTML pages. Similarly, a voice browser is
software that communicates with a human user by means of voice (input
and output), interpreting the contents of VoiceXML pages. Such pages can
contain instructions on what the browser is supposed to say (e.g., "How
may I help you today?") and how to handle a human's speech input (e.g.,
"I would like to buy a heavy metal guitar"); a minimal interpretation
sketch follows the list below. In this function, voice browsers serve as
interface between
(a) the speech recognizer,
(b) the text-to-speech synthesizer,
(c) the telephony network, and
(d) the web server.
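The following sketch illustrates the core loop such a browser
implements: parse a toy VoiceXML page, render each prompt, and collect
the caller's input; synthesizer and recognizer are replaced by console
I/O here, and the page is a simplified, namespace-free example.

    # Toy VoiceXML interpretation: speak prompts, fill fields.
    import xml.etree.ElementTree as ET

    PAGE = """<vxml version="2.1">
      <form>
        <field name="request">
          <prompt>How may I help you today?</prompt>
        </field>
      </form>
    </vxml>"""

    root = ET.fromstring(PAGE)
    for field in root.iter("field"):
        prompt = field.find("prompt")
        print("TTS:", prompt.text)           # would go to the synthesizer
        utterance = input("ASR> ")           # would come from the recognizer
        print(f"filled field {field.get('name')!r} with {utterance!r}")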
Voice browsers are essential in commercial voice-user interaction
systems (aka spoken dialog systems) processing billions of calls every
week. As a consequence, voice browsers are mostly proprietary software
packages developed by specialized software companies.
This project is to develop an open-source prototype of a voice browser
interacting with open-source components for (a) to (d), for example from
Carnegie Mellon University (a to c) or Apache (d). Foundations for this
have been laid in previous research projects, which established a
distributed and virtualized infrastructure with speech recognizer and
synthesizer.
-
Emotion Analysis of Speech in Human-Machine Phone Conversations
Many customer service interactions are nowadays carried out by spoken
dialog systems (SDSs) replacing the role of a human agent. Unlike the
latter, SDSs are generally unable to tell when a caller gets
frustrated. This is one of the main reasons why callers usually dislike
speaking to an SDS rather than to a live agent.
The purpose of this project is to analyze a variety of features
(acoustic features, call history, speech recognition and understanding
hypotheses, confidence values, and so on) in an attempt to predict the
emotional state of the call or the caller. The envisioned emotion
predictor could cause a call to be escalated to a human agent when
severe frustration is detected.
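A minimal sketch of such a predictor: a logistic regression over a mix
of acoustic and dialog features with an escalation rule on the predicted
frustration probability; all feature names, values, and the threshold
are invented for illustration.

    # Frustration prediction from mixed per-turn features (invented data).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Columns: mean pitch (Hz), speaking rate (words/s), ASR confidence,
    # number of re-prompts so far
    X = np.array([[210, 3.1, 0.92, 0],
                  [250, 4.0, 0.55, 2],
                  [205, 2.9, 0.88, 1],
                  [265, 4.3, 0.40, 3]])
    y = np.array([0, 1, 0, 1])              # 1 = frustrated (hand-labeled)

    model = LogisticRegression(max_iter=1000).fit(X, y)
    p_frustrated = model.predict_proba([[255, 4.1, 0.50, 2]])[0, 1]
    if p_frustrated > 0.8:                  # escalation threshold is a guess
        print("escalate to human agent")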
-
Reverse-Engineering Siri's Spoken Language Understanding Component
Siri is the iPhone's voice control assistant which is able to understand
users' natural language queries, execute them, and give a spoken
response. Voice control assistants such as Siri feature a spoken
language understanding (SLU) component that uses the text output of a
speech recognizer and extracts semantic entities which are then sent to
a dialog manager for execution.
This study work is to engineer an open-source SLU component for a voice
control assistant. In doing so, several concepts including rule-based
semantic grammars, semantic classification, named-entity tagging, and
semantic parsing are to be compared. Using speech recognition and
synthesis infrastructure provided by former research projects, the
proposed SLU components can be tested in a real-world environment.
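As a taste of the rule-based end of this spectrum, the sketch below
implements a tiny regex-based semantic grammar mapping an utterance to
an intent plus slots; patterns, intents, and slot names are invented.

    # A toy rule-based semantic grammar: intent plus slot extraction.
    import re

    RULES = [
        (re.compile(r"\b(call|dial)\s+(?P<contact>\w+)"), "make_call"),
        (re.compile(r"\bset (an? )?alarm (for|at)\s+(?P<time>[\w: ]+)"),
         "set_alarm"),
        (re.compile(r"\bwhat('s| is) the weather( in (?P<city>\w+))?"),
         "weather"),
    ]

    def understand(utterance):
        for pattern, intent in RULES:
            m = pattern.search(utterance.lower())
            if m:
                slots = {k: v for k, v in m.groupdict().items() if v}
                return intent, slots
        return "unknown", {}

    print(understand("Please call Alice"))
    print(understand("What is the weather in Stuttgart"))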
-
Holmes: Reverse-Engineering Watson
In recent years, a lot of progress has been made towards modeling the
human capacity of answering open-ended questions. The most prominent
example is certainly IBM's Watson, which successfully competed with
former champions of the quiz show Jeopardy! on U.S. television. Even
though Watson's performance is beyond doubt, it is a commercial product
whose architecture and underlying data are not available for
exploitation by the academic community.
This project is to establish an open-source text-based question
answering (QA) system with an initially limited scope, along with a
performance benchmark. Due to the restriction to a specific domain
(e.g., preparation for exams at DHBW), the initial system will exhibit
reasonable performance, which is subject to improvement by way of
- continuous data collection,
- application testing by DHBW's student body as well as crowdsourcing, and
- adoption of increasingly sophisticated QA techniques suited to the domain.
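A retrieval-based baseline illustrates how such a system could start
out: answer a question with the most similar passage under TF-IDF cosine
similarity. The three-passage "knowledge base" below is invented; a real
system would index course material for the chosen domain.

    # A retrieval-based QA baseline over a toy passage collection.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    passages = [
        "The Fourier transform decomposes a signal into frequency components.",
        "Hidden Markov models are used for acoustic modeling in recognition.",
        "Unit selection synthesis concatenates recorded speech segments.",
    ]

    vectorizer = TfidfVectorizer()
    P = vectorizer.fit_transform(passages)

    def answer(question):
        q = vectorizer.transform([question])
        return passages[cosine_similarity(q, P).argmax()]

    print(answer("What does the Fourier transform do?"))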
-
Interrogator
Have you ever felt a little unprepared in an oral exam, facing the
mercilessness of your almighty professor, his highness? Or faced a
police officer interrogating you about why you were messing around with
an allegedly stolen vehicle? What about the U.S. border protection
personnel in whose presence one should always show the right amount of
diligence in responding to their questions?
The project at hand (Interrogator) is to build a spoken dialog system
(SDS) that engages you in a conversation as stressful as those described
above. The Interrogator is to prepare you for these real-world cases,
providing you with the right amount of domain knowledge and boosting
your self-confidence to optimally manage said situations.
Interrogator is to be built upon the open-source SDS framework HALEF
running at DHBW Stuttgart, enhancing the baseline system by
- encoding the necessary domain models (for speech recognition and
understanding as well as dialog management),
- implementing the required expressiveness (or, better, harshness) of
the system's output voice, and
- building a tangible demonstrator system.
-
Machine Learning in the Medical Domain
Fundamental machine learning algorithms for classification, regression,
and clustering are to be applied to a medical field of the student's
choice. Possible areas include (but are not limited to)
- detection of breast cancer from digitized images,
- DNA microarray analysis for cancer classification,
- DNA repair recognition,
- DNA splicing boundary detection, and
- prediction of Parkinson's disease.
For this research, freely available corpora can be used (such as the
Wisconsin Breast Cancer Data Set or the Molecular Biology
Splice-Junction Gene Sequences Data Set).
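As a starting point, scikit-learn ships a copy of the Wisconsin breast
cancer data set, so a first classification baseline takes only a few
lines; the choice of model here is merely illustrative.

    # Cross-validated baseline on the Wisconsin breast cancer data set.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")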
-
Classification of Mental States Using EEG
At DHBW, preliminary studies classifying eye aperture (open vs. closed
eyes) from brain waves measured with an EEG have produced very promising
results. The next step is to investigate whether more complex states can
be robustly distinguished, e.g., which characters subjects are seeing on
a screen, which words they intend to pronounce, or which topic they are
thinking about. To this end, state-of-the-art feature extraction and
machine learning techniques are to be applied, and both consumer-grade
and medical-grade EEG devices are to be compared.
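One standard EEG feature such experiments rely on is band power,
computed here from Welch's power spectral density estimate; the
synthetic signal below stands in for a real EEG channel, and the
sampling rate and band edges are typical choices, not fixed by the project.

    # Band-power feature extraction from a (synthetic) EEG channel.
    import numpy as np
    from scipy.signal import welch

    fs = 256                                  # typical EEG sampling rate
    t = np.arange(10 * fs) / fs
    eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)

    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2)

    def band_power(lo, hi):
        mask = (freqs >= lo) & (freqs < hi)
        return psd[mask].sum() * (freqs[1] - freqs[0])

    # Alpha (8-13 Hz) power rises with closed eyes; usable as one entry
    # of a feature vector for the classifiers mentioned above.
    print("alpha power:", band_power(8, 13))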