David Suendermann-Oeft - Teaching

Projects  
 
  1. Pitch Tracking

    Pitch tracking (aka pitch marking or epoch marking) is the process of segmenting speech signals into frames in sync with their underlying periodicity. This segmentation is of utmost importance to multiple speech processing tasks such as speech synthesis, voice conversion, and emotion recognition. This project is to explore algorithms available as open-source software and to look into designing custom algorithms for this task. The performance of the algorithms is to be compared on a benchmark data set containing manual segmentation labels.
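
    As a starting point for comparison, a minimal autocorrelation-based tracker can serve as a baseline (Python/NumPy); the frame length, search range, and voicing threshold below are illustrative choices rather than recommendations:

      import numpy as np

      def autocorr_f0(frame, fs, f0_min=60.0, f0_max=400.0):
          # estimate F0 of one frame from the autocorrelation peak
          frame = frame - frame.mean()
          ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
          lo, hi = int(fs / f0_max), int(fs / f0_min)
          lag = lo + np.argmax(ac[lo:hi])
          if ac[lag] < 0.3 * ac[0]:   # weakly periodic: treat as unvoiced
              return 0.0
          return fs / lag

      def track(signal, fs, frame_ms=40, hop_ms=10):
          # frame-synchronous F0 contour over the whole signal
          frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
          return [autocorr_f0(signal[i:i + frame], fs)
                  for i in range(0, len(signal) - frame, hop)]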


  2. Speech Synthesis

    Text-to-speech synthesis is the automatic conversion of written text to audible speech. State-of-the-art speech synthesis techniques comprise unit selection synthesis, hidden-Markov-model-based synthesis, and MBROLA algorithms. The interactive voice response system HALEF at the Spoken Dialog Systems Research Center uses the open-source unit selection synthesizer FreeTTS, whose quality, as informal listening tests suggest, is not adequate. This project is to enhance the speech quality produced by the current FreeTTS implementation. Furthermore, it is to explore open-source alternatives to FreeTTS, e.g., by considering participants and results of the Blizzard Challenge, an annual speech synthesis competition comparing the performance of dozens of commercial and non-commercial synthesizers available in the field.


  3. Statistical Language Modeling and Understanding in Spoken Dialog Systems

    The interactive voice response system HALEF (Help Agent: Language-Enabled and Free) developed at DHBW's Spoken Dialog Systems Research Center is an open-source, distributed, and industry-standard-compliant solution. It comprises tools developed at DHBW as well as at other institutions including Carnegie Mellon University, Sun Microsystems, and Darmstadt University of Technology. HALEF is able to talk to callers via regular phone lines, smartphones, or VoIP clients. To bring the system to the next level, HALEF is to be equipped with statistical language models and statistical language understanding, a feature which few commercial implementations support to date. This feature enables the system to process a large variety of user inputs while achieving high recognition performance. To this end, algorithms for semantic classification and parsing need to be implemented and tested on openly available data sets such as Carnegie Mellon University's Let's Go corpus.
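
    To illustrate the semantic classification component, a minimal sketch with scikit-learn follows; the utterances and class labels are invented stand-ins for annotated data such as the Let's Go corpus:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      # hypothetical training utterances; a real system would train on
      # transcribed, semantically annotated calls
      utterances = ["i need the next bus to downtown",
                    "when does the 61c leave",
                    "no that is wrong",
                    "yes that is correct"]
      labels = ["request_schedule", "request_schedule", "reject", "confirm"]

      clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression())
      clf.fit(utterances, labels)
      print(clf.predict(["when is the next bus leaving"]))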


  4. Detecting Alcohol Intoxication in Speech

    The Munich Alcohol Language Corpus (ALC) contains speech from persons in intoxicated as well as sober states. In order to classify whether a person is intoxicated or not, several combinations of classifiers and feature extraction approaches are to be examined, including acoustic and textual features. Among other things, it is to be investigated how the word error rate and confidence scores produced by a speech recognizer applied to the input speech correlate with the classification accuracy.
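
    A plausible baseline experiment, sketched with scikit-learn; the random feature matrix is only a placeholder for acoustic (or textual) feature vectors extracted from ALC utterances:

      import numpy as np
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 40))        # placeholder feature vectors
      y = rng.integers(0, 2, size=200)      # 1 = intoxicated, 0 = sober

      clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
      print("mean accuracy: %.3f" % cross_val_score(clf, X, y, cv=5).mean())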


  5. Voice Conversion: Development of an Open-Source Toolbox and Application to Standard Databases

    Voice conversion is the transformation of a source speaker's voice to sound like a different speaker (the target speaker). An open-source Octave toolbox recently created at the Spoken Dialog Systems Research Center is to be enhanced by support for shimmer and jitter synthesis, advanced prosodic matching, as well as linear-transformation-based conversion. To demonstrate the effectiveness of the enhancements, the toolbox is to be tested on freely available standard databases (from Oregon Graduate Institute, Carnegie Mellon University, or others).
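
    In its simplest form, the linear-transformation-based conversion can be estimated by least squares over time-aligned source/target feature frames; the random matrices below merely stand in for real parallel training data:

      import numpy as np

      rng = np.random.default_rng(0)
      X_src = rng.normal(size=(1000, 24))   # e.g. source speaker MFCC frames
      X_tgt = rng.normal(size=(1000, 24))   # time-aligned target frames

      # append a bias column and solve min ||X W - Y||^2 for W
      X = np.hstack([X_src, np.ones((len(X_src), 1))])
      W, *_ = np.linalg.lstsq(X, X_tgt, rcond=None)

      def convert(frames):
          # map source feature frames toward the target speaker
          return np.hstack([frames, np.ones((len(frames), 1))]) @ W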


  6. YouTube Closed Captions

    Given the humongous amount of video data stored on YouTube, it is becoming increasingly interesting to effectively search or mine video contents. E.g., it could be of interest to see which broadcast news segments report on the Cannes film festival or which part of a lecture covers the Fourier transform. To this end, an open-source speech recognizer is to be adapted to process large quantities of video data, providing transcriptions and time stamps which can then be exported to YouTube as closed captions. The accuracy of the solution is to be optimized for transcriptions of lectures held at DHBW. Furthermore, the recognizer should be able to process multiple languages.
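
    One conceivable export path is the SubRip (SRT) caption format, which YouTube accepts; a minimal converter from time-aligned recognizer output could look as follows (the segments are illustrative):

      def to_srt(segments):
          # segments: list of (start_sec, end_sec, text) tuples
          def ts(t):
              ms = round(t * 1000)
              h, ms = divmod(ms, 3600000)
              m, ms = divmod(ms, 60000)
              s, ms = divmod(ms, 1000)
              return "%02d:%02d:%02d,%03d" % (h, m, s, ms)
          lines = []
          for i, (start, end, text) in enumerate(segments, 1):
              lines += [str(i), "%s --> %s" % (ts(start), ts(end)), text, ""]
          return "\n".join(lines)

      print(to_srt([(0.0, 2.5, "Welcome to the lecture."),
                    (2.5, 6.0, "Today: the Fourier transform.")]))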


  7. Reverb Challenge

    One problem of automatic speech recognition (ASR) in real-world applications is the difficult acoustic conditions of the room in which the speaker is located. In particular, reverberation can lead to a substantial reduction in ASR performance, which is why there has been increasing research interest in reverberant speech signal processing over the past few years. This year, the international Reverb Challenge was organized to systematically compare state-of-the-art techniques used in the field. This project is to benchmark the ASR technology used at the Spoken Dialog Systems Research Center in conjunction with the reverberation reduction algorithms provided by a partner company.
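
    For controlled experiments, reverberant test data can be simulated by convolving clean speech with a room impulse response (RIR); the exponentially decaying noise tail below is a toy stand-in for measured impulse responses:

      import numpy as np

      rng = np.random.default_rng(0)

      def reverberate(speech, rir):
          # simulate a reverberant recording and normalize the result
          wet = np.convolve(speech, rir)[:len(speech)]
          return wet / max(1e-9, np.abs(wet).max())

      fs = 16000
      t = np.arange(int(0.4 * fs)) / fs               # 400 ms reverb tail
      rir = rng.normal(size=t.size) * np.exp(-t / 0.1)
      rir[0] = 1.0                                    # direct path
      wet = reverberate(rng.normal(size=fs), rir)     # placeholder utterance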


  8. Avatar

    Avatars are virtual human characters used, for instance, on websites to embody a person's alter ego. In addition to a graphical representation, avatars can be equipped with natural language capabilities by way of a spoken dialog system. This project is to provide an open-source avatar, running, e.g., as a Java applet, that incorporates functionality provided by DHBW's spoken dialog system HALEF. In doing so, components from USC's Virtual Humans Group as well as the newest W3C standards on multimodal interaction are to be taken into account.


  9. Speech Recognition in the Age of Cloud Computing and Ubiquitous Internet

    At a time when smartphones and computation in the cloud belong to everybody's daily vocabulary, speech recognition is witnessing an astonishing revival. Voice search, voice operation, self-service agents, and ubiquitous speech processing are hot topics in today's human-machine interface landscape. But how has the explosion of computational power, internet connection speed, and amount of available training data affected the performance of speech recognizers? This project is to compare a multitude of different speech recognizers across several dimensions by running extensive recognition batch tests based on hundreds of thousands of test utterances. Dimensions of particular interest include
    • recognition performance (word error rate; see the sketch following this list),
    • recognition speed,
    • footprint (memory, hard disk space),
    • platform (desktop/server/smartphone/cloud), and
    • license (commercial/free-of-charge/open-source).
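
    As announced above, a minimal sketch of the first dimension: the word error rate is the word-level edit distance between reference and hypothesis, divided by the number of reference words:

      def wer(ref, hyp):
          # dynamic-programming edit distance over words
          r, h = ref.split(), hyp.split()
          d = list(range(len(h) + 1))
          for i, rw in enumerate(r, 1):
              prev, d[0] = d[0], i
              for j, hw in enumerate(h, 1):
                  cur = min(prev + (rw != hw),  # substitution or match
                            d[j] + 1,           # deletion
                            d[j - 1] + 1)       # insertion
                  prev, d[j] = d[j], cur
          return d[-1] / len(r)

      print(wer("the cat sat", "the cat sat down"))  # one insertion -> 0.33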


  10. Development of an Open-Source Voice Browser Prototype

    A web browser communicates with a human user by means of keyboard, mouse, camera, etc. (input) and screen, loudspeaker, etc. (output) while interpreting the contents of HTML pages. Similarly, a voice browser is software that communicates with a human user by means of voice (input and output) while interpreting the contents of VoiceXML pages. Such pages can contain instructions on what the browser is supposed to say (e.g., How may I help you today?) and how to handle a human's speech input (e.g., I would like to buy a heavy metal guitar). In this function, voice browsers serve as the interface between
    • (a) speech recognizer,
    • (b) text-to-speech synthesizer,
    • (c) telephony network, and
    • (d) web server.
    Voice browsers are essential in commercial voice-user interaction systems (aka spoken dialog systems), processing billions of calls every week. As a consequence, voice browsers are typically proprietary software packages developed by specialized software companies. This project is to develop an open-source prototype of a voice browser interacting with open-source components for (a) to (d), for example from Carnegie Mellon University (a to c) or Apache (d). Foundations for this have been laid in former research projects which established a distributed and virtualized infrastructure with a speech recognizer and synthesizer.
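
    A first step toward such a prototype is interpreting VoiceXML itself; the sketch below merely extracts fields and prompts from an illustrative VoiceXML document using Python's standard library:

      import xml.etree.ElementTree as ET

      VXML = """<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
        <form id="order">
          <field name="item">
            <prompt>How may I help you today?</prompt>
          </field>
        </form>
      </vxml>"""

      ns = {"v": "http://www.w3.org/2001/vxml"}
      for field in ET.fromstring(VXML).iterfind(".//v:field", ns):
          prompt = field.find("v:prompt", ns)
          # a full browser would send the prompt to the synthesizer (b)
          # and run the recognizer (a) against the field's grammar
          print(field.get("name"), "->", prompt.text.strip())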


  11. Emotion Analysis of Speech in Human-Machine Phone Conversations

    Many customer service interactions are nowadays carried out by spoken dialog systems (SDSs) replacing the role of a human agent. Unlike human agents, SDSs are generally unable to tell when a caller is getting frustrated. This is one of the main reasons why callers would usually rather speak to a live agent than to an SDS.

    The purpose of this project is to analyze a variety of features (acoustic features, call history, speech recognition and understanding hypotheses, confidence values, and so on) in an attempt to predict the emotional state of a call or caller. The envisioned emotion predictor could cause a call to be escalated to a human agent when severe frustration is detected.
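
    A minimal sketch of such a predictor with an escalation rule; the features, labels, and threshold are placeholders for what the project would derive from annotated calls:

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      X = rng.normal(size=(500, 4))     # e.g. pitch, rate, confidence, re-prompts
      y = rng.integers(0, 2, size=500)  # 1 = annotated as frustrated

      clf = LogisticRegression().fit(X, y)

      def should_escalate(features, threshold=0.8):
          # hand the call to a human agent when the predicted
          # probability of frustration exceeds the threshold
          return clf.predict_proba([features])[0, 1] > threshold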


  12. Reverse-Engineering Siri's Spoken Language Understanding Component

    Siri is the iPhone's voice control assistant which is able to understand users' natural language queries, execute them, and give a spoken response. Voice control assistants such as Siri feature a spoken language understanding (SLU) component that uses the text output of a speech recognizer and extracts semantic entities which are then sent to a dialog manager for execution.

    This student research project is to engineer an open-source SLU component for a voice control assistant. In doing so, several concepts including rule-based semantic grammars, semantic classification, named-entity tagging, and semantic parsing are to be compared. Using the speech recognition and synthesis infrastructure provided by former research projects, the proposed SLU components can be tested in a real-world environment.
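
    Of the concepts to be compared, a rule-based semantic grammar is the simplest; the sketch below uses regular expressions as rules, with intents, patterns, and slots chosen purely for illustration:

      import re

      # each rule maps a surface pattern to an intent plus named slots
      RULES = [
          ("set_alarm", re.compile(r"\bwake me (?:up )?at (?P<time>\d{1,2}(?::\d{2})?)")),
          ("call",      re.compile(r"\bcall (?P<contact>[a-z]+)\b")),
          ("weather",   re.compile(r"\bweather\b(?: in (?P<city>[a-z ]+))?")),
      ]

      def understand(utterance):
          u = utterance.lower()
          for intent, pattern in RULES:
              m = pattern.search(u)
              if m:
                  slots = {k: v for k, v in m.groupdict().items() if v}
                  return {"intent": intent, **slots}
          return {"intent": "unknown"}

      print(understand("Please wake me up at 7:30"))
      # -> {'intent': 'set_alarm', 'time': '7:30'}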


  13. Holmes: Reverse-Engineering Watson

    In recent years, a lot of progress has been made towards modeling the human capacity of answering open-ended questions. The most prominent example is certainly IBM's Watson, which successfully competed with former champions of the quiz show Jeopardy! on U.S. television. Even though Watson's performance is beyond doubt, it is a commercial product whose architecture and underlying data are not available for exploitation by the academic community.

    This project is to establish an open-source text-based question answering (QA) system with an initially limited scope and a performance benchmark. Due to the restriction to a specific domain (e.g., preparation for exams at DHBW), the initial system will exhibit reasonable performance, which is subject to improvement by way of continuous data collection, application testing by DHBW's student body as well as crowdsourcing, and the adoption of more and more sophisticated QA techniques suited for the domain.
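
    An initial retrieval-based baseline could simply rank domain passages by TF-IDF similarity to the question; the three documents below are invented placeholders for a real domain corpus such as lecture notes:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      docs = ["The Fourier transform decomposes a signal into frequencies.",
              "A hidden Markov model is a statistical sequence model.",
              "Unit selection synthesis concatenates recorded speech units."]

      vec = TfidfVectorizer(stop_words="english")
      D = vec.fit_transform(docs)

      def answer(question):
          # return the most similar passage (retrieval only; answer
          # extraction would be a further step)
          return docs[cosine_similarity(vec.transform([question]), D).argmax()]

      print(answer("What does the Fourier transform do?"))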


  14. Interrogator

    Did you ever feel a little unprepared in an oral exam, facing the mercilessness of your almighty professor, his highness? Or when a police officer interrogated you about what you were doing messing around with an allegedly stolen vehicle? What about the U.S. border protection personnel in whose presence one should always show the right amount of diligence in responding to their questions?

    The project at hand (Interrogator) is to build a spoken dialog system (SDS) that engages you in a conversation as stressful as those described above. The Interrogator is to prepare you for these real-world cases, providing you with the right amount of domain knowledge and stimulating your self-confidence to optimally manage said situations.

    Interrogator is to be built upon the open-source SDS framework HALEF running at DHBW Stuttgart, enhancing the baseline system by
    • encoding the necessary domain models (for speech recognition and understanding as well as dialog management),
    • implementing the required expressiveness (or, better, harshness) of the system's output voice, and
    • building a tangible demonstrator system.


  15. Machine Learning in the Medical Domain

    Fundamental machine learning algorithms for classification, regression, and clustering are to be applied to a medical field of the student's choice. Possible areas include (but are not limited to)
    • detection of breast cancer from digitized images
    • DNA microarray analysis for cancer classification
    • DNA repair recognition
    • DNA splicing boundary detection
    • prediction of Parkinson's disease
    For this research, freely available corpora can be deployed (such as the Wisconsin Breast Cancer Data Set or the Molecular Biology Splice-Junction Gene Sequences Data Set).
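
    As a concrete starting point, the Wisconsin diagnostic breast cancer data set ships with scikit-learn, so a first classification baseline fits in a few lines:

      from sklearn.datasets import load_breast_cancer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      # 569 samples, 30 features computed from digitized images
      X, y = load_breast_cancer(return_X_y=True)
      clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
      print("accuracy: %.3f" % cross_val_score(clf, X, y, cv=10).mean())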


  16. Classification of Mental States Using EEG

    At DHBW, preliminary studies classifying eye aperture (i.e., whether subjects' eyes are open or closed) from brain waves measured with an EEG have produced very promising results. The next step is to investigate whether more complex states can be robustly distinguished, e.g., which characters subjects are seeing on a screen, which words they intend to pronounce, or which topic they are thinking about. To this end, state-of-the-art feature extraction and machine learning techniques are to be applied, and both consumer and medical EEGs are to be compared.
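
    A common feature set for such experiments is the signal power in the classic EEG frequency bands (alpha power, for instance, rises markedly when the eyes are closed); a minimal sketch with a placeholder signal:

      import numpy as np

      def band_power(eeg, fs, band):
          # mean spectral power of one EEG channel in a frequency band
          spectrum = np.abs(np.fft.rfft(eeg)) ** 2
          freqs = np.fft.rfftfreq(len(eeg), d=1.0 / fs)
          lo, hi = band
          return spectrum[(freqs >= lo) & (freqs < hi)].mean()

      BANDS = {"delta": (1, 4), "theta": (4, 8),
               "alpha": (8, 13), "beta": (13, 30)}

      fs = 256
      eeg = np.random.default_rng(0).normal(size=fs * 2)  # placeholder channel
      features = [band_power(eeg, fs, b) for b in BANDS.values()]
      print(dict(zip(BANDS, features)))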