Speech scientists are dead. Interaction designers are dead. Who is next?

David Suendermann

SLTC Newsletter, April 2010

This article describes how the roles of the professionals usually required to build commercial spoken dialog systems are changing in the dawning age of rigorously exploiting petabytes of data.

Commercial spoken dialog systems can process millions of calls per week, producing considerable savings for the enterprises deploying them. Because of the high traffic hitting such applications, savings can fluctuate considerably with varying automation rates and call durations, two factors directly impacting the financial business model behind many commercial deployments. Accordingly, the main goals of system design, implementation, and tuning are maximizing the automation rate, minimizing the average handling time, or optimizing a reward function combining the two.
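To make that reward concrete, here is a minimal sketch of a per-call reward combining automation and handling time, assuming purely illustrative weights (in a real deployment they would be derived from the financial business model):

```python
def reward(automated: bool, handling_time_s: float,
           automation_value: float = 5.0, cost_per_second: float = 0.01) -> float:
    """Hypothetical per-call reward: the value of a fully automated call minus
    a cost proportional to the call's handling time. The two weights are
    placeholders, not figures from any actual deployment."""
    return (automation_value if automated else 0.0) - cost_per_second * handling_time_s
```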

Actions to drive system performance include tuning grammars, enhancing system functionality, rewording prompts, putting activities in a different order, and so on. Traditionally, these actions are performed by speech scientists and interaction designers who implement changes according to best-practice guidelines and their own long-standing experience. The impact of changes can be measured by comparing automation rates and handling times before and after the fact. To overcome performance variation due to external factors, such as shifts in the caller population or day-of-week effects, a more precise comparison can be carried out by deploying the baseline and the `improved' system at the same time and splitting the call mass between them. Data needs to be collected until the performance difference is found to be statistically significant.
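Such a comparison boils down to a standard significance test on the collected counts. A minimal sketch, assuming the two systems are compared on their automation rates with a pooled two-proportion z-test (one of several reasonable choices):

```python
import math

def automation_rate_p_value(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test comparing the
    automation rates of a baseline system A and a candidate system B."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    p_pool = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value via the standard normal CDF (expressed with erf)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Keep splitting the call mass until, e.g., the p-value drops below 0.05:
# p = automation_rate_p_value(4200, 10000, 4350, 10000)
```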

Since the applications we are talking about in this article can be highly trafficked, statistical significance of the performance difference between two systems can normally be established after a relatively short period of time, so there is nothing preventing us from exploring more than two systems at a time. In fact, one can implement a variety of changes at different points in the application and randomly choose one competitor every time the point is hit in the course of a call. This `contender' approach can be used to resolve arbitrary uncertainties arising during the design phase, e.g., which prompt, which disambiguation strategy, which order, which grammar, or which parameter setting is better. Based on observed performance differences and the amount of traffic hitting each contender condition, the call mass going to each of the conditions can be adjusted on an ongoing basis to optimize the overall reward of the application while awaiting statistically significant results.
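As an illustration of how the call mass can be shifted between contenders, here is a sketch of a simple epsilon-greedy routing rule at one decision point; the contender names and the reward bookkeeping are hypothetical, and a production system may well use a more refined allocation scheme:

```python
import random

def pick_contender(stats: dict, epsilon: float = 0.1) -> str:
    """Epsilon-greedy choice among contenders at one decision point.

    stats maps a contender name to (total_reward, num_calls). With probability
    epsilon we explore a random contender; otherwise we exploit the one with
    the best observed average reward so far."""
    if random.random() < epsilon or all(n == 0 for _, n in stats.values()):
        return random.choice(list(stats))
    return max(stats, key=lambda c: stats[c][0] / max(stats[c][1], 1))

# Example bookkeeping for one recognition context (illustrative numbers):
# stats = {"directed_dialog": (820.0, 400), "open_prompt": (910.0, 400)}
# choice = pick_contender(stats)
```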

Now, the contender approach can change the lives of interaction designers and speech scientists in that best practices and experience-based decisions can be replaced by the straightforward implementation of every alternative one can think of. Is directed dialog best in this context? Or an open prompt? An open prompt with an example? Or two? Or an open prompt that offers a backup menu? Or a yes/no question followed by an open prompt when the caller says no? What are the best examples? How long should I wait before offering the backup menu? What is the ideal confirmation threshold? What about the voice activity detection sensitivity? When should I time out? What is the best strategy following a no-match? Touch-tone in the first or only in the second no-match prompt? Or should I go directly to the backup menu after a no-match? And what in the case of a time-out? Et cetera. Nobody needs to know the answers from gut feeling. Data will tell.

Contender (or what the research community also refers to as reinforcement learning) is only one of many techniques for optimizing application performance based on massive amounts of available data. One can also estimate how likely a call is to end up being escalated to an agent and, once that likelihood exceeds a certain threshold, escalate right away to keep unsuccessful calls as short as possible. Data will tell.
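A sketch of such an early-escalation rule is below; the coefficients are made up for illustration and stand in for a model (e.g., a logistic regression) that would, in practice, be fit on logged call outcomes:

```python
import math

def should_escalate(no_matches: int, no_inputs: int, retries: int,
                    threshold: float = 0.8) -> bool:
    """Estimate the probability that the call will end up with an agent
    anyway and hand it over early once that estimate exceeds a threshold.
    The weights below are illustrative placeholders, not fitted values."""
    score = -2.0 + 0.9 * no_matches + 0.7 * no_inputs + 0.5 * retries
    p_escalation = 1.0 / (1.0 + math.exp(-score))
    return p_escalation >= threshold
```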

The duration of automated calls may also be substantially reduced by reordering activities according to their information gain measured on live data. Asking for (and back-end-querying) information items in order of relevance minimizes the average number of items that need to be gathered. Data will tell.
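A sketch of ranking information items by their information gain over logged calls; the field names (and the idea that each call is a flat dictionary of item values plus an outcome) are assumptions made for illustration:

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy of a list of outcome labels (e.g. call resolutions)."""
    counts, total = Counter(labels), len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(records, item, outcome_key="resolution") -> float:
    """Expected reduction in outcome entropy from asking one information item.

    records: list of per-call dicts from production logs (hypothetical format);
    item: the key of the information item whose usefulness we measure."""
    base = entropy([r[outcome_key] for r in records])
    remainder = 0.0
    for value in {r[item] for r in records}:
        subset = [r[outcome_key] for r in records if r[item] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

# Ask items in descending order of information gain:
# ordered = sorted(items, key=lambda i: information_gain(calls, i), reverse=True)
```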

The never-ending headache of speech scientists, how to overcome the omnipresent weakness of speech recognition, can also be cured by data. Instead of carefully tweaking rule-based grammars, user dictionaries, and confidence thresholds, there is a lazy but high-performing recipe: systematically collect large numbers of utterances from all the contexts of a spoken dialog system, transcribe them, annotate them for their semantic meaning, and train statistical language models and classifiers to replace the grammars previously used in these recognition contexts. The performance improvement can be predicted offline on separate test sets, and confidence thresholds can be tuned. Data will tell.
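A minimal sketch of the last step, assuming transcribed and annotated utterances are available and using scikit-learn as a stand-in for whichever classifier toolkit a given deployment relies on:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data; in practice, utterances come from production calls and
# the labels from manual semantic annotation.
utterances = ["i want to pay my bill", "my internet is not working",
              "pay the bill please", "no internet connection"]
labels = ["billing", "tech_support", "billing", "tech_support"]

classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                           LogisticRegression(max_iter=1000))
classifier.fit(utterances, labels)

# Offline evaluation on a separate annotated test set would predict the gain
# over the old grammar and drive the choice of a confidence threshold, e.g.
# rejecting hypotheses whose maximum class probability is too low.
confidences = classifier.predict_proba(["my bill is wrong"]).max(axis=1)
```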

Using the above methods and other techniques that exploit production data of spoken dialog systems on a large scale, speech scientists and interaction designers are getting lazy and can spend more time at coffee breaks, Friday afternoon bashes, and Midtown skyscraper parties while the servers run hot doing our jobs.

David Suendermann is the principal speech scientist of SpeechCycle and focuses his research on spoken dialog systems, voice conversion, and machine learning.    Email: david@speechcycle.com    WWW: suendermann.com