Baidu's Deep Voice can quickly synthesize realistic human speech

Baidu has been quietly working on other projects besides self-driving cars at its AI center in Silicon Valley, and now it has revealed one of them to MIT Technology Review. Apparently, the Chinese tech titan has created a text-to-speech system called Deep Voice that’s faster and more efficient than Google’s WaveNet. The company says Deep Voice can be trained to speak in just a few hours with little to no human interaction. And because Baidu can control how the system speaks to convey different emotions, Deep Voice can quickly synthesize speech that sounds pretty natural and realistic.

Google’s WaveNet can also synthesize realistic human speech, but it’s quite computationally demanding and hard to use for real-world applications at this point. Baidu says it solved WaveNet’s problem by using deep-learning techniques to convert text to phonemes, the smallest units of speech. It then turns those phonemes into sounds using its speech synthesis network. The system converts the word “hello,” for instance, into “(silence, HH), (HH, EH), (EH, L), (L, OW), (OW, silence)” before the speech network pronounces it.
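The pairing in the “hello” example can be reproduced with a few lines of Python. This is only a minimal sketch of the notation, not Baidu's method: the function name phoneme_pairs and the "silence" boundary marker are made up for illustration, and the phoneme list is written out by hand, whereas Deep Voice is said to use a deep-learning model for that step.

```python
def phoneme_pairs(phonemes, boundary="silence"):
    """Pair each phoneme with its successor, padding both ends with a boundary marker.

    Hypothetical helper for illustration only; it mirrors the notation in the
    article and is not taken from Deep Voice itself.
    """
    padded = [boundary, *phonemes, boundary]
    # zip stops at the shorter sequence, so this yields every adjacent pair once.
    return list(zip(padded, padded[1:]))


# Phonemes for "hello", typed in by hand (an assumption: the real system
# predicts them from text with a neural network).
HELLO = ["HH", "EH", "L", "OW"]

pairs = phoneme_pairs(HELLO)
print(pairs)
```

Each pair tells the speech network which sound it is leaving and which one it is moving into, which is why the word is padded with silence at both ends.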

Both steps rely on deep learning and don’t need human input. However, the system doesn’t control which phonemes or syllables are stressed and how long they’re pronounced. That’s where Baidu steps in: it adjusts the stress and duration to change the emotion it wants the speech to convey.

While the company says Deep Voice has solved WaveNet’s problem, it still requires a ton of computing power. A computer has to generate words to say in 20 microseconds to mimic human-like interaction. Baidu’s researchers explain:

“To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units.”
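To put the 20-microsecond figure in perspective, here is a rough back-of-the-envelope calculation. It assumes the figure is the time budget for a single output step; the article does not say exactly what unit of output that budget covers, so the names and the interpretation below are illustrative only.

```python
# Time budget quoted in the article, in microseconds.
BUDGET_US = 20
US_PER_SECOND = 1_000_000

# Assumption: one output step must finish within the budget, so this is
# how many steps per second the system would have to sustain.
steps_per_second = US_PER_SECOND // BUDGET_US


def fits_budget(step_time_us, budget_us=BUDGET_US):
    """Return True if a step taking step_time_us microseconds meets the budget."""
    return step_time_us <= budget_us


print(steps_per_second)  # 50000
```

Under that assumption the system has to complete 50,000 steps every second, which helps explain why the researchers worry about recomputation and about keeping the model in the processor cache.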

Still, the researchers believe real-time speech synthesis is possible. They’ve already generated samples quickly and collected feedback through Amazon’s Mechanical Turk, asking a large number of people on the service to rate the samples. The results indicate that the samples are of excellent quality.

Source: MIT Technology Review

Source: Engadget

Author: Daily Tech Whip
