The Pacific--

Baidu US Laboratory Breakthrough in Speech Recognition

MIT’s Technology Review magazine recently identified 10 breakthrough technologies that are most likely to solve big problems and open up new commercial opportunities in 2016. One of the 10 is Baidu's conversational interface.

Voice interfaces have been a dream of scientists for many decades and a key focus of Artificial Intelligence research ever since the discipline was founded. In recent years, impressive advances in machine learning have brought voice-operated control to many consumer electronics devices. Voice-operated virtual assistants such as Apple's Siri, Microsoft's Cortana, Google Now and Amazon’s Alexa come bundled with most smartphones and other devices, offering a simple way to look up information, cue up songs and build shopping lists with the user’s voice. While these systems are not perfect, often misinterpreting commands in comical ways, they are improving steadily and will hopefully be easy to use one day.

Late last year, Baidu announced a breakthrough from its Silicon Valley laboratory: Deep Speech 2, a speech recognition engine developed to handle Chinese Putonghua.

Deep Speech 2 consists of a very large, or “deep,” neural network that learns to associate sounds with words and phrases as it is fed millions of examples of transcribed speech. It can recognize spoken words with stunning accuracy, and it can sometimes transcribe snippets of Putonghua better than a person.

Baidu's progress is impressive because Chinese Putonghua is phonetically complex and uses four tones that can transform the meaning of a word. Deep Speech 2 is also striking because few of the researchers in the California lab where the technology was developed speak Chinese Putonghua or any other variant of Chinese. The engine essentially works as a universal speech system, learning English just as well when fed enough examples.

Most of the voice commands that Deep Speech 2 handles today are simple queries, for example about the weather or pollution levels. For these, the system is impressively accurate. Increasingly, however, users are asking more complicated questions. Baidu launched its own voice assistant, called Duer, last year as part of its main mobile app in China. Duer can handle more complicated tasks such as helping users find cinema screening times or book tables at restaurants.

The big challenge for Baidu will be teaching its AI system to understand and respond intelligently to even more complicated spoken phrases. A research group at Baidu's Beijing research laboratory is working on refining the system that interprets users' queries. This involves the kind of neural-network technology that Baidu has applied to voice recognition, and perhaps other forthcoming breakthroughs as well. Eventually, Baidu would like Duer to take part in meaningful back-and-forth conversations, incorporating changing information into the discussion.

China is an ideal place for voice interfaces to take off, because Chinese characters were hardly designed with tiny touch screens in mind. Smartphones are more common in China than anywhere else. A good voice interface can enhance the functionality of a smartphone and make it the center of future digital convergence. People all over the world will benefit as Baidu advances speech technology and makes voice interfaces more practical and useful. In fact, service robots and smart home appliances are two areas in which many people expect a market explosion once voice recognition is reliable enough to be used for interaction. Robots and smart appliances would be easier to deal with if you could simply talk to them.