Speech Technology at Alibaba

Talk to your TV set, talk to ticket-selling machines and beyond

Alibaba has deployed AI technologies, including speech technology, natural language processing (NLP), video technology, image technology, and machine learning, across a broad range of applications in E-commerce, Financial Services, New Manufacturing, and New Retail. Across these diverse enterprise scenarios, Alibaba has accumulated a wealth of knowledge derived from Internet-based big data, making itself a world leader in AI application.

In Speech Technology



Deep neural networks have become the dominant acoustic model used in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. These networks include both Feed-forward Neural Networks (FNNs) and Recurrent Neural Networks (RNNs). Although RNNs have been shown to significantly outperform FNNs, training an RNN usually relies on Back Propagation Through Time (BPTT) because of the network's internal recurrent cycles. BPTT significantly increases the computational complexity of learning and can also cause problems such as vanishing and exploding gradients.
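To make the vanishing and exploding gradient problem concrete, here is a minimal sketch that assumes a toy scalar RNN, h_t = tanh(w * h_{t-1} + x_t). The function name `bptt_gradient` and the toy model are illustrative assumptions, not part of any Alibaba system. BPTT multiplies one local derivative per time step, so the gradient with respect to the initial state shrinks or grows geometrically with sequence length.

```python
import math


def bptt_gradient(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the toy scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    # Forward pass: keep every hidden state, as BPTT needs them all.
    states = [h0]
    for x in inputs:
        states.append(math.tanh(w * states[-1] + x))

    # Backward pass: chain the local derivatives back through time.
    grad = 1.0
    for h in reversed(states[1:]):
        grad *= w * (1.0 - h * h)  # d tanh(a) / d a = 1 - tanh(a)^2
    return grad


if __name__ == "__main__":
    steps = [0.0] * 50  # 50 time steps with zero input keep the state at 0
    print("w = 0.5:", bptt_gradient(0.5, steps))  # shrinks toward zero
    print("w = 1.5:", bptt_gradient(1.5, steps))  # grows very large
```

With zero inputs the state stays at zero, so the gradient reduces to w raised to the number of steps: below 1e-15 for w = 0.5 and above 1e8 for w = 1.5 after 50 steps. The backward loop also has to revisit every stored state, which is where the extra training cost comes from.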

Long short-term memory (LSTM) layers are common building blocks of recurrent neural networks (RNNs) and facilitate the application of RNNs to sequential modeling tasks, such as machine translation. Through its recurrent inputs, an LSTM layer assumes that its current state (as stored in the memory cell) depends on the state of the same layer at the previous time step. This one-step time dependency restricts the layer's ability to model temporal information and represents a major constraint of LSTM layers in RNNs.
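The one-step dependency can be seen in a minimal sketch of the standard LSTM update, written here for scalar inputs and states to stay self-contained. The function `lstm_step` and the parameter layout `p` are hypothetical names chosen for illustration. Note that the new state is computed only from the current input and the state at the immediately preceding step.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p maps each gate name to a (w_x, w_h, b) tuple (illustrative layout).
    Only h_prev and c_prev, the states at time t-1, feed into time t.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # input gate
    f = sigmoid(pre("forget"))       # forget gate
    o = sigmoid(pre("output"))       # output gate
    g = math.tanh(pre("candidate"))  # candidate cell content

    c = f * c_prev + i * g           # memory cell: depends on c_{t-1} only
    h = o * math.tanh(c)             # hidden state exposed to the next layer
    return h, c


if __name__ == "__main__":
    gates = ("input", "forget", "output", "candidate")
    params = {name: (0.5, 0.5, 0.0) for name in gates}
    h, c = 0.0, 0.0
    for x in [1.0, 0.5, -0.5]:
        h, c = lstm_step(x, h, c, params)
        print(h, c)
```

Information from earlier steps reaches the current step only by surviving repeated multiplication by the forget gate, which is the constraint described above.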

Speaker-dependent acoustic models help speech synthesis systems produce accurate results. Given an adequate amount of training data from a target speaker, a speech synthesis system can generate speech that sounds similar to that speaker. However, obtaining enough data from target speakers is a persistent constraint.

Emotion recognition is gaining importance as a way to improve user experience and engagement in human-computer interfaces (HCI). Building emotion recognition systems on speech, as opposed to facial expressions, has practical benefits because of its low hardware requirements. However, these benefits are partly negated in practical applications, where real-world background noise impairs speech-based emotion recognition performance.
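One common way to study this effect, shown here only as a hedged sketch and not as the approach used by Alibaba, is to mix clean speech with noise at a chosen signal-to-noise ratio (SNR) and measure how recognition accuracy changes. The function name `mix_at_snr` and the use of plain lists as waveforms are assumptions made for illustration.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the mixture has the requested SNR in dB.

    speech and noise are equal-length lists of samples (illustrative format).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")

    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")

    # Scale the noise so that speech_power / scaled_noise_power = 10^(snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    background = [0.5, 0.5, -0.5, -0.5]
    print(mix_at_snr(clean, background, snr_db=10.0))
```

Sweeping `snr_db` from high to low values gives a simple picture of how quickly a speech-based recognizer degrades as background noise increases.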

Text-to-Speech (TTS) systems are an essential part of human-computer interaction. For current Internet-of-Things (IoT) devices, such as smart speakers and smart TVs, speech is the most efficient and accessible way for the user and the device to understand each other through instructions and feedback. However, one issue that commonly hampers user experience is that users perceive machine-generated speech as unnatural or non-human-like. Overcoming this obstacle has been a major challenge for TTS systems to date.

Alibaba Tech

First-hand and in-depth information about Alibaba’s latest technology → Search “Alibaba Tech” on Facebook

First-hand & in-depth information about Alibaba's tech innovation in Artificial Intelligence, Big Data & Computer Engineering. Follow us on Facebook!
