Making Sure Your Voice Is Heard: Deep-FSMN in CTC-Based Speech Recognition

This article is part of the Academic Alibaba series and is taken from the paper entitled “Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning” by Shiliang Zhang and Ming Lei, accepted by Interspeech 2018. The full paper can be read here.

What empowers voice assistants, such as Apple’s Siri and Amazon’s Alexa, to recognize what you are saying? Until recently, speech recognition systems, or more precisely large vocabulary continuous speech recognition (LVCSR) systems, mostly relied on various types and combinations of deep neural networks to power their acoustic models. Known as hybrid systems, they could include feedforward fully-connected neural networks (FNN), convolutional neural networks (CNN), recurrent neural networks (RNN) and long short-term memory networks (LSTM) to name but a few.

More recently, connectionist temporal classification (CTC)-based acoustic models using LSTM have achieved comparable or even better performance than hybrid systems. However, LSTMs are computationally expensive and sometimes difficult to train with the CTC criterion.

The Alibaba tech team’s approach is to replace the LSTM with deep feedforward sequential memory networks (DFSMN) in the CTC model. Inspired by recent work on DFSMN, they explore how this type of non-recurrent model behaves when trained with the CTC loss.

Building a Good Listener

Conventional hybrid approaches use deep neural networks to classify the individual frames of acoustic data. The networks’ output distributions are then reformulated as emission probabilities for a hidden Markov model (HMM). Model training is carried out using the frame-level cross-entropy (CE) criterion, followed by sequence discriminative training methods such as maximum mutual information (MMI).
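To make the frame-level setup concrete, here is a minimal sketch in PyTorch (not the paper’s code; all dimensions and names are illustrative). The network predicts an HMM-state posterior for every frame, and the training targets must already be aligned frame by frame, typically by an HMM forced alignment.

```python
import torch
import torch.nn as nn

num_frames, feat_dim, num_hmm_states = 300, 40, 2000  # made-up sizes

# Acoustic features for one utterance and the frame-level state targets;
# in a real hybrid system the targets come from an HMM forced alignment.
features = torch.randn(num_frames, feat_dim)
alignment_targets = torch.randint(0, num_hmm_states, (num_frames,))

# A simple feedforward network predicting an HMM-state posterior per frame.
model = nn.Sequential(
    nn.Linear(feat_dim, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, num_hmm_states),
)

# Frame-level CE loss against the aligned targets; at decoding time the
# posteriors are rescaled into HMM emission probabilities.
loss = nn.CrossEntropyLoss()(model(features), alignment_targets)
loss.backward()
```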

One problem with such systems is that the frame-level training targets must be inferred from alignments determined by the HMM. This has led researchers to focus on end-to-end speech recognition systems, namely attention-based encoder-decoder models and the aforementioned CTC. Both regard speech recognition as a sequence-to-sequence mapping task and handle variable-length input and output sequences.

The key idea of CTC is to use an intermediate label representation that allows repetitions of labels and occurrences of a blank label, which marks less informative frames. CTC-based acoustic models can automatically learn the alignments between speech frames and target labels, which removes the need for frame-level training targets.
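This alignment-free behaviour can be illustrated with PyTorch’s built-in CTC loss (a hedged sketch; sizes and label values are invented). Only the label sequence is supplied; the loss sums over every frame-level alignment, with blanks and repetitions, that collapses to that sequence.

```python
import torch
import torch.nn as nn

num_frames, batch, num_labels = 100, 1, 50     # label 0 is reserved for blank
target = torch.tensor([[7, 3, 3, 12]])         # label sequence, no alignment given

# Per-frame log-probabilities over labels + blank (e.g. from an LSTM or DFSMN).
log_probs = torch.randn(num_frames, batch, num_labels,
                        requires_grad=True).log_softmax(dim=-1)

# CTC needs only sequence lengths, not frame-by-frame targets.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, target,
                torch.tensor([num_frames]),      # input length
                torch.tensor([target.size(1)]))  # target length
loss.backward()
```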

LSTM is the most popular choice for end-to-end speech recognition systems with CTC. Experimental results show that CTC-based acoustic models achieve better performance than conventional hybrid models, and that models built on bidirectional LSTM (BLSTM) significantly outperform the unidirectional version.

However, there is a latency problem which prevents the application of CTC with BLSTM to low-latency online speech recognition. Also, CTC training of both unidirectional and bidirectional LSTMs requires unrolling the network by the length of the input sequence, which consumes a huge amount of memory, especially when the sequence is very long.

Speaking the Same Language

To combat these problems, the Alibaba tech team has replaced the LSTM with DFSMN in CTC-based acoustic models. Deep-FSMN (DFSMN) is an improved FSMN architecture that introduces skip connections and memory strides. The team evaluated the performance of DFSMN-CTC acoustic models on various LVCSR tasks with 1,000, 4,000 and 20,000 hours of training data. Experimental results show that DFSMN-CTC with either context-independent phone (CI-Phone) or context-dependent phone (CD-Phone) targets can significantly outperform the conventional hybrid DFSMN-CE model.
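As a rough illustration of those two ingredients, the sketch below implements a single, heavily simplified DFSMN-style layer in PyTorch (my own simplification, not the authors’ code; it omits the linear projection layers of the real architecture). Each memory block filters strided past and future frames and adds a skip connection from the previous memory block.

```python
import torch
import torch.nn as nn

def shifted(x, n):
    """Shift a (time, dim) sequence by n frames (n > 0: past, n < 0: future)."""
    pad = torch.zeros(abs(n), x.size(1))
    return torch.cat([pad, x[:-n]]) if n > 0 else torch.cat([x[-n:], pad])

class DFSMNLayer(nn.Module):
    def __init__(self, dim, lookback=10, lookahead=2, stride=2):
        super().__init__()
        self.stride = stride
        # Element-wise filter coefficients over strided past/future frames.
        self.back = nn.Parameter(torch.zeros(lookback, dim))
        self.ahead = nn.Parameter(torch.zeros(lookahead, dim))

    def forward(self, hidden, prev_memory=None):
        memory = hidden
        for i, coeff in enumerate(self.back, start=1):
            memory = memory + coeff * shifted(hidden, i * self.stride)   # memory stride
        for j, coeff in enumerate(self.ahead, start=1):
            memory = memory + coeff * shifted(hidden, -j * self.stride)
        if prev_memory is not None:
            memory = memory + prev_memory      # skip connection between memory blocks
        return memory

# Usage sketch: stack layers, feeding each memory block's output to the next.
x = torch.randn(200, 512)
layer1, layer2 = DFSMNLayer(512), DFSMNLayer(512)
m1 = layer1(x)
m2 = layer2(m1, prev_memory=m1)
```

The skip connections are intended to let gradients flow directly between memory blocks so that very deep stacks remain trainable, while the strides widen the temporal context without adding filter coefficients.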

The joint CTC and CE learning framework for DFSMN-based acoustic modeling

They also found that CTC-based acoustic models are more robust to speaking rate than CE-based models. However, CTC-based models can suffer from a latency problem: an output target may only be detected some time after its corresponding input event. To handle this, the team created a joint CTC-CE learning framework (see above) that uses the CTC-blank posterior as a regularization term. This helps to improve the stability of CTC training and the performance of DFSMN-CTC-based acoustic models.
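The sketch below shows only the general shape of such a joint objective in PyTorch: a shared encoder with a CTC output layer and a CE output layer whose losses are interpolated. The paper’s specific blank-posterior regularization is not reproduced here, and every name, weight and dimension is an illustrative assumption.

```python
import torch
import torch.nn as nn

T, feat_dim, hid, ctc_labels, cd_states = 150, 40, 512, 50, 2000
alpha = 0.5  # illustrative interpolation weight between the two criteria

encoder = nn.Sequential(nn.Linear(feat_dim, hid), nn.ReLU(),
                        nn.Linear(hid, hid), nn.ReLU())  # stand-in for a DFSMN stack
ctc_head = nn.Linear(hid, ctc_labels)   # phone targets plus blank (index 0)
ce_head = nn.Linear(hid, cd_states)     # frame-level HMM-state targets

feats = torch.randn(T, feat_dim)
phone_seq = torch.tensor([[7, 3, 12, 9]])          # sequence-level CTC targets
frame_align = torch.randint(0, cd_states, (T,))    # CE targets from forced alignment

h = encoder(feats)
ctc_logp = ctc_head(h).log_softmax(-1).unsqueeze(1)          # (T, batch=1, labels)
loss_ctc = nn.CTCLoss(blank=0)(ctc_logp, phone_seq,
                               torch.tensor([T]),
                               torch.tensor([phone_seq.size(1)]))
loss_ce = nn.CrossEntropyLoss()(ce_head(h), frame_align)

loss = alpha * loss_ctc + (1 - alpha) * loss_ce              # joint objective
loss.backward()
```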

The performance of the joint CTC-CE trained DFSMN models on the 20,000-hour training set

In a 20,000-hour Mandarin recognition task (see above), the joint CTC-CE trained DFSMN achieved 11.0% and 30.1% relative performance improvements compared to DFSMN-CE models on the normal-speed and fast-speed test sets, respectively.

The full paper and results can be read here.

Alibaba Tech

First-hand and in-depth information about Alibaba’s latest technology → Facebook: “Alibaba Tech”. Twitter: “AlibabaTech”.
