This article is part of the Academic Alibaba series and is taken from the paper entitled “Compact Feedforward Sequential Memory Networks for Small-footprint Keyword Spotting” by Mengzhe Chen, Shiliang Zhang, Ming Lei, Yong Liu, Haitao Yao, and Jie Gao, accepted by Interspeech 2018. The full paper can be read here.
From Siri to Cortana to Alexa, voice assistants are rapidly becoming a part of everyday life. These voice assistants are largely powered by keyword spotting, the task of detecting pre-defined words in an audio stream.
The traditional solution for keyword spotting relies on large vocabulary continuous speech recognition (LVCSR): audio input is decoded, and keywords are searched for in the resulting lattice. However, given limited device resources and the complications of real-world audio, the need for a compact model with high precision, low computational cost, and low latency is growing. In a world where keyword spotting is only getting faster, LVCSR can’t keep pace.
Alibaba’s tech team has proposed a compact Feedforward Sequential Memory Network (cFSMN), which combines low-rank matrix factorization with the conventional FSMN architecture. With the addition of multiframe prediction (MFP), the system allows for faster and more effective keyword recognition.
Listening and Learning
Many keyword-spotting models use whole words as the modeling unit. The researchers, however, wanted cFSMN to be flexible enough to add new keywords as the product develops, in effect learning new words, so they designed the system to use senones as the modeling unit instead. The decoding graph consists of keyword paths and background paths: each keyword path is a sequence of hidden Markov models (HMMs) for one keyword, while background paths model non-keyword speech, ambient noise, and silence.
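To make the keyword-path/background-path idea concrete, here is a minimal sketch of how such a decoder might compare the two paths, given frame-level senone log-posteriors. Everything here (function names, the single-senone background loop, the margin threshold) is an illustrative assumption, not the paper's actual decoder:

```python
import numpy as np

def viterbi_path_score(log_post, state_seq):
    """Best log-score of a left-to-right HMM (self-loop or advance one state)
    over the senone sequence state_seq, given per-frame log-posteriors.
    log_post: (T, S) array of frame-level senone log-posteriors."""
    T = log_post.shape[0]
    K = len(state_seq)
    NEG = -1e30
    score = np.full(K, NEG)
    score[0] = log_post[0, state_seq[0]]
    for t in range(1, T):
        new = np.full(K, NEG)
        for k in range(K):
            stay = score[k]                       # self-loop in state k
            adv = score[k - 1] if k > 0 else NEG  # advance from state k-1
            new[k] = max(stay, adv) + log_post[t, state_seq[k]]
        score = new
    return score[K - 1]  # the path must end in the keyword's last senone

def detect_keyword(log_post, keyword_senones, background_senone, threshold):
    """Fire when the keyword path beats the background path by a margin.
    The background path is modeled here as a single looping filler senone."""
    kw = viterbi_path_score(log_post, keyword_senones)
    bg = log_post[:, background_senone].sum()
    return kw - bg > threshold
```

In a real system the background path would cover many filler units rather than one looping senone, but the comparison of path scores is the core of graph-based keyword spotting.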
A standard FSMN augments its hidden layers with learnable memory blocks. The proposed compact FSMN adds low-rank matrix factorization, reducing computational cost without sacrificing performance. Unlike recurrent neural networks, it can be trained efficiently and reliably with standard backpropagation, and multiframe prediction further speeds up both training and decoding.
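In rough terms, a cFSMN layer projects each hidden activation into a low-dimensional space (the low-rank factorization) and augments it with a learned weighted sum of neighboring projections (the memory block). The following NumPy sketch illustrates that structure; all names, shapes, and the ReLU nonlinearity are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def cfsmn_layer(h_prev, W_low, W_up, b, a_past, a_future):
    """One cFSMN layer (illustrative sketch).
    h_prev:   (T, D) hidden activations from the layer below
    W_low:    (D, P) low-rank projection, P << D (the factorization step)
    W_up:     (P, D) projection back up to the next hidden layer
    a_past:   (N1, P) learnable per-dimension weights for past frames
    a_future: (N2, P) learnable per-dimension weights for future frames"""
    T = h_prev.shape[0]
    p = h_prev @ W_low            # low-dimensional projection, (T, P)
    p_mem = p.copy()              # memory block: add weighted neighbors
    N1, N2 = a_past.shape[0], a_future.shape[0]
    for t in range(T):
        for i in range(1, N1 + 1):           # look back up to N1 frames
            if t - i >= 0:
                p_mem[t] += a_past[i - 1] * p[t - i]
        for j in range(1, N2 + 1):           # look ahead up to N2 frames
            if t + j < T:
                p_mem[t] += a_future[j - 1] * p[t + j]
    return np.maximum(0.0, p_mem @ W_up + b)  # next hidden layer (ReLU)
```

Because the memory block is a fixed-order weighted sum rather than a recurrent state, the whole computation is feedforward, which is what makes training with plain backpropagation straightforward.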
Listening in the Field
The system was trained on a four-syllable Mandarin Chinese keyword using a 24,000-hour simulated training set, with two real-recorded test sets, one in a quiet environment and one in a noisy one, serving as positive examples. To evaluate the system’s false alarm rate, it was also exposed to 600 hours of negative examples, sourced from material ranging from music to broadcast news.
From there, the cFSMN system was evaluated on its false reject rate, false alarm rate, latency, and computational cost, measured in floating-point operations per second (FLOPS). Deep neural network (DNN) and long short-term memory (LSTM) systems served as baselines for comparison. Compared with a well-tuned LSTM requiring the same latency and twice the computational cost, the cFSMN achieves relative AUC reductions of 18.11% and 29.21% on the quiet and noisy test sets, respectively, and it similarly outperforms the DNN systems. Combined with multiframe prediction, cFSMN achieves lower FLOPS without any performance decline.
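The AUC figures above summarize the trade-off between the two error rates: sweeping the detection threshold traces a curve of false reject rate against false alarm rate, and the area under it (lower is better) scores the whole curve at once. A generic sketch of that metric, not the paper's exact tooling:

```python
import numpy as np

def kws_auc(pos_scores, neg_scores):
    """Area under the false-reject vs. false-alarm curve (lower is better).
    pos_scores: detector scores on audio containing the keyword
    neg_scores: detector scores on negative audio"""
    thresholds = np.sort(np.concatenate([pos_scores, neg_scores]))
    # At each threshold: positives scoring below it are falsely rejected,
    # negatives scoring at or above it are false alarms.
    fr = np.array([(pos_scores < th).mean() for th in thresholds])
    fa = np.array([(neg_scores >= th).mean() for th in thresholds])
    order = np.argsort(fa)                 # integrate along the FA axis
    fa, fr = fa[order], fr[order]
    # Trapezoidal integration of FR over FA.
    return float(np.sum(np.diff(fa) * (fr[1:] + fr[:-1]) / 2))
```

A detector that perfectly separates positives from negatives scores 0, since some threshold achieves zero false rejects and zero false alarms simultaneously.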