This article is part of the Academic Alibaba series and is taken from the AAAI 2019 paper entitled “Deep Cascade Multi-task Learning for Slot Filling in Online Shopping Assistant” by Yu Gong, Xusheng Luo, Yu Zhu, Wenwu Ou, Zhao Li, Muhua Zhu, Kenny Q. Zhu, Lu Duan, and Xi Chen. The full paper can be read here.
Sometimes, an unusually difficult problem inspires an exceptionally novel solution. For Alibaba, efforts to improve Natural Language Understanding (NLU) for its online shopping assistants involved semantic challenges unique to the Chinese language. Now, the results show how a stronger mechanism for grasping its famously challenging syntax can boost systems’ core ability to extract meaning from shopper requests.
In e-commerce, NLU systems can help automate services like product recommendations, after-sale service, and complaint processing by way of online shopping assistants. At their core are slot filling systems that extract word-level semantics from user requests to place in slots in a predefined framework. Meanwhile, shoppers’ everyday needs involve a range of products and variable attributes such as brand, color, and style far beyond standard training data. Factoring in the unique syntax of Chinese expressions, this strongly challenges systems’ ability to filter varied inputs into simple semantic categories.
To overcome these difficulties, researchers at Alibaba have developed a novel deep cascade multi-task learning model for slot filling, as well as an original E-commerce Shopping Assistant (ECSA) dataset for training and evaluation. With the key innovation of Deep Neural Network-based cascade and residual connections that serve as shortcuts between high- and low-level tasks, the model has already outperformed previous state-of-the-art approaches to enter online production in the Alibaba ecosystem.
Slot Filling for Online Shopping: Beyond Sequence Labeling
Slot filling for e-commerce NLU involves sorting user inputs based on three key types of linguistic expressions: category, property key, and property value.
Category (CG) terms are words for general item categories like “dress” or “t-shirt”; property key (PK) refers to descriptive attribute names like “brand” or “color”; and property value (PV) refers to actual specifics such as brand names, colors, and other adjectives that describe items. In addition, complex scenarios require labeling each input word with one of the position qualifiers “begin”, “inside”, and “outside” (B/I/O) to mark where it falls within a slot, combined with the slot type that specifies how the word is used, such as in a brand or color description. “O” indicates that a word carries no semantic meaning the system should consider.
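As a minimal sketch of this labeling scheme, the snippet below tags a hypothetical (invented, English-language) query with combined B/I/O position qualifiers and CG/PV slot types, then groups the tagged tokens back into slot phrases. The example query and helper function are illustrative assumptions, not the paper’s actual data or code.

```python
# Hypothetical tokenized query with B/I/O + slot-type labels:
# "red" and "silk" are property values (PV), "dress" is a category (CG).
tokens = ["red", "silk", "dress"]
tags = ["B-PV", "B-PV", "B-CG"]

def spans(tokens, tags):
    """Group B-/I- tagged tokens into (slot_type, phrase) pairs; skip "O" tags."""
    result, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new slot phrase begins
            if current:
                result.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current:
            current[1].append(tok)        # continue the current slot phrase
        else:                             # "O": word outside any slot
            if current:
                result.append(current)
            current = None
    if current:
        result.append(current)
    return [(slot, " ".join(words)) for slot, words in result]

print(spans(tokens, tags))  # [('PV', 'red'), ('PV', 'silk'), ('CG', 'dress')]
```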
Previously, sequence labeling approaches have centered on BiLSTM-CRF (Bi-directional Long Short-Term Memory with Conditional Random Field) models, with the standard ATIS dataset as the go-to resource for training and testing. Beyond the model’s own limitations, this approach suffers from the simplicity of the ATIS dataset, which is based on airline travel scenarios and focuses almost entirely on time and location data. With such a narrow dataset, it becomes effectively impossible to confirm any advantage for models that depart from traditional end-to-end sequence labeling.
As a Chinese-language platform, Alibaba recognized a similarity between the challenges of complex e-commerce terminology and the unsegmented structure of expressions written in Chinese characters. Building on previous multi-task learning models, which share parameters among different tasks, researchers introduced a cascade connection and a residual connection to serve as shortcuts in a more closely coupled, efficient model. This enabled the extraction of two lower-level syntactic labeling tasks, named entity tagging and segment tagging, that together allow the slot filling system to make syntactic sense of input expressions before proceeding to semantic analysis.
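The flow of information through such a cascade can be sketched numerically: a shared encoding feeds the segment tagging task, whose features cascade into named entity tagging, whose features in turn feed slot filling, with a residual shortcut carrying the shared encoding directly to the top task. The dimensions, random weights, and plain feed-forward layers below are simplifying assumptions; the paper’s actual model uses BiLSTM encoders and trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w):
    """A toy feed-forward layer standing in for the paper's BiLSTM encoders."""
    return np.tanh(x @ w)

n_tokens, d = 5, 8  # hypothetical sentence length and hidden size
shared = dense(rng.normal(size=(n_tokens, d)), rng.normal(size=(d, d)))

# Low-level syntactic tasks, cascaded: named entity tagging
# consumes the segment tagging features.
seg_feat = dense(shared, rng.normal(size=(d, d)))
ner_feat = dense(seg_feat, rng.normal(size=(d, d)))

# High-level slot filling: cascade input from NER plus a residual
# shortcut straight from the shared encoding, giving low-level
# information a direct path to the semantic task.
slot_feat = dense(ner_feat + shared, rng.normal(size=(d, d)))
print(slot_feat.shape)  # (5, 8)
```

The residual addition is what lets gradients and low-level features bypass the intermediate tasks, which is the “shortcut” role the article describes.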
Building A Better Dataset
To improve on the ATIS dataset of just 4,978 training and 893 testing sentences, researchers used an unsupervised method to automatically tag input expressions from actual user logs on Alibaba’s e-commerce platforms. To evaluate the proposed model’s ability to generalize, the team randomly split the dataset’s dictionary into three parts, using one for testing and two for training. This prevented the model from simply memorizing dictionary entries and reusing them at test time. In total, the model encountered 24,892 training pairs and 2,723 testing pairs.
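The three-way dictionary split can be sketched as follows; the entry names here are invented placeholders, and the exact shuffling procedure is an assumption rather than the paper’s published protocol.

```python
import random

# Hypothetical dictionary of slot values (placeholder names).
dictionary = [f"entry_{i}" for i in range(9)]

rng = random.Random(42)
rng.shuffle(dictionary)

# Split into three roughly equal parts at random.
parts = [dictionary[i::3] for i in range(3)]
train_dict = parts[0] + parts[1]  # two parts for training
test_dict = parts[2]              # one held-out part for testing

# Held-out entries never appear in training, so the model cannot
# score well just by memorizing the dictionary.
assert not set(train_dict) & set(test_dict)
print(len(train_dict), len(test_dict))  # 6 3
```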
Online tests with the model showed an overall 130% improvement in accuracy and a 14.6% advantage in F1 score over strong baseline models. Based on these results, the model has now entered online operation in the Alibaba Group’s production environment.