This article is part of the Academic Alibaba series and is taken from the ICASSP 2019 paper entitled “Sequential Matching Model for End-to-End Multi-Turn Response Selection” by Qian Chen and Wen Wang. The full paper can be read here.
For e-commerce platforms, smart dialogue systems offer a powerful way to automate services and drive purchases, making end-to-end solutions that can converse with customers a key focus of their development. Among such capabilities, multi-turn response selection enables systems to respond to users based not only on their most recent statement but also on the entire exchange leading up to it, effectively accounting for implied meanings carried over from previous “turns” of a dialogue. While this better captures the semantics of human speech, it also requires systems to recall and process many expressions simultaneously, which makes it hard to settle on an optimal design among a range of competing approaches.
Now, researchers at Alibaba have advanced a sequence-based approach that significantly outperforms all previous multi-turn models. Rather than advancing an original method, their work further develops one widely regarded as inferior to more recent hierarchy-based models, beating the latter’s state-of-the-art performance in trials with benchmark datasets drawn from real-world scenarios.
Competing Retrieval Models
Like hierarchy-based models, sequence-based models belong to the broader category of retrieval-based dialogue systems, which work to select the best response from a wide array of predefined options.
Hierarchy-based methods use neural networks to explicitly model the interactions and contextual relationships between the different expressions in a dialogue. These networks are highly complex and involve inherent tradeoffs, especially where models require truncating each user expression to a set maximum length. On the one hand, setting this maximum to a relatively long sentence length introduces high compute and memory costs; on the other, setting it to a shorter length risks cutting out important information. Despite these drawbacks, researchers have tended to claim these methods as superior to sequence-based methods, successfully applying them in the most recent state-of-the-art systems.
To dispel this bias against sequence-based models, Alibaba’s researchers incorporated an ESIM (Enhanced Sequential Inference Model) design originally meant to determine whether a hypothesis sentence can be inferred from a premise sentence. Their proposed model further uses three components for input encoding, local matching, and matching composition, respectively, effectively adapting the ESIM from a natural language inference context to support e-commerce dialogue systems.
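The local matching step can be illustrated with a small NumPy sketch. It assumes the standard ESIM formulation (dot-product soft alignment between the two encoded sequences, followed by concatenating each vector with its aligned counterpart, their difference, and their element-wise product); the article itself does not spell out these details. The names `local_matching` and `pool` are hypothetical, random arrays stand in for the encoder outputs, and the pooling function only hints at matching composition, which would normally include a recurrent layer before pooling.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_matching(context_enc, response_enc):
    """Soft-align two encoded sequences (assumed ESIM-style).

    context_enc: (m, d) array, one row per context token.
    response_enc: (n, d) array, one row per response token.
    """
    scores = context_enc @ response_enc.T                       # (m, n)
    # Each context token attends over all response tokens.
    context_aligned = softmax(scores, axis=1) @ response_enc    # (m, d)
    # Each response token attends over all context tokens.
    response_aligned = softmax(scores, axis=0).T @ context_enc  # (n, d)

    def enhance(x, aligned):
        # Concatenate original, aligned, difference, and product.
        return np.concatenate([x, aligned, x - aligned, x * aligned], axis=-1)

    return (enhance(context_enc, context_aligned),
            enhance(response_enc, response_aligned))

def pool(matched):
    # Simplified stand-in for matching composition: max and mean pooling.
    return np.concatenate([matched.max(axis=0), matched.mean(axis=0)])

rng = np.random.default_rng(0)
context_enc = rng.normal(size=(5, 4))   # placeholder for encoder output
response_enc = rng.normal(size=(3, 4))
matched_c, matched_r = local_matching(context_enc, response_enc)
features = np.concatenate([pool(matched_c), pool(matched_r)])
print(matched_c.shape, matched_r.shape, features.shape)
```

In a full model, the pooled feature vector would feed a classifier that scores how well the candidate response matches the context.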
To evaluate the enhanced sequence-based model, researchers used the benchmark Ubuntu and E-commerce datasets, drawn from Ubuntu Internet Relay Chat logs and real customer conversations with Taobao customer service staff, respectively. For both datasets, the multi-turn context was concatenated, and two special tokens (“__eou__” and “__eot__”) were inserted to indicate “end of utterance” and “end of turn”.
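The concatenation step might look roughly like the sketch below. It assumes the tokens are written `__eou__` and `__eot__`, that a turn may hold several utterances, and that each marker is appended after the unit it closes; the function name `concatenate_context`, the whitespace tokenization, and the sample dialogue are hypothetical.

```python
EOU = "__eou__"  # end of utterance
EOT = "__eot__"  # end of turn

def concatenate_context(turns):
    """Flatten a multi-turn context into one token sequence.

    turns: list of turns, where each turn is a list of utterance strings.
    The marker placement here is an assumption for illustration.
    """
    tokens = []
    for turn in turns:
        for utterance in turn:
            tokens.extend(utterance.split())
            tokens.append(EOU)
        tokens.append(EOT)
    return tokens

# Hypothetical two-turn dialogue.
context = [
    ["my wifi stopped working", "any ideas ?"],
    ["which ubuntu version are you on ?"],
]
print(" ".join(concatenate_context(context)))
```

Flattening the dialogue this way lets a sequence-based model treat the whole context as a single long input, with the markers preserving where utterances and turns end.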
In tests, the model faced off against competitors in three groups. The first consisted of sentence encoding-based methods, in which hand-crafted or neural network features were used to encode both context and response, and a cosine or MLP classifier was used to score the relationship between them. The second group included other sequence-based matching models, which generally outperformed the first group. Finally, the third group included the heavily favored hierarchy-based models. As hoped, the proposed ESIM model outperformed all models in all groups, countering the claim that the third group of models alone could support top-performing systems.
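As a rough illustration of the first group, the sketch below encodes the context and each candidate response separately and ranks candidates by cosine similarity. Term-frequency counts are an assumed stand-in for hand-crafted features, and the names `tf_vector`, `cosine`, and `select_response` are hypothetical; the baselines in the paper may use different features.

```python
import math
from collections import Counter

def tf_vector(text):
    # Hand-crafted feature (assumed): lowercase term-frequency counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_response(context, candidates):
    # Encode context and responses independently, then pick the best match.
    ctx = tf_vector(context)
    return max(candidates, key=lambda c: cosine(ctx, tf_vector(c)))

# Hypothetical context and candidate pool.
context = "my wifi driver stopped working after the update"
candidates = [
    "try reinstalling the wifi driver",
    "i like green tea",
]
print(select_response(context, candidates))
```

Because context and response never interact before scoring, this style of model cannot align individual words across the two, which is the gap that matching models such as ESIM aim to close.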
Following these results, Alibaba’s researchers hope to further explore the potential for applying external knowledge such as user profile data in multi-turn response selection.