Next-Gen Text Generation: Alibaba Makes Progress on the KL-Vanishing Problem

This article is part of the Academic Alibaba series and is taken from the ICASSP 2019 paper entitled “Improve Diverse Text Generation by Self Labeling Conditional Variational Auto Encoder” by Yuchi Zhang, Yongliang Wang, Liping Zhang, Zhiqiang Zhang, and Kun Gai. The full paper can be read here.

In conventional manufacturing, a machine that can make the same product over and over again meets the basic standards of automation. In the changing world of neural networks, though, automated systems often need to generate results that are as diverse as they are accurate in order to meet demands for variety in settings like e-commerce.

As one such field, text generation has the potential to automate tasks traditionally reserved for human workers, such as creating marketing copy from information about products sold online. While these encoder-decoder systems have become highly accurate at capturing product details, they continue to lag in the variety of ways they can express them, a challenge known to developers as the one-to-many problem. Adding to the complexity, efforts to introduce variables that encourage diverse responses have run into a peculiar problem of their own: in a phenomenon known as KL-vanishing, decoders tend to model their targets without making use of these variables, so the generated text stays within a narrow range.

Now, Alibaba researchers have proposed a Conditional Variational Auto Encoder (CVAE) model with a labeling network that guides the encoder to the ideal parameter for a desired response range. In tests with both human and automatic quantitative evaluations, the model has so far matched baseline models in accuracy while clearly outperforming them at generating a variety of viable responses.

Beyond Heuristics: Tackling the KL-Vanishing Problem

KL-vanishing is not an inherent challenge of text generation; it is specific to the models that emerged to solve more basic problems in the field. Whereas early models like SEQ2SEQ can only encode a given input pattern to a single set of representative vectors (limiting diversity), more recent Variational Auto Encoder (VAE) and Conditional VAE (CVAE) models have sought to improve diversity through an intermediate latent variable whose various configurations correspond to possible text responses. The KL-vanishing problem occurs when these different configurations collapse into a single decoding distribution, concentrating their respective responses within a narrow range (i.e., causing a high degree of textual similarity).

An illustration of the KL-vanishing problem. Latent variable configurations in the encoder (denoted by “z”) should ideally lead to diverse responses (denoted by “x”), as shown in (a), but instead collapse into a single decoding distribution as shown in (b).
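To make the name concrete: VAE-style models are trained with an objective that includes a KL divergence between the encoder's distribution over z and a prior. The sketch below is a minimal illustration, assuming diagonal Gaussian distributions (a common choice that this article does not itself specify); the function name `gaussian_kl` and the example numbers are hypothetical. It shows that when the encoder's output becomes identical to the prior, the KL term drops to zero and z no longer carries any information about the response.

```python
import math

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    Assumption: both distributions are diagonal Gaussians parameterized
    by a mean and a log-variance per latent dimension.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

prior_mu, prior_logvar = [0.0, 0.0], [0.0, 0.0]

# Informative encoder output: z differs from the prior, so it can steer the decoder.
kl_informative = gaussian_kl([1.5, -0.8], [-1.0, -1.0], prior_mu, prior_logvar)

# Collapsed encoder output: identical to the prior, so the KL term vanishes.
kl_collapsed = gaussian_kl([0.0, 0.0], [0.0, 0.0], prior_mu, prior_logvar)

print("informative KL:", kl_informative)
print("collapsed KL:", kl_collapsed)
```

A KL value near zero during training is therefore a practical warning sign that the decoder has learned to ignore the latent variable.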

Previous work has sought to resolve this problem through heuristic approaches, either by weakening the decoder or by strengthening the encoder arbitrarily, but has struggled to gauge how weak or strong these components should be.

Rather than making arbitrary adjustments, Alibaba’s researchers sought a mechanism, in the form of a labeling network module, that would automatically determine the optimal encoder parameter for a desired range of responses. In short, the module works backward from a target output “x” in the decoder to determine the corresponding latent variable “z” that will generate the desired response range; it identifies this inverse image of x by estimating the effectiveness label of z. This auto-labeling approach closely approximates the ideal encoder for the decoder currently in use in a given task.

Technical overview of the proposed model
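The idea of working backward from a target x to its inverse image z can be illustrated with a toy example. The sketch below is not the paper's labeling network, which is a learned module; it is a hypothetical stand-in that assumes a fixed linear decoder and searches for the matching z by plain gradient descent. The names `decode` and `label_latent`, the weights, and the step settings are all illustrative assumptions.

```python
def decode(z, weights, bias):
    """Hypothetical linear decoder mapping a latent vector z to an output vector x."""
    return [sum(w * zi for w, zi in zip(row, z)) + b for row, b in zip(weights, bias)]

def label_latent(target, weights, bias, dim, steps=500, lr=0.05):
    """Search for the z whose decoding is closest to the target x.

    The decoder is held fixed; only z is updated, using the gradient of
    the squared reconstruction error with respect to z.
    """
    z = [0.0] * dim
    for _ in range(steps):
        residual = [o - t for o, t in zip(decode(z, weights, bias), target)]
        grad = [2.0 * sum(r * row[j] for r, row in zip(residual, weights))
                for j in range(dim)]
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return z

# Toy decoder with a 2-dimensional latent space and a 3-dimensional output.
weights = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

true_z = [0.5, -1.0]
target_x = decode(true_z, weights, bias)

recovered_z = label_latent(target_x, weights, bias, dim=2)
print("recovered z:", recovered_z)
```

In this toy setting the recovered z can serve as a training label for an encoder, which is the general spirit of self labeling: the target for the encoder is defined by the decoder currently in use rather than chosen by hand.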

Human-Approved Performance

To test the proposed model, researchers used two datasets reflecting different criteria for text generation, creating one of these from scratch.

For the EGOODS dataset, researchers gathered a large-scale corpus of item descriptions from Chinese e-commerce platforms in which each item featured one description provided by the seller and multiple third-party recommendation phrases meant to improve its commercial appeal. This provided a large and naturally one-to-many dataset for tests of the models’ ability to generate recommendation texts.

By contrast, the previously existing Daily Dialog dataset provided a basis for tests with information better suited to one-to-one problems, due to inherent qualities of the question-response patterns of its dialogue contents. This set was applied in tests of models’ ability to formulate ten responses to a given question in a dialogue context.

In trials, models competed on metrics for accuracy and diversity, for which researchers developed automatic quantitative measures based on BLEU-recall metrics. The models further underwent human evaluation based on average scores from seven subject matter experts who ranked the fluency, coherence, and diversity of their responses.
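The article does not spell out how the BLEU-based measures are computed, so the following is only a sketch of one common formulation from the dialogue-generation literature: precision averages, over generated responses, the best match against any reference, while recall averages, over references, the best match from any generated response. The scoring function `simple_bleu` is a simplified stand-in for BLEU (clipped n-gram precision with a brevity penalty), and all names and example sentences are hypothetical.

```python
from collections import Counter
import math

def simple_bleu(hypothesis, reference, n=1):
    """Simplified stand-in for BLEU-n: clipped n-gram precision times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return brevity * clipped / total

def bleu_precision(hypotheses, references, n=1):
    """Average over hypotheses of the best score against any reference (accuracy)."""
    return sum(max(simple_bleu(h, r, n) for r in references) for h in hypotheses) / len(hypotheses)

def bleu_recall(hypotheses, references, n=1):
    """Average over references of the best score from any hypothesis (coverage)."""
    return sum(max(simple_bleu(h, r, n) for h in hypotheses) for r in references) / len(references)

references = ["soft cotton shirt for summer", "a light shirt that keeps you cool"]
diverse = ["soft cotton shirt for summer", "a light shirt that keeps you cool"]
repetitive = ["soft cotton shirt for summer", "soft cotton shirt for summer"]

for name, outputs in [("diverse", diverse), ("repetitive", repetitive)]:
    print(name, bleu_precision(outputs, references), bleu_recall(outputs, references))
```

Under this formulation a model that repeats one good response keeps a high precision but loses recall, because it leaves some of the reference responses uncovered. That is why a recall-style measure is a natural proxy for diversity.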

Results of human evaluation of the proposed model (bottom) and three baseline models

Results indicate that the proposed model outperforms all others on the automatic measures, which use four BLEU metrics. Most notably, results from human evaluators indicate that while it closely matches the baseline models on accuracy criteria, the model significantly outperforms them in the diversity of its responses.

