Virtualizing China’s Biggest Online Marketplace for Training Reinforcement Learning

This article is part of the Academic Alibaba series and is taken from the paper entitled “Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning” by Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. The full paper can be read here.

Reinforcement learning methods offer huge potential for complex user environments, but they are difficult to apply in many real-world settings because they require training in a live system. This risks compromising system functionality or user experience, putting money, time, and, in the case of hospital networks, even people’s lives and well-being at stake.

While fields such as gaming, robotics, and natural language processing have been relatively receptive to reinforcement learning, large online systems have largely been limited to supervised approaches. Supervised machine learning poses fewer immediate risks to the system, but it cannot learn the sequential decision-making needed to maximize long-term rewards.

This was the dilemma faced by Alibaba when looking to use machine learning to improve the commodity search function of their e-commerce platform Taobao. Their solution was to simulate the live-system testing environment by building “Virtual Taobao” — a like-for-like replica of the platform, complete with virtual users created from real historical data. Results so far have shown that the simulation can train significantly better search engine policies than the previously used supervised learning approaches, while real-life Taobao is shielded from the adverse effects of training.

The idea of using simulations to provide a realistic but safe training environment for reinforcement learning is not completely new. Google, for instance, applied this approach for its data center cooling facilities, using neural networks that approximated the real system’s dynamics.

Like Google’s cooling facilities, Taobao cannot afford to be exposed to unpredictable live testing even for a short time. During normal operation, Taobao’s search engine must respond within milliseconds while sorting through billions of commodities to generate a page view (PV) to show the customer. The customer’s subsequent behavior, such as making a purchase, moving to the next page, or leaving the site, then provides a feedback signal. Based on the results generated by a given PV, the search engine updates its decision policy for that scenario, contributing to the overall evolution of the system’s strategy for displaying PVs.
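The loop described above (the engine shows a PV, the customer reacts, and the reaction becomes the feedback signal) can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not Taobao's system: `ToyCustomer`, `random_engine`, the score threshold, and the page size are all hypothetical names and values invented for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical feedback signals, named after the three behaviors in the text.
BUY, NEXT_PAGE, LEAVE = "buy", "next_page", "leave"


@dataclass
class ToyCustomer:
    """Made-up customer: buys when any displayed item score clears a threshold."""
    threshold: float
    patience: int  # maximum number of pages the customer is willing to view

    def respond(self, page_view, page_index):
        if max(page_view) >= self.threshold:
            return BUY
        if page_index + 1 < self.patience:
            return NEXT_PAGE
        return LEAVE


def random_engine(page_index, rng, page_size=3):
    """Placeholder engine policy: item scores drawn uniformly at random."""
    return [rng.random() for _ in range(page_size)]


def run_session(rank_page, customer, rng):
    """Roll out one search session; return the list of (page_view, feedback)."""
    trajectory = []
    page_index = 0
    while True:
        page_view = rank_page(page_index, rng)               # engine action: one PV
        feedback = customer.respond(page_view, page_index)   # customer reaction
        trajectory.append((page_view, feedback))
        if feedback != NEXT_PAGE:                            # buy or leave ends the session
            return trajectory
        page_index += 1


rng = random.Random(0)
session = run_session(random_engine, ToyCustomer(threshold=0.9, patience=4), rng)
for page_view, feedback in session:
    print([round(score, 2) for score in page_view], feedback)
```

A reinforcement learning agent would replace `random_engine` and use the recorded trajectories to update its policy; the point of a simulator is that these rollouts cost nothing on the live site.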

Taobao search in engine view and in customer view

Rather than risk impacting shopping experiences on Taobao, Alibaba applied two adapted simulation frameworks and real historical Taobao data to create a parallel platform closely resembling the original. The developers first adapted a generative adversarial network (GAN), in which a generator produces simulated customers while a discriminator learns to differentiate between real and simulated customer behavior inputs. The resulting generative adversarial network for simulating distribution (GAN-SD) enabled them to feed the search engines a more realistic pool of simulated searches and result responses than would ordinarily be possible.
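For readers unfamiliar with GANs, the standard two-player objective is reproduced here as background. The symbols are the usual textbook ones and do not come from this article: $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the distribution of real customers, and $p_z$ a noise prior. GAN-SD is described as an adaptation of this setup; the specific changes are given in the paper and are not reproduced here.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator is rewarded for labeling real samples as real and generated samples as fake, while the generator is rewarded when its samples are mistaken for real ones.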

Having set up a desired customer distribution, the team then used a multi-agent adversarial imitation learning (MAIL) mechanism to train simulated customer policies and engine policies against each other in a zero-sum game framework, ensuring that the customer policy would be generalizable for different engine policies.
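To make the adversarial imitation idea more concrete, the sketch below shows one common convention from the adversarial imitation learning literature: the discriminator's estimate that an interaction is real is turned into a surrogate reward for the simulated policy. This is an assumption for illustration only; the article does not state which reward form MAIL uses, and the function names, the clamping constant, and the discount factor are hypothetical.

```python
import math


def imitation_reward(d_real_prob, eps=1e-8):
    """Surrogate reward from the discriminator's probability that a step is real.

    Assumed convention: reward = -log(1 - D). It grows as the discriminator
    becomes more convinced that the simulated interaction is real.
    """
    p = min(max(d_real_prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(1.0 - p)


def trajectory_return(d_probs, gamma=0.99):
    """Discounted sum of surrogate rewards along one simulated session."""
    total = 0.0
    for t, p in enumerate(d_probs):
        total += (gamma ** t) * imitation_reward(p)
    return total


# A session the discriminator finds convincing scores higher than one it rejects.
print(trajectory_return([0.9, 0.8, 0.9]) > trajectory_return([0.1, 0.2, 0.1]))
```

Under this convention, the simulated policy is pushed toward behavior the discriminator cannot tell apart from the historical data, while the discriminator is trained in the opposite direction.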

With the introduction of the GAN-SD and MAIL simulation tools, Alibaba was able to imitate the spontaneity of real-time Taobao activity while deliberately training its engines to do better in such scenarios. Based on empirical measurements of total turnover, total volume, and rate of purchase page (R2P), reinforcement learning on Virtual Taobao demonstrated a 3% improvement in strategy over traditional supervised learning methods, with better generalization over time than pure behavior cloning approaches to simulation.
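As a rough illustration of the three measurements named above, the sketch below computes them from a list of PV records. The definitions are assumptions made for the example (turnover as total revenue, volume as units sold, and rate of purchase page as the share of PVs that led to a purchase), and the record keys `units` and `revenue` as well as the sample numbers are hypothetical, not data from the paper.

```python
def summarize(page_views):
    """Compute turnover, volume, and purchase-page rate from PV records.

    Each record is a dict with hypothetical keys:
      'units'   - number of items bought from that PV (0 if none)
      'revenue' - money spent on that PV
    """
    total_turnover = sum(pv["revenue"] for pv in page_views)
    total_volume = sum(pv["units"] for pv in page_views)
    purchase_pages = sum(1 for pv in page_views if pv["units"] > 0)
    rate = purchase_pages / len(page_views) if page_views else 0.0
    return {"turnover": total_turnover, "volume": total_volume, "r2p": rate}


def relative_improvement(new_value, baseline):
    """Fractional change of a metric relative to a baseline policy."""
    return (new_value - baseline) / baseline


# Made-up sample: four PVs, two of which ended in a purchase.
sample = [
    {"units": 0, "revenue": 0.0},
    {"units": 2, "revenue": 30.0},
    {"units": 1, "revenue": 12.5},
    {"units": 0, "revenue": 0.0},
]
print(summarize(sample))
```

Comparing two policies then amounts to calling `relative_improvement` on the same metric measured under each of them.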

The customer distributions between Taobao and the Virtual Taobao
The R2P distributions between Taobao and the Virtual Taobao

These results suggest that simulation may be a useful means of applying reinforcement learning in other situations where complex physical environments have traditionally prohibited direct application.

