This article is part of the Academic Alibaba series and is taken from the paper entitled “Impression Allocation for Combating Fraud in E-commerce Via Deep Reinforcement Learning with Action Norm Penalty” by Mengchen Zhao, Zhao Li, Bo An, Haifeng Lu, Yifan Yang, and Chen Chu, accepted by IJCAI 2018. The full paper can be read here.
The rise of e-commerce has changed the way people shop, putting everything customers could possibly need (and some things they probably do not) at their fingertips. But, with the wide range of products available, how do e-commerce platforms determine which products to show customers?
The simple answer is popularity. When a buyer searches for a keyword on an e-commerce platform, the website retrieves related items and arranges them according to the probability that the customer will purchase a given item. In effect, platforms funnel buyer impressions to products most likely to make money.
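The popularity-based funneling described above can be illustrated with a minimal sketch. The function name, the proportional-allocation rule, and the example catalog are all illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch of popularity-based impression allocation:
# rank retrieved items by estimated purchase probability, then
# funnel impressions toward the items most likely to sell.

def allocate_impressions(items, total_impressions):
    """items: list of (item_id, purchase_probability) tuples."""
    ranked = sorted(items, key=lambda it: it[1], reverse=True)
    # Allocate impressions proportionally to purchase probability.
    total_prob = sum(p for _, p in ranked)
    return [(item_id, round(total_impressions * p / total_prob))
            for item_id, p in ranked]

catalog = [("mug", 0.02), ("phone_case", 0.08), ("cable", 0.05)]
print(allocate_impressions(catalog, 1000))
# → [('phone_case', 533), ('cable', 333), ('mug', 133)]
```

The key point is that impressions concentrate on whichever items score highest, which is exactly the signal fraudulent transactions try to manipulate.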
Faking It Till They Make It
While buyer impressions can be legitimately obtained through methods such as advertising, some sellers have found a way to game the impression allocation system: fraudulent transactions. These sellers control numerous buyer accounts and use them to purchase their own products, artificially inflating their popularity and capturing more buyer impressions.
These fraudulent behaviors compromise the effectiveness of impression allocation and the business environment of the platform. Alibaba’s tech team, in collaboration with researchers from Singapore’s Nanyang Technological University, has developed a method to simultaneously maintain a sales platform’s profit and reduce fraudulent sales.
A recent line of work introduces deep reinforcement learning to e-commerce mechanism design. However, these mechanisms have seen little real-world use because of the massive scale of most e-commerce platforms: their approaches cannot effectively scale up to an environment with potentially millions of sellers. Additionally, these approaches apply the deep deterministic policy gradient (DDPG) algorithm directly with a softmax output layer, which normalizes the output into a probability distribution and makes the allocation of buyer impressions smoother. In practice, however, the products on the first few pages account for most buyer impressions, so the true distribution of buyer impressions is quite sharp.
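The smoothing effect of a softmax output can be seen in a few lines. The scores below are illustrative values, not numbers from the paper:

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores for five products.
scores = [3.0, 2.5, 2.0, 1.0, 0.0]
allocation = softmax(scores)

# Softmax keeps every product's share well above zero, so the
# resulting allocation is smoother than the sharp, head-heavy
# distribution observed on real platforms.
print([round(a, 3) for a in allocation])
```

Even the lowest-scoring product retains a visible share of impressions, which is at odds with the sharp first-page concentration seen in practice.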
Taking Action against Fraud with ANP
Building on the DDPG framework, the research team developed a modified algorithm called DDPG-ANP, or DDPG with action norm penalty. In this method, the norm of the agent's action is added to the reward function as a penalty, which facilitates learning in an unbounded action space and eliminates the need for DDPG's softmax layer.
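The reward-shaping idea can be sketched as follows. This is a minimal illustration of an action norm penalty, not the paper's exact formulation; the penalty coefficient `lam` and the squared-L2 choice of norm are assumptions:

```python
# Hedged sketch of an action norm penalty: subtract the squared L2
# norm of the agent's action from the raw reward, discouraging
# unboundedly large actions without constraining the output through
# a softmax layer. The coefficient `lam` is an illustrative assumption.

def penalized_reward(raw_reward, action, lam=0.01):
    action_norm_sq = sum(a * a for a in action)
    return raw_reward - lam * action_norm_sq

# A large-magnitude action receives a bigger penalty:
small = penalized_reward(10.0, [1.0, 2.0])    # 10 - 0.01 * 5   = 9.95
large = penalized_reward(10.0, [10.0, 20.0])  # 10 - 0.01 * 500 = 5.0
print(small, large)
```

Because the penalty grows with the action's magnitude, the agent learns to keep its actions bounded on its own, which is what allows the softmax layer to be dropped.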
In experimental settings, the actor network in DDPG had substantially more parameters and required a longer training time than DDPG-ANP. This is because the former has a high-dimensional action space, whereas the latter's action space is fixed in size.
Not only does DDPG-ANP outperform DDPG and heuristic approaches in terms of scalability and solution quality, but it also combines fraud prevention and buyer impression allocation, improving e-commerce platforms for both buyers and legitimate sellers.
The full paper can be read here.