Unlocking Insights from Multi-Round Searches with Reinforcement Learning

Mar 20, 2018

Alibaba’s new approach for ranking search results

This article is part of the Academic Alibaba series and is taken from the paper entitled “Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application” by Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. The full paper can be read here.

Learning to rank (LTR) methods have been widely applied by E-commerce platforms as a solution to ranking search results. To date, such methods tend to consider the different steps in a user’s search session to be independent of each other. However, research by the Alibaba tech team in collaboration with Nanjing University indicates that the different ranking steps in a session are in fact closely correlated and this is an important factor for ranking that, until now, has not been fully investigated.

Search engines are a fundamental tool for E-commerce platforms such as AliExpress or Amazon. They are the first tool users turn to in order to search for items to purchase, browse product information and make comparisons. After a user inputs a query, the search engine must rank the results in an order that suits that user. In practice, this involves assigning each item in a search result a score and sorting the items accordingly.
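As a minimal sketch of this score-and-sort step, assume each item is described by a small feature vector and its score is a weighted sum of those features. The item names, features and weights below are invented for illustration and are not taken from the paper; a real ranker would learn its weights from data.

```python
from typing import Dict, List, Tuple


def rank_items(items: Dict[str, List[float]], weights: List[float]) -> List[Tuple[str, float]]:
    """Score each item as a weighted sum of its features, then sort high to low."""
    scored = [
        (name, sum(w * x for w, x in zip(weights, feats)))
        for name, feats in items.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical features: [text relevance, historical click rate, price competitiveness]
items = {
    "phone_case_a": [0.9, 0.2, 0.5],
    "phone_case_b": [0.7, 0.8, 0.4],
    "phone_case_c": [0.4, 0.1, 0.9],
}
weights = [0.5, 0.3, 0.2]  # assumed weights, for illustration only

ranking = rank_items(items, weights)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these made-up numbers, phone_case_b ranks first because its high click rate outweighs its slightly lower text relevance.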

Ranking the items on search result pages is a typical multi-step decision-making problem, which proceeds as follows:

1. The user inputs a query to the search engine.

2. The search engine ranks the items related to the query and displays the top items.

3. The user takes an action, such as clicking an item, buying an item, requesting a new page for the same query, or ending the search session.

4. If a new page is requested, the search engine re-ranks the remaining items and displays the top items on a new page.

This cycle of user action and re-ranking repeats until the user makes a purchase or exits the search session. The actions that a user takes during the session can indicate personal preferences for different items, and these preferences can be used to develop a ranking function that meets the user’s needs.
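The session loop described above can be sketched as a small simulation. Everything here is a simplifying assumption: `run_session`, the toy ranker and the toy user are hypothetical, clicks are left out for brevity, and the user either buys, leaves or asks for the next page.

```python
from typing import Callable, List


def run_session(
    candidates: List[str],
    rank: Callable[[List[str], List[str]], List[str]],
    user: Callable[[List[str]], str],
    page_size: int = 2,
) -> List[str]:
    """Simulate one search session and return the log of user actions."""
    remaining = list(candidates)
    history: List[str] = []
    while remaining:
        # Re-rank whatever has not been shown yet, using the actions so far.
        ordered = rank(remaining, history)
        page = ordered[:page_size]
        remaining = [item for item in remaining if item not in page]
        action = user(page)
        history.append(action)
        if action != "next_page":  # a purchase or an exit ends the session
            break
    return history


def keep_order(remaining: List[str], history: List[str]) -> List[str]:
    """Toy ranker: ignores the history and keeps the incoming order."""
    return remaining


def wants_item_c(page: List[str]) -> str:
    """Toy user: buys item 'c' if it is on the page, otherwise asks for more."""
    return "buy:c" if "c" in page else "next_page"


history = run_session(["a", "b", "c", "d"], keep_order, wants_item_c)
print(history)
```

Because the ranker receives the action history on every page, it is free to change its ordering in response to what the user has already done, which is the dependence between steps that the article goes on to discuss.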

This makes ranking a natural fit for machine learning: LTR systems learn a ranking function from training data by classification or regression. Recently, online learning techniques such as regret minimization have been introduced into the LTR domain to learn directly from user signals (such as the actions a user takes during a search session). Online LTR has some advantages over offline training: it avoids the mismatch between manually curated labels and user intent, and it avoids the high cost of creating labeled data sets.

Nonetheless, the majority of these methods model the interaction between the search engine and each user as consisting of only one round of ranking-and-feedback activity. In practice, though, data shows that a successful transaction typically involves several rounds of this multi-step process.

Incorporating this insight into their research, the team showed theoretically that maximizing accumulative rewards is necessary, which in turn indicates that the different ranking steps in a session are tightly correlated rather than independent.
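A toy calculation can illustrate why the accumulated reward of a whole session may differ from what a step-by-step view suggests. The sketch below is not the paper's analysis: the probabilities and rewards are invented, and each page is reduced to a hypothetical triple of purchase probability, probability of requesting the next page, and reward on purchase.

```python
from typing import List, Tuple

# (P(buy on this page), P(request next page), reward if the user buys)
Page = Tuple[float, float, float]


def expected_session_reward(pages: List[Page]) -> float:
    """Expected accumulative reward over all pages of one session."""
    total = 0.0
    reach_prob = 1.0  # probability that the user gets as far as this page
    for p_buy, p_next, reward in pages:
        total += reach_prob * p_buy * reward
        reach_prob *= p_next
    return total


# Policy A maximizes the first page alone, but few users continue afterwards.
myopic = expected_session_reward([(0.10, 0.30, 100.0), (0.02, 0.30, 100.0)])
# Policy B earns less on page one, but keeps users browsing and buying later.
far_sighted = expected_session_reward([(0.08, 0.70, 100.0), (0.08, 0.50, 100.0)])

print(f"myopic: {myopic:.1f}, far-sighted: {far_sighted:.1f}")
```

Under these assumed numbers, the policy with the lower first-page reward ends up with the higher expected reward for the session, which is the kind of effect that treating each ranking step independently cannot capture.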

This led the team to propose a novel reinforcement learning (RL) algorithm for learning an optimal ranking policy, one that maximizes the expected accumulative rewards in an individual search session. The algorithm is a policy gradient method designed to cope with the high reward variance and unbalanced reward distribution of a search session Markov decision process (SSMDP).
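For readers unfamiliar with policy gradient methods, the following is a generic REINFORCE-style sketch with a running-average baseline, a common way to reduce reward variance. It is not the algorithm from the paper. The setup is an assumption made for illustration: each episode stands for one session, the policy picks one of two hypothetical ranking strategies, and the reward is 1 only if the session ends in a purchase.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(purchase_prob: List[float], episodes: int = 3000,
          lr: float = 0.1, seed: int = 0) -> List[float]:
    """Learn a softmax policy over ranking strategies from sparse rewards."""
    rng = random.Random(seed)
    logits = [0.0] * len(purchase_prob)
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(episodes):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Sparse reward: 1 if the simulated session ends in a purchase, else 0.
        reward = 1.0 if rng.random() < purchase_prob[action] else 0.0
        advantage = reward - baseline
        for k in range(len(logits)):
            # Gradient of log softmax w.r.t. logit k: 1[k == action] - probs[k]
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


# Assumed purchase probabilities for two hypothetical ranking strategies.
probs = train([0.1, 0.5])
print(probs)
```

The learned policy should shift its probability mass toward the strategy with the higher purchase rate. Subtracting the baseline leaves the expected gradient unchanged while damping the swings caused by rewards that are zero most of the time.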

Tests of the algorithm in simulations and in real-life scenarios showed that it performs much better than online LTR methods on the multi-step ranking problem. The algorithm achieved growth in gross merchandise volume (GMV) of 30% in the simulation and 40% in the real-life scenario.

Written by Alibaba Tech
