The Search for Rave Reviews: Finding High-Quality User Content with AI
A new Alibaba AI solution helps community operators sort the wheat from the chaff and quickly find the best user-generated content to showcase interesting products
Honest testimonials from consumers are one of the most persuasive ways of selling a product. That is why so many online vendors will beg and plead with you to leave a positive review.
Even more persuasive for consumers is the content, both written and visual, shared by other buyers in content communities. Now a popular feature of many e-commerce platforms, content communities are a place for shoppers to share their experiences with each other on all manner of purchases, including cosmetics, sports, travel, household goods, and so on.
Onion Box is one such internet content community run by Alibaba’s Taobao. Its stated aim is to “share the good life and make life more enjoyable”, and it has indeed had a profound effect on consumption and general life for Chinese consumers. Naturally, the higher the quality of the content, the bigger the impact on the community. So, one of the core issues the Alibaba team faced when building Onion Box was how to extract high-quality content.
This was a difficult task. Because let’s be honest, most users’ photos do not look like this:
To clarify, the point here is not to judge Taobao users’ attractiveness, but to emphasize the importance of photo quality. It is easy to underestimate the amount we concentrate on the model and background, rather than the item we are considering purchasing, when we glance at a picture. For example, one’s favorable impression of the blue dress in the photo above has everything to do with the angle, focus, lighting, and graceful posture of the model, but relatively little to do with the dress itself.
This has major implications when showcasing items on content communities like Onion Box. While sellers do provide their own item photos from carefully staged photoshoots on the main product page, users much prefer browsing photos from people who actually bought and reviewed the product. Unsurprisingly, higher quality content is much more effective at attracting potential buyers.
The trouble is, most user content fails to meet the desired quality standards. Operators have to sift for hours through numerous images to find the precious few that showcase items well.
This sifting process now joins a growing list of e-commerce processes that Alibaba is solving with AI. This article introduces the problem faced by Taobao content community and the technical solution developed by the Alibaba tech team.
Photography, Sublime and Ridiculous
The often-stark contrast in photo quality between those provided by buyers and sellers has become something of a running joke online among users of Taobao and other Chinese e-commerce platforms.
Consider this jacket:
Or this skirt:
The humor is somewhat lost, however, on operators tasked with selecting high-quality user-generated content (UGC) from existing buyers to form the initial batch of content for the community. After speaking to these staff, the tech team found the problems they were facing could be summarized as follows:
· Poor-quality buyer photos
Less than 30% of buyers’ photos met quality standards for commercial use.
· Strict reviewing standards
Community operators have developed a long set of strict standards to ensure photos show products in the best light rather than do them an injustice. Photos and video clips selected must be aesthetically pleasing, stylish, well-lit, and well-composed. Additionally, the background must be tidy, the buyer’s face must be clearly visible, and there must be no obvious commercial or advertising intent in the photos.
· Heavy reviewing workload
Staff invest a lot of time and energy screening photos.
Owing to the complex criteria, low pass rate, and high workload, it was clear that UGC screening and curation was a problem that could be better solved with algorithms than with human staff.
The overall solution developed for mining quality UGC is as follows:
· “All UGC” refers to all review content uploaded to the buyer show that includes pictures or video clips.
· “Approved UGC” refers to the high-quality content that is ultimately approved.
· “Highlighted UGC” refers to the buyer photos recognized by the merchant.
· “Ordinary UGC” refers to the buyer photos that are ultimately rejected.
The core objective here is to mine high-quality, rich and diverse UGC. As shown in the above diagram, this involves a three-stage process to select qualified images and then weed out any unsuitable ones that slipped through the cracks:
· Quality assessment
· Dirty data filtering
· Irrelevant image filtering
Quality Assessment Model
When content community operators review a post from the buyer show, they decide whether or not to approve it based on an overall judgement of the quality of both the pictures and the written content. This led the team to frame the task as a classification problem.
Choosing the features
First, the team drew up a list of statistically measurable features to describe UGC, including user features, product features, and feedback features. These features were used to estimate the quality of UGC using a gradient boosting decision tree (GBDT) model, which provided preliminary proof that the UGC quality assessment task could feasibly be translated into a classification problem.
The full list of features used is as follows.
· Text length
· Median number of buyer likes
· Number of video clips
· Average image dimensions
· Average number of likes for the product
· Buyer gender
· Total volume of UGC for the product
· Number of pictures
· Average number of buyer likes
· Total number of likes for the product
· Number of likes for this UGC
· Maximum number of likes for the product
· Price of the buyer’s mobile phone
· Total number of buyer likes
· Number of paragraphs of text
· Total volume of buyer UGC
· Number of products
· Buyer operating system
· Median number of likes for the product
· Buyer age
· Total number of products sold
Conversion into a classification problem
The team’s initial assumption was that by marking data that passed the review as 1 and data that failed the review as 0, the task could be converted into a simple binary classification problem. However, when testing the system, the best results were achieved when the task was treated as a three-class problem, as follows:
· Data that passed the review was marked as 2
· Data that failed the review but was highlighted by the merchant was marked as 1
· Data that the merchant did not highlight was marked as 0
The reason is that in the binary approach, the reviewer only reviews data highlighted by the merchant and decides whether to approve it. However, a great deal of data that was never highlighted still needs to be reviewed. The three-class approach is therefore more practical and performs better.
Using the trained GBDT model, prediction was run on the All UGC data, and the system marked a total of 4 million items as high-quality UGC. Upon examination, the accuracy of these predictions was found to be about 50%. The most common issue with photos incorrectly identified as high quality was that they were not aesthetically pleasing enough, even though they closely resembled realistic scenarios.
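The three-way labeling described above can be sketched in a few lines of Python. The function name, argument names, and sample records are illustrative assumptions, not part of the original system.

```python
def label_ugc(passed_review: bool, merchant_highlighted: bool) -> int:
    """Map review outcomes to the three class labels.

    Hypothetical helper: names are assumptions made for illustration.
    """
    if passed_review:
        return 2  # approved in review
    if merchant_highlighted:
        return 1  # highlighted by the merchant but rejected in review
    return 0      # never highlighted by the merchant


# Made-up records, one per outcome.
records = [
    {"id": "a", "passed_review": True, "merchant_highlighted": True},
    {"id": "b", "passed_review": False, "merchant_highlighted": True},
    {"id": "c", "passed_review": False, "merchant_highlighted": False},
]
labels = {r["id"]: label_ugc(r["passed_review"], r["merchant_highlighted"]) for r in records}
```

Labels built this way could then be used as the target of any multi-class classifier, a GBDT included.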
Introduction of semantic features
Feedback from community operators on the data reiterated that only UGC of very high quality could truly inspire users to long for the lifestyle they associate with attractive buyer photos. The bottom line for staff was that they would rather go without UGC than go with something sub-par.
This led to the team ditching the existing training model in favor of one using a convolutional neural network (CNN), which was capable of more sophisticated picture quality evaluation.
After adding image features, the team fine-tuned a ResNet50 pre-trained on ImageNet. This proved much more successful than the original training model, improving the review pass rate by over 100%.
This time, feedback from community staff revealed that they were generally satisfied with the quality of UGC when the photos featured young women. However, this was partly because young women tend to showcase their purchases more often and produce higher-quality content in the first place. The system still had problems identifying quality photos for less popular or more specialized goods. Often, the selected photos were not up to aesthetic standards, or simply irrelevant.
The lack of aesthetic quality can be explained by the fact that a CNN is better at capturing semantic information than aesthetic information. The next step, then, was to introduce aesthetic features.
Introduction of aesthetic features
To train the model on image aesthetics, the team made use of the AVA Database. First introduced by Perronnin et al. in their paper “AVA: A Large-Scale Database for Aesthetic Visual Analysis,” the AVA Database is a database of over 250,000 images, each with semantic labels (“natural scenery”, “sky”, etc.), artistic style labels (“complementary colors”, “duotone”), and an aesthetic score from 1–10 based on ratings from tens to hundreds of people.
To develop a system capable of using this database to assess image aesthetics, the team adopted the brain-inspired deep network proposed by Zhangyang Wang et al. in “Brain-Inspired Deep Networks for Image Aesthetics Assessment”. The network architecture is as follows:
The core idea is to learn hidden-layer image style features using the style labels provided by the AVA data set, combine these features with features of the image after HSV (hue, saturation, value) transformation, and use the aesthetic scores from the AVA data set as supervision for learning the image’s aesthetic features.
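As an illustration of how one aesthetic score can be derived from many individual ratings, the sketch below averages a histogram of votes. It assumes the ratings are stored as ten vote counts, one per score from 1 to 10. The function name and the sample counts are made up.

```python
def mean_score(vote_counts):
    """Average rating, where vote_counts[i] is the number of votes for score i + 1."""
    total = sum(vote_counts)
    if total == 0:
        raise ValueError("no votes recorded")
    return sum((i + 1) * count for i, count in enumerate(vote_counts)) / total


# Made-up vote counts for scores 1..10 (110 raters in total).
votes = [0, 1, 2, 5, 20, 40, 30, 10, 2, 0]
score = mean_score(votes)
```

A mean is only one possible summary. The spread of the votes also carries information that a model could use.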
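The HSV transformation mentioned above can be sketched with Python’s standard `colorsys` module. The image representation (nested lists of 8-bit RGB tuples) and the function name are assumptions chosen to keep the example dependency-free.

```python
import colorsys


def rgb_image_to_hsv(pixels):
    """Convert rows of 8-bit (r, g, b) tuples to rows of (h, s, v) floats in [0, 1]."""
    return [
        [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0) for (r, g, b) in row]
        for row in pixels
    ]


# A tiny 2x2 test image: red, blue, white, black.
image = [
    [(255, 0, 0), (0, 0, 255)],
    [(255, 255, 255), (0, 0, 0)],
]
hsv = rgb_image_to_hsv(image)
```

Separating hue, saturation, and value gives the network direct access to color and brightness cues, which tend to matter for how attractive a photo looks.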
On this basis, the following network architecture was ultimately used in the UGC quality review model:
The model was pre-trained with the image style labels and aesthetic scores from the AVA dataset. Aesthetic features were extracted with the brain-inspired deep network, semantic features were extracted with ResNet, and the original statistical features were represented using a deep model. The system then performed comprehensive UGC quality evaluation based on all three types of features.
The introduction of aesthetic features improved the accuracy, recall, and F1 values of the model, leading to an improvement of 6% over the previous model in review pass rate.
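For reference, the F1 value combines precision and recall into a single number. The sketch below computes all three for a binary "approve or reject" decision. The labels are made up and the function name is an assumption.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up reviewer decisions (1 = approved) and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```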
Dirty Data Filtering
Based on feedback from operators and the team’s examination, the dirty data present in the UGC review was categorized as in table 1.
The team tried using existing components to conduct sentiment analysis of comments. However, they found that identifying negative comments was difficult and many comments were mistakenly identified as negative.
Next, they tried training the system using the approved UGC comments and bad comments, which achieved an F1 value higher than 0.9 on the validation set. However, due to differences between the training data and actual datasets (the presence of neutral comments, for example), in practice the system still mistakenly identified comments as negative.
To combat this problem, they re-trained the model using four comment types: approved, positive, neutral, and negative. This gave slightly lower F1 values on the validation set, but better results in practice because the training data was more realistic. This solved the basic problem, in that no further negative comments were seen in UGC selected by the model; however, mixed but generally positive reviews were still marked as negative.
For the final model, the team chose Attn-BiLSTM (bidirectional LSTM with attention), which achieved a 3% improvement in F1 score over TextCNN and was able to correctly identify the status of mixed reviews.
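The attention step in a model of this kind can be illustrated with a minimal dot-product attention pooling sketch. This is not the team’s model. In an actual Attn-BiLSTM the hidden states would come from the bidirectional LSTM and the context vector would be learned; here both are made-up numbers, and the function name is an assumption.

```python
import math


def attention_pool(hidden_states, context):
    """Softmax-weighted sum of hidden states, scored by dot product with `context`."""
    scores = [sum(h_d * c_d for h_d, c_d in zip(h, context)) for h in hidden_states]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(len(context))
    ]
    return weights, pooled


# Three made-up two-dimensional hidden states, one per token.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, pooled = attention_pool(states, context=[1.0, 1.0])
```

Because the weights differ per token, a review with one mild complaint and several strong compliments need not be dominated by the complaint, which is one plausible reason attention helps with mixed reviews.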
Routine and repetitive comments
Through a global comparison of UGC text, it was possible to identify templates used by multiple users and filter them out.
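A minimal sketch of such a global comparison follows. It assumes a template is any normalized text posted by at least `min_users` distinct users. That threshold, the normalization rule, and the function names are all illustrative assumptions.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and drop whitespace and punctuation so near-identical copies collide."""
    return re.sub(r"[\W_]+", "", text.lower())


def find_templates(comments, min_users=3):
    """comments: iterable of (user_id, text) pairs.

    Returns the normalized texts posted by at least `min_users` distinct users.
    """
    users_by_text = defaultdict(set)
    for user_id, text in comments:
        key = normalize(text)
        if key:
            users_by_text[key].add(user_id)
    return {key for key, users in users_by_text.items() if len(users) >= min_users}


# Made-up comments: three users share one template, one comment is original.
comments = [
    ("u1", "Great quality, fast shipping!"),
    ("u2", "great quality fast shipping"),
    ("u3", "Great quality,  fast shipping!!"),
    ("u3", "Great quality, fast shipping!"),
    ("u4", "The sleeves run a little long on me."),
]
templates = find_templates(comments)
```

Counting distinct users rather than raw occurrences keeps one enthusiastic repeat poster from being mistaken for a shared template.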
Repetitive comments were identified by analyzing the repetition of 2-gram, 3-gram, and 4-gram elements in UGC text, with text length and text information entropy taken into consideration.
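One way to combine these signals is sketched below. The source does not say how the n-gram statistics, text length, and entropy were combined, so the thresholds and the decision rule here are assumptions, as are the function names.

```python
import math
from collections import Counter


def ngram_repetition(text, n):
    """Fraction of character n-grams that repeat an earlier n-gram."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def char_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in Counter(text).values())


def looks_repetitive(text, min_length=8, max_repetition=0.5, min_entropy=2.0):
    """Flag text with heavy n-gram repetition or very low entropy (assumed thresholds)."""
    if len(text) < min_length:
        return False  # too short to judge reliably
    repetition = max(ngram_repetition(text, n) for n in (2, 3, 4))
    return repetition > max_repetition or char_entropy(text) < min_entropy
```

For example, `looks_repetitive("goodgoodgoodgoodgood")` flags the text, while a short ordinary sentence such as "Soft fabric, true to size." passes.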
Text-laden and generic images
OCR and image-editing detection were used to filter out images with dense text.
For generic, pirated, or other Internet-sourced images, hash filtering was used. By representing the image as a hash value, generic images could be filtered out by calculating the number of global repetitions of the hash value across all buyer and seller content.
This process revealed that third-party images were even more common than expected in buyer show content. Even many seemingly original comments contained them. The majority of UGC was filtered out in this process.
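The hash-based filtering described above can be sketched as follows. The source does not name the hash function used. This example swaps in SHA-256 over the raw image bytes, which only catches byte-identical copies; a perceptual hash would be needed to catch resized or re-encoded versions. The function names and the `max_occurrences` threshold are assumptions.

```python
import hashlib
from collections import Counter


def image_digest(image_bytes):
    """Hex digest of the raw image bytes (exact-duplicate detection only)."""
    return hashlib.sha256(image_bytes).hexdigest()


def filter_generic(images, max_occurrences=1):
    """images: list of (image_id, raw_bytes) pairs from all buyer and seller content.

    Keeps the IDs whose digest appears at most `max_occurrences` times globally.
    """
    digests = [(image_id, image_digest(data)) for image_id, data in images]
    counts = Counter(digest for _, digest in digests)
    return [image_id for image_id, digest in digests if counts[digest] <= max_occurrences]


# Made-up byte strings standing in for image files.
images = [
    ("buyer1", b"photo-A"),
    ("buyer2", b"stock-image"),
    ("seller", b"stock-image"),
    ("buyer3", b"photo-B"),
]
kept = filter_generic(images)
```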
Irrelevant Image Filtering
Last but not least, the team tackled the problem of filtering irrelevant images.
Hash value filtering was easily able to filter out the pirated images, GIFs, emojis, and other generic internet images which accounted for most of the images in the “irrelevant” category. This left about 10–15% of irrelevant images among the total images output by the system. Though a small number, this is a complicated category of images covering everything from random landscape shots to animated screenshots taken by users.
The solution adopted was as follows:
1. Use generic images as negative samples and approved images as positive samples.
2. Use ResNet to extract image features. Category features are extracted via embedding, while user features are extracted from user behavior (e.g. number and proportion of the published generic images).
3. Determine whether a given image is relevant based on the image features.
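Steps 1 and 2 can be partly sketched without any machine-learning library: assembling labeled samples and computing the user-behavior features mentioned as examples. The ResNet feature extraction and the classifier itself are omitted, and all names below are hypothetical.

```python
def build_samples(generic_ids, approved_ids):
    """Label generic images 0 (negative) and approved images 1 (positive)."""
    samples = [(image_id, 0) for image_id in sorted(generic_ids)]
    samples += [(image_id, 1) for image_id in sorted(approved_ids)]
    return samples


def user_generic_features(published, generic_ids):
    """Number and proportion of generic images among one user's published images."""
    generic_count = sum(1 for image_id in published if image_id in generic_ids)
    proportion = generic_count / len(published) if published else 0.0
    return generic_count, proportion


# Made-up image IDs.
samples = build_samples({"g1"}, {"a1", "a2"})
features = user_generic_features(["a", "b", "c", "d"], {"b", "d", "z"})
```

A user who frequently posts generic images is more likely to post irrelevant ones, which is presumably why these counts are useful alongside the image features.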
For Alibaba engineer Xiaohong (Ou Hongyu), the key lessons learned from this project were as follows:
• Data comes before features, and features come before models. The importance of using quality data that accurately models the real-life scenario cannot be overstated.
• Quickly tagging thousands of pieces of data is a good tactic when lacking desired data. The results might be better than expected.
• Fine-tuning an existing pre-trained model often leads to better outcomes on small data sets.
• Enhancing the data by flipping, rotating, and randomly cropping images can improve the generalization capability of the model.
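The augmentations named in the last point can be sketched on a plain nested-list image, with no imaging library required. The function names are assumptions, and a real pipeline would more likely use an image library’s built-in transforms.

```python
import random


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window at a random position."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


# A 2x3 "image" of made-up pixel values.
image = [[1, 2, 3], [4, 5, 6]]
flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
crop = random_crop(image, 2, 2, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the crops reproducible, which helps when debugging a training run.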
(Original article by Ou Hongyu欧红宇)
He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition[J]. 2015.
Kim Y. Convolutional Neural Networks for Sentence Classification[J]. Eprint Arxiv, 2014.
Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need[J]. 2017.
Talebi H, Milanfar P. NIMA: Neural Image Assessment[J]. IEEE Transactions on Image Processing, 2017.
Yu W, Zhang H, He X, et al. Aesthetic-based Clothing Recommendation[J]. 2018.
Perronnin F, Marchesotti L, Murray N. AVA: A large-scale database for aesthetic visual analysis[C]// 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2012.
Vozel B, Ponomarenko N, Ieremeiev O, et al. Color image database TID2013: Peculiarities and preliminary results[C]// European Workshop on Visual Information Processing. IEEE, 2013.
Wang Z, Chang S, Dolcos F, et al. Brain-Inspired Deep Networks for Image Aesthetics Assessment[J]. arXiv preprint, 2016.