Foolproofing Live Streams in China

5 min read · Mar 28, 2018

How object detection algorithms are helping ensure compliant live streaming on Alibaba’s shopping platforms

Live streaming had been around for a while, but 2016 was the year it skyrocketed, with new live streaming platforms and viral live streamers emerging every week. The trend stabilized in 2017 with the steady growth of platforms like Taobao Live Streaming (Alibaba’s ecommerce-related live streaming service) and a greater awareness of the risks that come with the increased volume of live video, the most common being copyright violations and prohibited content.

The former is a major pain point for live streaming platforms, since protecting intellectual property rights relies heavily on manual supervision of live video content. Online content providers are compelled to adhere to domestic and international laws on intellectual property, making this an inescapable obligation which carries heavy penalties for non-compliance.

Another compliance issue is enforcing anti-vulgarity and anti-indecency laws, which vary across jurisdictions. Some countries are lax, while others may severely punish platforms for allowing live streams that violate socio-legal norms. Taobao’s pornographic-image recognition model can easily flag nudity, but detecting subtler violations under region-specific indecency laws is more complex.

To combat the uncertainties present in moderating a highly fluid and dynamic content format like live streaming, Alibaba developed a tool to help monitor live streamed videos in real time.

Alibaba’s Technical Solution for Object Detection

When a live stream begins, it is identified by its live room ID, and a dedicated cache is initialized for that ID. During the stream, the bottom-up feature is calculated for each frame belonging to the ID and compared with the cache’s contents. For matches, the detection results are read directly from the cache; for non-matches, the target detection cluster is called to perform the calculation, and its results are written synchronously to the cache.
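The lookup flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Alibaba's implementation: the names `RoomCache`, `frame_feature`, and `run_detection_cluster` are hypothetical, and an exact hash of the frame bytes stands in for the bottom-up feature.

```python
import hashlib
from typing import Callable, Dict


def frame_feature(frame: bytes) -> str:
    """Hypothetical stand-in for the bottom-up feature: an exact content hash."""
    return hashlib.sha256(frame).hexdigest()


class RoomCache:
    """One cache per live room ID, mapping frame features to detection results."""

    def __init__(self, detect: Callable[[bytes], dict]) -> None:
        self._detect = detect  # assumed call into the detection cluster
        self._rooms: Dict[str, Dict[str, dict]] = {}
        self.cluster_calls = 0

    def check(self, room_id: str, frame: bytes) -> dict:
        cache = self._rooms.setdefault(room_id, {})  # created on the first frame
        key = frame_feature(frame)
        if key in cache:  # match: read the stored result directly
            return cache[key]
        self.cluster_calls += 1  # non-match: call the detection cluster
        result = self._detect(frame)
        cache[key] = result  # write the result back before returning
        return result


def run_detection_cluster(frame: bytes) -> dict:
    """Placeholder detector; a real system would call the CNN service here."""
    return {"size": len(frame)}


cache = RoomCache(run_detection_cluster)
cache.check("room-1", b"frame-a")
cache.check("room-1", b"frame-a")  # served from the cache
cache.check("room-2", b"frame-a")  # different room, separate cache
print(cache.cluster_calls)
```

Only the first frame in each room reaches the detector here; the repeated frame in `room-1` is answered from that room's cache.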

To meet the response-time and concurrency demands of the live streaming business, a global image deduplication module was designed to remove duplicate frames in live streams by matching features. This significantly reduces the workload on the complex backend models and keeps algorithm processing times at the millisecond level.
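One way to picture deduplication by feature matching is a perceptual hash compared by Hamming distance. The article does not say which feature the module actually uses, so the average hash below is a swapped-in technique chosen for brevity, and `Deduplicator` and `max_distance` are hypothetical names.

```python
from typing import List, Sequence


def average_hash(pixels: Sequence[Sequence[int]]) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the frame mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


class Deduplicator:
    """Treats a frame as a duplicate if its hash is close to one already seen."""

    def __init__(self, max_distance: int = 1) -> None:
        self.max_distance = max_distance  # assumed tolerance, not from the article
        self._seen: List[int] = []

    def is_duplicate(self, pixels: Sequence[Sequence[int]]) -> bool:
        h = average_hash(pixels)
        if any(hamming(h, s) <= self.max_distance for s in self._seen):
            return True
        self._seen.append(h)
        return False


dedup = Deduplicator()
frame_a = [[10, 200], [10, 200]]
frame_b = [[12, 198], [11, 201]]  # nearly identical to frame_a
frame_c = [[200, 10], [200, 10]]  # mirrored content
flags = [dedup.is_duplicate(f) for f in (frame_a, frame_b, frame_c)]
print(flags)
```

The linear scan over seen hashes is only for illustration; at live streaming volumes an indexed lookup would be needed to stay within millisecond budgets.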

The decision model combines the outputs of the different models and returns risk categories and response suggestions.
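As a rough sketch of what such a decision step could look like, the function below maps per-model risk scores to a category and a suggested response. The thresholds, category names, and response labels are all assumptions made for illustration; the article does not describe the real decision logic.

```python
from typing import Dict, Tuple

# Hypothetical thresholds; the article does not state the real ones.
BLOCK_THRESHOLD = 0.9
REVIEW_THRESHOLD = 0.6


def decide(scores: Dict[str, float]) -> Tuple[str, str]:
    """Map per-model risk scores in [0, 1] to (risk category, suggested response)."""
    if not scores:
        return ("none", "pass")
    # Let the highest-scoring model determine the risk category.
    category, score = max(scores.items(), key=lambda kv: kv[1])
    if score >= BLOCK_THRESHOLD:
        return (category, "block")
    if score >= REVIEW_THRESHOLD:
        return (category, "manual_review")
    return ("none", "pass")


print(decide({"copyright": 0.95, "indecency": 0.20}))
print(decide({"copyright": 0.30, "indecency": 0.70}))
print(decide({"copyright": 0.10, "indecency": 0.05}))
```

Routing mid-range scores to manual review reflects the point made earlier that subtler violations are harder to detect automatically.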

Object Detection Network

1. Using the base network to extract features from the whole image

The feature extraction structure is similar to that of the convolutional neural networks (CNNs) used for ImageNet image classification tasks. Commonly used whole-image feature extraction networks include VGG (baseline), the ResNet series (features with high representational capability), and MobileNet (small model). Choosing among them involves a trade-off between accuracy and efficiency.

2. Multi-scale features inside the network to adapt to objects of different sizes

Multi-scale processing originated with traditional object detection methods built on bottom-up features such as SIFT and SURF. With current methods, extracting features from the original image at several scales is too expensive, which is why multi-scale approaches are no longer applied to the original images themselves.

Thanks to a CNN’s capacity for abstract image representation, final-layer features have the best semantic characteristics and generalize best across scales. However, because of their larger receptive field, their pixel-level representation capability remains limited, which hurts localization.

This can be understood as moving the multi-scale operations into the convolution layers, where high-level semantic features and low-level pixel features are used together to represent the image. Context information between feature layers is added to achieve a more accurate representation of features.
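Combining high-level and low-level features can be illustrated with a simple top-down fusion step: a coarse, semantically rich feature map is upsampled and added to a finer map from an earlier layer. The article does not name the architecture used, so this is only a sketch in the spirit of feature-pyramid designs; `upsample2x` and `fuse` are hypothetical helpers working on single-channel grids.

```python
from typing import List

Grid = List[List[float]]


def upsample2x(coarse: Grid) -> Grid:
    """Nearest-neighbour upsampling: each cell becomes a 2x2 block."""
    out: Grid = []
    for row in coarse:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(fine: Grid, coarse: Grid) -> Grid:
    """Add upsampled high-level features to a low-level map of twice the size."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the size of the coarse map")
    return [[f + u for f, u in zip(frow, urow)] for frow, urow in zip(fine, up)]


fine_map = [[1.0, 2.0, 3.0, 4.0],
            [5.0, 6.0, 7.0, 8.0]]   # low-level, high resolution
coarse_map = [[10.0, 20.0]]          # high-level, low resolution
fused = fuse(fine_map, coarse_map)
print(fused)
```

Each fused cell keeps its fine-grained value while also carrying the context of the coarse cell that covers it. Real networks would typically also align channel counts before adding, which is omitted here.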

3. Regression-based bounding-box localization of reproduced content
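For readers unfamiliar with box regression, the sketch below decodes predicted offsets into a box using the parameterization common in R-CNN-style detectors. The article does not state which parameterization Alibaba's network uses, so this form and the name `decode_box` are assumptions.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)


def decode_box(anchor: Box, deltas: Box) -> Box:
    """Apply regressed offsets (dx, dy, dw, dh) to a reference box."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    # Center shifts are scaled by the anchor size; sizes are scaled in log space.
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))


anchor = (50.0, 50.0, 20.0, 10.0)
shifted = decode_box(anchor, (0.5, -1.0, 0.0, 0.0))
doubled = decode_box(anchor, (0.0, 0.0, math.log(2.0), 0.0))
print(shifted)
print(doubled)
```

In this form the network predicts small relative corrections rather than absolute coordinates, so the same outputs apply to objects of different sizes.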

4. Fine-grained recognition of reproduced content, with an ROI network constraining confusable samples

5. Demo display effects:

Service Deployment

The target detection service’s CNN model is deployed on a P100 cluster in a VPC environment so that capacity can be extended flexibly based on traffic. Different VIPs isolate queries by region and business to keep traffic compartmentalized and to support remote disaster recovery. As the middle layer for service scheduling, Insight supports synchronous and asynchronous multi-model scheduling and performs functions such as fine-grained rate limiting and monitoring. The consumer side sits at different application layers, and Taobao Live is one of them.

1. Deployment integrated with the VPC environment on the cloud, reducing the time cost of expansion.

2. Group isolation for VIPs to ensure stable services for Double 11 clusters.

3. Cross-domain scheduling and cloud resource utilization.

4. Cross-domain internal network demos, multi-model combinations, and online services.
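The role of a scheduling middle layer can be sketched as a component that fans one frame out to several models while capping how many calls run at once. This is a hypothetical illustration only: `Scheduler`, `max_in_flight`, and `fake_model` are invented names, the concurrency cap is a crude stand-in for flow control, and nothing here reflects Insight's actual interface.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Model = Callable[[bytes], Awaitable[dict]]


class Scheduler:
    """Hypothetical middle layer that fans a frame out to several models."""

    def __init__(self, models: Dict[str, Model], max_in_flight: int = 2) -> None:
        self._models = models
        self._max_in_flight = max_in_flight  # assumed cap on concurrent model calls

    async def run_all(self, frame: bytes) -> Dict[str, dict]:
        limit = asyncio.Semaphore(self._max_in_flight)

        async def call(name: str) -> Tuple[str, dict]:
            async with limit:  # wait here if too many calls are in flight
                return name, await self._models[name](frame)

        pairs = await asyncio.gather(*(call(n) for n in self._models))
        return dict(pairs)


async def fake_model(frame: bytes) -> dict:
    """Placeholder for a remote model call."""
    await asyncio.sleep(0)
    return {"bytes": len(frame)}


scheduler = Scheduler({"copyright": fake_model, "indecency": fake_model})
results = asyncio.run(scheduler.run_all(b"abc"))
print(results)
```

Here the caller waits for every model before continuing; an asynchronous mode would instead hand back the pending tasks and collect results later.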

Industry analysts unanimously believe that video and video-based services will see the most growth in the next few years. Live streaming, though relatively nascent, is thriving across countries and audience groups, and it cannot be audited with the methods used for static video uploads. It is therefore imperative to develop security technology for effective moderation of live streaming content. Through its detection network solution, Alibaba works to ensure compliance with local and international laws and to provide a safe and stable business environment for live streaming services.

(Original article by Jin Xuan 金炫)

Alibaba Tech

First-hand and in-depth information about Alibaba’s latest technology → Search “Alibaba Tech” on Facebook

Written by Alibaba Tech
