Evolution of the Real-time Data Warehouses of the Alibaba Search and Recommendation Data Platform

Background

The real-time data warehouses of the Alibaba Search and Recommendation Data Warehouse Platform support multiple e-commerce businesses, such as Taobao (Alibaba Group), Taobao Special Edition (Taobao C2M), and Eleme. Real-time data warehouses also support data applications, such as real-time dashboards, real-time reports, real-time algorithm training, and real-time A/B test dashboards.

Value of Data

Typical Search and Recommendation Scenarios

Real-time data is used in a variety of e-commerce search and recommendation scenarios, such as real-time analysis, algorithm application, and refined operations by the target audience.

  • Large data volumes: data storage in the petabytes every day
  • Total entries in a single table: over 100 billion
  • High queries per second (QPS): peak write speed of over 65 million records per second (RPS)
  • Peak QPS: more than 200
  • High data flexibility, diversified analysis scenarios, high-frequency analysis with specified conditions, and multi-dimensional queries without specified conditions

Typical Demands for Real-time Data Warehouses

The following typical demands for real-time data warehouses have been summarized during the construction stage:

  • Grouping: For example, the GROUP BY clause is used in SQL statements to display industry metrics.
  • Multi-dimensional Filtering: The array field is used to filter attribute values in scenario filtering, user filtering, product filtering, and merchant filtering.
  • Aggregation: Real-time computing metrics such as SUM and COUNT_DISTINCT are aggregated based on details.
  • A/B Test: The real-time gap between the test bucket and the benchmark bucket is calculated by parsing the bucket fields in log tracking.
  • Specified Keys: To troubleshoot problems or check core merchant metrics, you need to specify the merchant ID or product ID to query real-time metrics, and aggregate data based on the ID field in the real-time details table.
  • Unified Batch and Stream Processing: Real-time data warehouses only retain data generated in the last two days. Therefore, if you need to compare data on a year-on-year or month-on-month basis, you need to read offline data and real-time data for associative computing. This allows product or operation personnel to intuitively view the real-time data and contrast data in the upper-layer reports.

Real-time Data Warehouse Architecture

Based on these typical demands, we have abstracted the typical real-time data warehouse architecture, as shown in the following figure.

1.0 Real-time Data Warehouse Architecture

The 1.0 real-time data warehouse architecture consists of three phases, as shown in the following figure.

Data Collection

At the data collection layer, the upstream data collected in real-time is divided into user behavior logs, product dimension tables, merchant dimension tables, and user dimension tables. Dimension tables are used because the different businesses do not store all their information in logs during tracking. If all the information is stored in user behavior logs, the business is inflexible. Therefore, dimension tables are used to store more information for businesses.

Data Processing

During stream processing, Apache Flink preliminarily processes real-time user behavior logs at the data processing layer, including parsing, cleansing, and filtering data, and associating dimension tables.

Data Queries and Services

In the 1.0 architecture, we used the Lightning engine to carry the real-time detail data output by Apache Flink, implement unified stream and batch queries based on Lightning, and then provide unified real-time data query services for upstream applications.

2.0 Real-time Data Warehouse Architecture

Hologres Best Practices

Using Hologres helps simplifying the structure of the 2.0 real-time data warehouse architecture, save resources, and implement unified stream and batch processing. We are still using this architecture. This section describes Hologres best practices in specific search and recommendation scenarios based on this architecture.

Best Practices in Row Store Mode

Hologres supports row store and column store modes. The row store mode is friendly to KV queries and is suitable for point queries and scans based on primary keys. The tables that are stored in row store mode are similar to HBase tables. Different tables store dimension information for different entities. Apache Flink jobs efficiently read dimension tables and associate them with entities in real-time streams.

Best Practices in Column Store Mode

By default, tables are stored in Hologres in column store mode. This mode is friendly to OLAP scenarios and is suitable for various complex queries.

Best Practices for Unified Stream and Batch Processing

Hologres supports ad-hoc queries and the analysis of real-time detail data. It also directly accelerates queries of MaxCompute offline tables. Therefore, we use this feature to implement unified stream and batch queries, that is real-time offline federated analytics.

High-concurrency Real-time Update

In some scenarios, you must write incremental data to the OLAP engine in real-time and update the written data.

Outlook

We hope to continuously improve the existing real-time data warehouses based on the Hologres engine in the following aspects.

Real-time Table Joins

Hologres supports joins between tables that contain tens of billions of records and tables that contain hundreds of millions of records. Such joins are performed in seconds in response to queries. This feature allows you to perform real-time join computing in the Hologres query phase to perform dimension table association that must be completed by an Apache Flink job in the data processing phase. For example, assume table 1 is a details table and table 2 is a user dimension table. In the query phase, the join operation filters the user dimension table and then associates it with the details table to filter data. Such improvements offer several benefits:

Persistent Storage

In the future, we will explore how to persistently store real-time computing results in common dimensions by using the computing and storage capabilities of Hologres.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Tech

Alibaba Tech

First-hand & in-depth information about Alibaba's tech innovation in Artificial Intelligence, Big Data & Computer Engineering. Follow us on Facebook!