The Evolution of Online Analytics from Alibaba Economy Ecosystem to Cloud
By Chaoqun Zan, Researcher at Alibaba Cloud; Julian Zhou, Staff Product Manager at Alibaba Cloud
Interesting Data Products in Alibaba
Analytics is all about data. How to discover the value in data and turn it into business value is the core question for analytics development. Data-driven analytics products have a long history in Alibaba Group. First, let’s take a look at some interesting data products in Alibaba Group.
Taobao time machine (https://www.jianshu.com/p/4c0d8db26b80), launched in 2012, was the first “to C” internet big data application in China. When a user logs in, it works like a time machine, presenting that person's behaviour, interests, actions, and characteristics across multiple dimensions as a profile on a timeline. It is full of personal memories and was very well received, moving many Taobao users in 2012. This kind of customer-oriented data product became a role model: many “to C” applications have since offered customers a similar summary of their footprint over the whole year.
Alibaba Index (https://shu.taobao.com/) was a market-oriented application launched in 2011. Its original name was Taobao Index. It reported trends on Taobao and Alibaba e-commerce across multiple dimensions such as market, subscribers, and categories.
What is more important is how analytics empowers the Alibaba business as a data-driven engine. Data was growing exponentially as logs accumulated. Data scientists and data analysts run algorithms every day to produce profiles and tags for every single object, including sellers, buyers, products, and orders. There are two key principles. The first is that tag matching is an efficient way to find similar objects or similar crowds of objects.
The second is the gradual weakening of tags: the leading tags usually carry most of the weight in the similarity.
These two principles form the foundation of precision marketing and advertising today. Internally, there is an approach called “LOOKALIKE” modeling, which finds a target crowd based on its similarity to an existing crowd. The steps are: set up the relational model; apply algorithms and train models on the data; and use SQL to analyze the data online and find insights.
Based on this analytics scheme, Damopan (https://dmp.taobao.com/) and Alimama (https://www.alimama.com/), the precision marketing and advertising platforms, provide data-driven analytics services and products every day, empowering both Alibaba Group itself and merchants in the Alibaba e-commerce ecosystem with strong ROI.
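The two principles above can be illustrated with a minimal sketch. It is not Alibaba's internal LOOKALIKE implementation: the geometric decay factor, the weighted Jaccard similarity, and the function names `tag_weights`, `similarity`, and `lookalike` are all assumptions chosen for illustration. Each object carries a ranked list of unique tags, leading tags first.

```python
from typing import Dict, List, Tuple


def tag_weights(tags: List[str], decay: float = 0.5) -> Dict[str, float]:
    """Give ranked tags geometrically decaying weights: 1, decay, decay^2, ...

    The decay factor is an assumed value; it models the "gradual weakening"
    of tags, so leading tags dominate the similarity.
    """
    return {tag: decay ** rank for rank, tag in enumerate(tags)}


def similarity(a: List[str], b: List[str], decay: float = 0.5) -> float:
    """Weighted Jaccard similarity between two ranked tag lists."""
    wa, wb = tag_weights(a, decay), tag_weights(b, decay)
    tags = set(wa) | set(wb)
    shared = sum(min(wa.get(t, 0.0), wb.get(t, 0.0)) for t in tags)
    total = sum(max(wa.get(t, 0.0), wb.get(t, 0.0)) for t in tags)
    return shared / total if total else 0.0


def lookalike(
    seed_crowd: List[List[str]],
    candidates: Dict[str, List[str]],
    threshold: float = 0.5,
    decay: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return candidates whose best similarity to any seed member passes the threshold."""
    scored = []
    for cid, tags in candidates.items():
        score = max((similarity(tags, s, decay) for s in seed_crowd), default=0.0)
        if score >= threshold:
            scored.append((cid, score))
    # Highest-scoring candidates first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a seed crowd of `[["sports", "shoes"]]`, a candidate sharing the leading tag `sports` scores higher than one sharing only the trailing tag `shoes`, which is the weighting behaviour the second principle describes.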
Data Business Development in Alibaba
If we look back over the years of data and business development in Alibaba, we see that data and business are tightly correlated. At the very beginning, business generated data; then we entered the phase in which data drives business growth; finally, data itself is becoming a business that makes a profit.
From the operational perspective, data business is about how to store, connect, and use data in a more efficient way. First, we move data to the cloud. Data on the cloud has a number of benefits: it is stored centrally with unified metadata, and cloud infrastructure provides large-scale computing capability on top of the data. Cloud data is used by many data consumers as data assets. The data itself then becomes a business, driving precision marketing, FinTech, and smart logistics.
With more openness, data and the analytics on top of it became data services. UMENG (https://www.umeng.com/) is one such data product today. This was also when our online analytics products and services, such as AnalyticDB (https://www.alibabacloud.com/product/analyticdb-for-mysql), went to the public cloud.
To make the system and service architecture more agile, the business Mid-End and the data Mid-End were born as middle layers sitting between the business data application layers and the analytics platform. Their purpose is to drive business development and innovation quickly through these two key factors, business and data.
Mid-End is also called Middle Platform. The data Mid-End drives more data intelligence, such as city brain and industry brain, as shown below.
Here are some screenshots of the city brains that we’ve deployed in scenarios such as city transportation, city operation, and energy control and operation. At the backend, data and the analytics on top of it power this kind of intelligence, making city operation more efficient.
Data Middle Platform
Another data intelligence case is the Data Middle Platform (data Mid-End) for the Alibaba economy mentioned above. For customer analytics and marketing, data assets help find target audiences in e-commerce, media assets such as Youku, Weibo, and long-tail sites, offline media such as UMENG, and other digital media such as Douyin (TikTok), based on the “LOOKALIKE” logic introduced earlier. At the bottom are the supporting data and analytics platforms and technologies, including QuickBI (https://www.alibabacloud.com/product/quickbi), Quick Audience (https://dp.alibaba.com/product/quickaudience), Dataphin (https://dp.alibaba.com/product/dataphin), AnalyticDB (https://www.alibabacloud.com/product/analyticdb-for-mysql), OSS (https://www.alibabacloud.com/product/oss), etc.
Online Analytics System Development in Alibaba
Let’s take a look at the online analytics development from the technical perspective.
Alibaba was established in 1999, and as the business and its data grew, data analytics was always needed. In the years before 2008, Oracle RAC was used. It was basically an SMP (symmetric multi-processing) architecture, with multiple cores in a single machine. For a number of years this handled the analytical workload well in terms of real-time performance, consistency, agility, and accuracy. (Here we define agility as the latency of handling multi-dimensional data analysis.) It was “all-in-one”, without the capability to scale out horizontally.
Alibaba's business grew quickly, and the system's limitations showed up under a fast-growing query workload with high concurrency and large data volume.
Then in 2009, Greenplum, with its MPP (massively parallel processing) architecture, was introduced to solve the problem, especially for data volume: it could handle petabytes of data while keeping agility and accuracy. But there were still some limitations, namely high concurrency for workloads with mixed query types, and high availability, since the Greenplum leader node is a single point of failure. In addition, data ingestion performance was not good enough for real-time write scenarios.
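The MPP pattern described above can be sketched as a scatter-gather aggregation. This is a simplified illustration, not Greenplum's actual execution engine: the `(category, amount)` row shape and the function names are hypothetical, and the segments run sequentially here, whereas a real MPP system runs them in parallel on separate nodes.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (category, amount); a hypothetical schema


def distribute(rows: Iterable[Row], n_segments: int) -> List[List[Row]]:
    """Hash-distribute rows across segments by the category key."""
    segments: List[List[Row]] = [[] for _ in range(n_segments)]
    for row in rows:
        # crc32 gives a stable hash, so placement is reproducible across runs.
        index = zlib.crc32(row[0].encode("utf-8")) % n_segments
        segments[index].append(row)
    return segments


def segment_aggregate(rows: List[Row]) -> Counter:
    """Work done on one segment: a partial SUM(amount) GROUP BY category."""
    partial: Counter = Counter()
    for category, amount in rows:
        partial[category] += amount
    return partial


def leader_aggregate(segments: List[List[Row]]) -> Dict[str, int]:
    """Work done on the leader: merge the partial results from every segment."""
    total: Counter = Counter()
    for segment_rows in segments:
        total.update(segment_aggregate(segment_rows))
    return dict(total)
```

Every query passes through `leader_aggregate`, which shows in miniature why a single leader node becomes both a concurrency bottleneck and a single point of failure.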
By 2011, open source big data projects were increasingly popular in the market and among internet giants for large-scale analytical processing. Alibaba also started to adopt HBase and Hadoop for petabyte-level batch data processing, and a sharding architecture on top of MySQL database instances with SSD storage replaced Oracle for the OLTP (online transactional processing) workload. This architecture was typical: it decoupled online from batch data processing, and transactional from analytical processing, to handle fast data growth. But consistency between batch and online data, and agility, were a headache. Both problems usually required extra architecture design and implementation, such as batch loading and streaming for data synchronization, and pre-computing and storing data cubes for agility. The system grew more and more complex, with a steep learning curve for new engineers joining the organization.
Then in 2013, AnalyticDB version 1.0 was born, solving some of those pain points, including volume, high concurrency, agility, low latency, accuracy, and high availability. At the beginning, however, batch and online consistency was still hard to achieve, because real-time data ingestion and highly concurrent analytical queries were not yet available within a single analytical system. Database ACID properties, which are important for an “all-in-one” analytical system architecture, were also not yet supported in AnalyticDB.
Even with these issues, thanks to its powerful online interactive query capability over big data, AnalyticDB has since 2013 become the online analytics infrastructure for digital transformation in most business units of Alibaba Group.
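The trade-off of pre-computing data cubes (lower query latency in exchange for more storage) can be sketched as follows. This is a toy illustration under assumed names: `build_cube`, `lookup`, and the `region`/`category`/`sales` fields are hypothetical and do not describe the system Alibaba actually built.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

CubeKey = Tuple[Tuple[str, str], ...]


def build_cube(rows: List[dict], dimensions: List[str], measure: str) -> Dict[CubeKey, float]:
    """Pre-aggregate SUM(measure) for every subset of the dimensions.

    Each row contributes to 2 ** len(dimensions) cells, which is the extra
    storage cost paid to make multi-dimensional queries fast.
    """
    cube: Dict[CubeKey, float] = defaultdict(float)
    for row in rows:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((d, row[d]) for d in dims)
                cube[key] += row[measure]
    return dict(cube)


def lookup(cube: Dict[CubeKey, float], dimensions: List[str], **filters: str) -> float:
    """Answer an aggregate query with one dictionary lookup instead of a scan."""
    # Build the key in the same dimension order that build_cube used.
    key = tuple((d, filters[d]) for d in dimensions if d in filters)
    return cube.get(key, 0.0)
```

For example, after `cube = build_cube(rows, ["region", "category"], "sales")`, the call `lookup(cube, ["region", "category"], region="east")` returns the pre-computed total for that region without touching the raw rows. The cube must be rebuilt or updated whenever the underlying data changes, which is one source of the batch/online consistency headache mentioned above.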
Evolving to Cloud
From 2013 to 2019, Alibaba's core businesses migrated to cloud infrastructure. AnalyticDB MySQL version 3.0, AnalyticDB PostgreSQL 6.0, and Data Lake Analytics are the online analytics products that evolved during this period. They carry the mission of helping customers grow their businesses with the online analytics capabilities developed within Alibaba Group over the past few years.
The cloud is a very open platform for the data ecosystem. Cloud native Data Lake Analytics (DLA) can be used to connect and process data from different data sources, including data in OLTP databases, ubiquitous log data, massive data in the Hadoop big data system, and heterogeneous types of data in the OSS data lake, both structured and semi-structured. DLA integrates the open source Presto and Spark engines to achieve rich multi-source upstream and downstream connection capabilities and multi-mode computing capabilities.
Cloud native data warehouse AnalyticDB focuses on two aspects:
- cost-effectiveness, which provides users with high-performance online analysis capabilities at low data storage cost and on-demand elastic computing capabilities;
- ecosystem compatibility, which means MySQL users can choose AnalyticDB MySQL, while PostgreSQL users can choose AnalyticDB PostgreSQL. Thanks to this compatibility, it is convenient to connect a variety of BI tools and suites to AnalyticDB for online analytics and visualization. It also supports data warehouse development tools such as DataWorks or DMS for data development and task scheduling.
How Is It Working on Cloud?
Thousands of customers have benefited from the online analytics cloud products since they became available on Alibaba Cloud. Let’s take a look at some application schemes in different scenarios.
A customer providing a short video marketing service uses Data Transmission Service (DTS) to synchronize data from the transactional database in RDS MySQL to AnalyticDB MySQL in real time. This supports online analytics scenarios including hot video aggregation analysis, multi-dimensional BI reporting, and statistical analysis of hosts and followers.
A customer providing an internet financial service uses AnalyticDB PostgreSQL to support analytics scenarios including real-time queries that serve its clients and BI reporting. The data ingestion channels from the upstream business systems include:
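A hot video aggregation of the kind mentioned above might look like the following sketch. Python's built-in `sqlite3` stands in for AnalyticDB MySQL purely so the example runs locally; the `video_events` table, its columns, and the sample rows are hypothetical and not taken from the customer's schema.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with made-up rows, standing in for synchronized data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE video_events (video_id TEXT, host_id TEXT, event_type TEXT)")
    conn.executemany(
        "INSERT INTO video_events VALUES (?, ?, ?)",
        [
            ("v1", "h1", "play"), ("v1", "h1", "play"), ("v1", "h1", "play"),
            ("v2", "h2", "play"), ("v2", "h2", "play"), ("v2", "h2", "like"),
            ("v3", "h1", "play"),
        ],
    )
    return conn


def top_videos(conn: sqlite3.Connection, limit: int = 3) -> List[Tuple[str, int]]:
    """Return the most-played videos as (video_id, plays), hottest first."""
    cursor = conn.execute(
        """
        SELECT video_id, COUNT(*) AS plays
        FROM video_events
        WHERE event_type = 'play'
        GROUP BY video_id
        ORDER BY plays DESC, video_id ASC
        LIMIT ?
        """,
        (limit,),
    )
    return cursor.fetchall()
```

Calling `top_videos(make_demo_db(), limit=2)` ranks the sample videos by play count. Against a real warehouse, the same kind of GROUP BY query would be issued through a MySQL-compatible client rather than `sqlite3`.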
- DTS synchronization from RDS MySQL
- Flink streaming
- Migration from Oracle
- Batch loading from MaxCompute via DataWorks (MaxCompute reader and AnalyticDB PostgreSQL writer) for query acceleration
A customer providing a social media app uses AnalyticDB MySQL to support analytics scenarios including BI reporting with Tableau and data analytics driven modules within its app. The data ingestion channels from the upstream business systems include:
- DTS synchronization from PolarDB MySQL
- Periodic data loading from MongoDB via DataWorks (MongoDB reader and AnalyticDB MySQL 3.0 writer) for query acceleration
A customer providing data services to its advertising clients uses Data Lake Analytics as the centralized data processing engine in the advertising optimization lifecycle, as shown below.
A customer providing a gaming service has built a gaming DAP (Data Analytics Platform) that supports all analytical workloads for the gaming business. All data sinks into two types of cloud storage, OSS as the data lake and AnalyticDB MySQL as the data warehouse:
- some application modules on the ECS game servers write data into AnalyticDB MySQL in real time via JDBC connections;
- Log Service collects the application logs via Logtail and ships the log data into the OSS data lake;
- Data Lake Analytics does the ETL and writes the cleaned data back to either OSS or AnalyticDB MySQL;
- data from the transactional databases PolarDB MySQL and RDS MySQL is synchronized into AnalyticDB MySQL via DTS.
Online analytics systems and technology are evolving fast on the cloud toward cloud native, serverless, HTAP, intelligence, online/batch integration, and database/big data integration. From the data perspective, across the journey from data-driven business and data as a business to data technology and data intelligence on the cloud, the online analytics system is and will remain the core engine driving business growth into the future.