The DataOps Files III: Data Synchronization
This article is part 3 of a 5-part mini-series on DataOps at Alibaba. This installment looks at the ways Alibaba has worked to enhance data synchronization for its big data platform MaxCompute, improving both the efficiency and the quality of these tasks and reducing entry barriers for inexperienced users.
For professionals outside the computing world and even many within it, it may seem strange to imagine working extensively with systems you have only a limited grasp of. In the quickly growing field of big data, though, this predicament has emerged as a daily reality for many who work with complex platforms to help fulfill the computing tasks underlying settings like online commerce.
While the Alibaba Group currently employs more than 2,000 data processing-related specialists, most rely on the group’s big data platform — a combination of DataWorks and MaxCompute — to effectively implement business logic, meanwhile remaining unaware of how the underlying data is computed in large volumes. In one famous instance of the havoc this risks, a select statement written by an unwitting intern led to the generation and display of a one-million yuan purchase bill for a customer of one of Alibaba’s shopping platforms. Faced with such liabilities, the group’s China-leading big data platform has continued to undergo repeated iteration to lower entrance barriers for new users while supporting business logic objectives and reducing storage needs.
One key area where these updates have proved vital is data synchronization. Big data platforms act as warehouses for data, as opposed to producers of it. This necessitates the injection of raw data to finish layer-by-layer processing, after which part of the processed data must return to the system where the service is applied for consumption. It is at this level that data synchronization becomes a crucial priority. Generally speaking, data is primarily sourced either from streaming real-time writes or direct synchronization imports from other storage systems. Taking log data as an example, log data is written directly into the data warehouse following mature log channel service or Blink real-time computation. A specific instance is how the billing data for a given day will be synchronized from the online trading library into the data warehouse at midnight of following day as part of offline reconciliation and settlement, as in the case of the erroneous purchase bill mentioned above.
In this article, we look more closely at the fundamentals of data synchronization, leading to a case scenario that technical audiences can apply to their own working knowledge of the big data field.
Operation and Maintenance Challenges in Synchronization Tasks
Alibaba’s big data platform was developed to seamlessly connect the group’s various internal storage systems, including RDS, OSS, TableStore, Oceanbase, HBase, and AnalyticDB. With rising business demands, data flow has become increasingly important, and synchronization tasks today consume millions of cores of CPU resources each day — second only to SQL.
In the past, Alibaba did not specifically optimize synchronization tasks, relying on temporary, inefficient remedies which impacted the company’s baseline and online services. Meanwhile, thorough results could not be achieved through optimization efforts based on practical experience alone, leaving many latent risks undiscovered.
Optimizing synchronization tasks requires a deep understanding and experience of the synchronization center. In most cases, this calls for consulting members of Alibaba’s platform development department or SRE, and it is rare for users to successfully complete initial optimization independently.
The followings two charts show the synchronization task efficiency of two business units with a demonstrated gap in their respective offline practice abilities (using synchronization speed as an indicator).
The above data reflects the fact that the offline work of the unit represented in blue achieved deep, sustained insight, with its synchronization tasks generally distributed in the higher speed range. The unit represented in yellow, by contrast, reflects performance at an initial stage of offline competency. Its synchronization tasks rely to a large extent on luck, and are distributed in a relatively even range of speeds due to the different efficiencies of different types of synchronization tasks. From the perspective of average efficiency, the gap between the two units is obvious. The former is 2.57 times more efficient than the latter, which shows the extent to which practical experience affects synchronization task optimization.
Nevertheless, gaps in offline practice will always exist. The best way to help business units improve their efficiency in synchronization tasks, or even support tasks that have been optimized manually from experience toward optimal efficiency, is with a data- and algorithm-based system using company Tdata. This system digitalizes experience, optimizes tasks, and makes tasks smarter. Its offline computation is not demanding on SLAs, and latencies at the minute level or even the hour level are acceptable.
Optimizing synchronization tasks should aim first of all to create a smart recommendation platform for synchronization task optimization, so that beginner users can complete optimal efficiency tuning for synchronization tasks on their own.
Beyond this, it should aim to capture potentially tunable synchronization tasks on the platform, complete the daily push of synchronization task tuning, promote users to better optimize their own tasks, and eliminate potential risks in all types of synchronization.
Key Application Elements and Methods
At present, data operation and maintenance still focuses on operation and maintenance, with data as a supplement. With input from R&D specialists from Alibaba’s synchronization center and Site Reliability Engineering (SRE), the following elements have been determined as important to the efficiency of synchronization tasks:
· Source type
· Sink type
· Record size
· Number of fields
· JVM parameters
· Batch size
· Error limit
In a synchronization task, the source type, sink type, record size, and number of fields are properties of the task, and are fixed variables (also called basic attributes); concurrency, JVM parameters, batch size, and error limit are configurable variables, which can be understood as parameters; the desired output result is the speed. On this basis, it becomes possible abstract a general problem and optimize the output by adjusting concurrency, JVM parameters, batch size, and error limit.
To determine a method for achieving optimization, Alibaba’s team sought input from the group’s algorithm experts.
K-means cluster analysis
Following a number of sleepless nights spent in exploration with algorithm specialists, the Alibaba team was able to establish the most suitable algorithm for synchronization optimization: cluster analysis.
In brief, the idea is to cluster a large amount of historical data (obtained by processing historical synchronization tasks) based on fixed variables (source type, sink type, record size, and number of fields), so that tasks in the same cluster by-and-large share the same basic attributes and can be treated as one task. Thus, tasks in a common cluster should have a configuration resembling the optimal configuration (concurrency, JVM parameters, batch size, and error limit).
The team used K-mean in clustering for each direction of transmission. Taking the task synchronization from MaxCompute to mysql as an example, the following figure shows the data yielded after standardization according to the attributes (record size and number of fields) of historical synchronization tasks.
The blue and yellow portions depict two clusters of jobs that are relatively alienated from the majority, with the distances between the dots likewise removed. Correspondingly, 90% of jobs are concentrated in the red portion. Therefore, a second clustering is needed in this case. For the first clustering, setting K=3 effectively separates the three clusters of yellow, blue, and red distribution points.
Due to the large amount of work done on the red portion, it becomes necessary to make a finer division among the red dots by performing a second clustering on it, this time by setting K to 20 and selecting the five clusters with the largest sample size as a representative cluster center, as shown in the figure below.
Optimal configuration: Customized recommendations
After completing task clustering, the next step is to further analyze the historical task configuration and the corresponding synchronization speed of each cluster, then select the optimal configuration combination for each cluster. Meanwhile, to ensure the efficient use of resources, multiple recommended configuration rules should be provided for each type of task, which can be flexibly configured according to the priority level of each task as shown in the figure below.
Taking the transmission direction from odps to mysql as an example, the 15_new, 19_new, 4_new, and 8_new in the table are clusters formed by clustering synchronization tasks from odps to mysql. In each cluster, 🚩🚩🚩and 🚩 indicate the recommended configuration corresponding to high and low priority tasks, respectively.
Thus, for a new synchronization task, the first step is to find the cluster it belongs to according to its source type, sink type, record size, and number of fields, and then obtain the optimal configuration recommendation according to its priority.
The results of random sampling show that the synchronization effect of the task changes significantly after adapting the optimal configuration. Generally, the synchronization speed increases as much as tens of times, as shown in the figure below.
As shown, there are 54 synchronization instances in this grayscale. Most of the synchronization efficiency displays a qualitative change, with speed increasing by several or tens of factors, peaking at a multiple of 217.
The following image compares synchronization speeds of these tasks before and after optimization:
Originally, synchronization took as much as tens of hours. Following optimization, tasks could be synchronized within a few hours, or even one hour. This greatly enhanced stable operation of related business processes.
Before optimization, most tasks were concentrated in an inefficient synchronization speed range of 5MB/s, with some tasks even occurring at speeds below 100KB/s. None achieved a high synchronization speed of 50MB/s or better. After optimization, most tasks fell within the high-speed ranges of 5MB/s to 10MB/s, 10MB/s to 50MB/s, and 50MB/s to 100MB/s, with some even achieving ultra-high speeds of 100MB/s.
The average speed before optimization was 2.28MB/s, improving by a factor of seven to 15.9MB/s after optimization.
Synchronization-task parameter tuning is among the most common application scenarios for DataOps-driven OPS optimization driven in offline big data services, in which optimal parameter configuration through data indicators and algorithm models is generally the best approach.
The first step of this process helps the service to solve historically inefficient synchronization tasks. The second step can be applied to the service platform as a configuration standard for applications to be launched online, automatically generating the optimal configuration recommendation while the user configures synchronization tasks and solving the synchronization efficiency problem from the source. In addition, improvement in the speed of synchronization tasks may bring about concentration of transmission traffic, resulting in high peaks in bandwidth usage. To alleviate bandwidth pressure, off-peak start-up time for synchronization tasks can be set according to their expected speeds.
In our next article, “The DataOps Files IV: Smart Resource Allocation”, we will introduce how to make your resource utilization more efficient, your resource isolation more stable, and your resource allocation more flexible.