This article is part 4 of a 5-part mini-series on DataOps at Alibaba. This installment looks at how Alibaba’s Tdata platform improves resource utilization, isolation, and allocation in different applications.
In today’s data age, product and service usage data can provide great insight into what works and what doesn’t in product development. In essence, tech companies use the DataOps methodology to align the way data is managed with the specific goals they have for that data, and to support collaboration between the teams involved, including data scientists, engineers, and technologists.
Research on statistical modeling, machine learning, and overall optimization can help products meet business needs, balance resource allocation, and maximize cost-savings. These can be used to achieve more efficient resource utilization, more stable resource isolation, and more flexible resource allocation.
This fourth article in the DataOps series focuses on capacity management. It introduces the problems with existing allocation management and explains how resource groups on large-scale offline computing platforms and algorithm platforms can be allocated intelligently.
Resource Allocation Management Modes
In the past, resource management for MaxCompute and PAI followed a traditional “planned economy” model, in which the allocation of resources depended mainly on the budget application submitted by each quota group. This management mode has the following problems:
· Insufficient flexibility to smooth the peaks and troughs of resource usage between quota groups, resulting in wasted resources.
· When a failure occurs, operation and maintenance personnel can only add resources arbitrarily. This may lead to several rounds of repeated adjustments and can easily escalate the failure.
· Lack of metrics for user experience and satisfaction, so resources cannot be managed flexibly.
· Lack of a rolling forecasting mechanism for overall resource usage. Because machine procurement cycles are generally long, machines may not arrive in sufficient numbers when they are needed.
In response to these problems, the Alibaba tech team has established a set of resource management models based on operational optimization, time series forecasting, and elastic resource allocation driven by user satisfaction. These models can be applied to achieve more efficient resource utilization, stable resource isolation, and flexible resource allocation.
Tdata Resource Management Model and Algorithm Framework
Tdata is a management platform that integrates algorithms and models, developed by the Alibaba Big Data Infrastructure Engineering Team for typical operation and maintenance scenarios. Its underlying layer currently covers operational optimization models, machine learning models, and statistical analysis models. On top of these models, the platform packages solutions for operation and maintenance scenarios, such as anomaly detection and time series forecasting of resource consumption.
Based on Tdata’s models, algorithms, and solutions, the Alibaba tech team has built a resource management module for big data products, the architecture of which is shown in the following figure:
User Satisfaction Model
The user satisfaction model is an overall evaluation system made up of fine-grained indicators across multiple dimensions, such as user resource preemption. User satisfaction serves as the benchmark target, evaluation criterion, and feedback indicator for subsequent resource regulation: any adjustment of resources must keep user satisfaction at the established service level.
All satisfaction indicators are time series data. To better monitor user satisfaction, the team uses Tdata’s time series anomaly detection module to automatically identify points that deviate from the established pattern, as shown in the following figure.
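The source does not describe how Tdata's anomaly detection module works internally, so the following is only a minimal sketch of one common stand-in technique: flagging points in a satisfaction indicator that sit far from the mean of a trailing window. The function name, the window length, and the threshold are all illustrative assumptions.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(values, window=24, threshold=3.0):
    """Return indices of points that deviate from the trailing-window mean
    by more than `threshold` standard deviations.

    Hypothetical helper: `window` and `threshold` are assumed defaults,
    not values taken from Tdata.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]      # the `window` points before i
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Example: a stable indicator with a single sharp drop at index 24.
indicator = [0.95, 0.96] * 12 + [0.60] + [0.95, 0.96] * 3
flagged = rolling_zscore_anomalies(indicator, window=24, threshold=3.0)
```

A trailing window is used so that each point is judged only against data that was available before it, which matches how an online monitor would see the series.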
Optimization of Quota Group Resource Allocation
This is a series of optimizations focused on forecasting resource usage, ensuring even resource usage, guaranteeing resource availability and setting usage limits.
For each quota group, the Alibaba tech team compared exponential smoothing, ARIMA, and EWMA models on hourly resource request volumes from the past four weeks. The exponential smoothing model was eventually selected to forecast resource consumption for the subsequent week, because empirical results showed that it performed better for short-term forecasting.
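The source does not say how the candidate models were scored, so the following is a hedged sketch of one plausible procedure: hold out the most recent stretch of the series, forecast it from the earlier data, and keep the model with the lowest mean absolute error. The two forecasters here (a flat simple-exponential-smoothing forecast and a seasonal-naive forecast) are simple stand-ins chosen to keep the example self-contained; they are not the ARIMA or EWMA implementations the team used, and all function names are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def ses_forecast(train, horizon, alpha=0.3):
    """Simple exponential smoothing: flat forecast at the last smoothed level."""
    level = train[0]
    for y in train[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

def seasonal_naive_forecast(train, horizon, period):
    """Repeat the most recently observed season."""
    last_season = train[-period:]
    return [last_season[h % period] for h in range(horizon)]

def pick_best_model(series, holdout, candidates):
    """Score each candidate on the last `holdout` points; lowest MAE wins."""
    train, test = series[:-holdout], series[-holdout:]
    scores = {
        name: mean_absolute_error(test, forecast(train, len(test)))
        for name, forecast in candidates.items()
    }
    return min(scores, key=scores.get), scores

# Toy series with a period of 4 (real data would be hourly, period 168).
series = [10, 20, 30, 20] * 6
candidates = {
    "ses": lambda train, h: ses_forecast(train, h),
    "seasonal_naive": lambda train, h: seasonal_naive_forecast(train, h, 4),
}
best, scores = pick_best_model(series, holdout=4, candidates=candidates)
```

With four weeks of hourly data, a natural split under these assumptions would be three weeks for fitting and the final week as the holdout.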
Exponential smoothing models come in three types:
· Single exponential smoothing
This fits the time series with only a level (horizontal) term, with no trend term or seasonal effect. It forecasts the future based on a weighted average of the existing time series.
· Double exponential smoothing (Holt model)
This fits the time series with a level term and a trend term.
· Triple exponential smoothing (Holt-Winters model)
This fits the time series with a level term, a trend term, and a seasonal term.
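For reference, the three models above can be written as the following recurrences in their standard additive textbook form. The notation is an assumption introduced here, not taken from the original article: $y_t$ is the observed value, $\ell_t$ the level (the horizontal term), $b_t$ the trend, $s_t$ the seasonal term, $m$ the season length, $h$ the forecast horizon, and $\alpha, \beta, \gamma \in [0, 1]$ the smoothing parameters.

```latex
% Single exponential smoothing (level only)
\begin{align*}
\ell_t &= \alpha y_t + (1 - \alpha)\,\ell_{t-1} \\
\hat{y}_{t+h \mid t} &= \ell_t
\end{align*}

% Double exponential smoothing (Holt: level + trend)
\begin{align*}
\ell_t &= \alpha y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta(\ell_t - \ell_{t-1}) + (1 - \beta)\,b_{t-1} \\
\hat{y}_{t+h \mid t} &= \ell_t + h\,b_t
\end{align*}

% Triple exponential smoothing (additive Holt-Winters: level + trend + season)
\begin{align*}
\ell_t &= \alpha(y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta(\ell_t - \ell_{t-1}) + (1 - \beta)\,b_{t-1} \\
s_t &= \gamma(y_t - \ell_{t-1} - b_{t-1}) + (1 - \gamma)\,s_{t-m} \\
\hat{y}_{t+h \mid t} &= \ell_t + h\,b_t + s_{t+h-m(k+1)},
\qquad k = \left\lfloor \frac{h-1}{m} \right\rfloor
\end{align*}
```

Each model adds one component to the previous one, which is why the weighted average in the single model becomes a level-plus-trend extrapolation in the Holt model and a level-plus-trend-plus-season extrapolation in the Holt-Winters model.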
Because weekly resource consumption shows a certain periodicity, the Holt-Winters model was chosen to produce the forecasts used to set each quota group’s resource guarantee and to optimize its upper limit.
From the forecast of each quota group’s resource consumption for the subsequent week, the team obtained the distribution of the resource request volume that each quota group must satisfy. By combining this with the quota group’s historical user satisfaction data and the user-level SLA, the team was able to work out a recommended resource value for each quota group.
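As an illustration of the Holt-Winters model, here is a minimal pure-Python sketch of the additive variant. It is not the team's implementation: the initialization scheme, the function name, and the smoothing parameters are assumptions, and a production setup would normally fit the parameters rather than fix them by hand. For hourly data with weekly periodicity, `period` would be 168 (24 × 7).

```python
def holt_winters_additive(series, period, alpha, beta, gamma, horizon):
    """Additive Holt-Winters smoothing.

    Returns `horizon` forecasts for the steps following the end of `series`.
    Initialization (an assumed, simple scheme): the level is the mean of the
    first season, the trend is the average per-step change between the first
    two seasons, and the seasonal offsets come from the first season.
    """
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons of data")

    first = series[:period]
    second = series[period:2 * period]
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / (period * period)
    seasonals = [y - level for y in first]   # one offset per slot in the cycle

    # Update the three components for every observation after the first season.
    for t in range(period, len(series)):
        y = series[t]
        i = t % period
        prev_level, prev_trend = level, trend
        level = alpha * (y - seasonals[i]) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        seasonals[i] = gamma * (y - prev_level - prev_trend) + (1 - gamma) * seasonals[i]

    # Extrapolate level and trend, then add the matching seasonal offset.
    n = len(series)
    return [
        level + h * trend + seasonals[(n + h - 1) % period]
        for h in range(1, horizon + 1)
    ]

# Toy example with a period of 4 and no trend.
history = [10, 20, 30, 20] * 4
forecast = holt_winters_additive(history, period=4, alpha=0.5, beta=0.5,
                                 gamma=0.5, horizon=4)
```

On this toy series the components never change after initialization, so the forecast simply reproduces the repeating pattern. Real request volumes would exercise the trend and seasonal updates as well.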
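The article does not give the formula used to turn the forecast distribution into a recommendation, so the sketch below shows only one plausible approach under stated assumptions: the guarantee is set at a quantile of the forecast demand that corresponds to the SLA coverage target, and the upper limit is the forecast peak plus some headroom. The names `recommend_quota`, `sla_coverage`, and `headroom` are hypothetical, and historical user satisfaction is not modeled here. In practice it could be used to raise or lower the coverage target.

```python
import math

def quantile(sorted_values, q):
    """Linearly interpolated quantile of a sorted, non-empty list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def recommend_quota(forecast, sla_coverage=0.95, headroom=1.2):
    """Return (guarantee, upper_limit) for one quota group.

    Assumed policy, not the team's actual rule:
    - guarantee covers `sla_coverage` of the forecast demand points;
    - upper_limit is the forecast peak scaled by `headroom`.
    """
    ordered = sorted(forecast)
    guarantee = quantile(ordered, sla_coverage)
    upper_limit = max(guarantee, ordered[-1] * headroom)
    return guarantee, upper_limit

# Toy forecast of request volumes for the coming period.
guarantee, upper_limit = recommend_quota([10, 20, 30, 40, 50],
                                         sla_coverage=0.5, headroom=1.5)
```

Separating the guarantee from the upper limit lets a quota group keep a reserved baseline while still borrowing idle capacity during its peaks.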
Resource Procurement Forecast
The forecast described above is a short-term forecast of resource consumption, used when setting the resource guarantee value and upper limit value for each quota group. Because machine procurement cycles are long, a longer-term rolling forecast of resource consumption for the entire pool is also needed. The relevant forecasting models and principles are covered in the second part of the DataOps series.
Application: ODPS Public Cloud Resource Management
Resource allocation for the internal MaxCompute deployment within the Group is largely tied to the budget of each business unit (BU), which differs significantly from the public cloud. Because the public cloud is a service sold to external customers, it naturally carries more “market economy” features. As shown in the figure below, the public cloud includes two payment modes: prepaid and postpaid. In postpaid groups, different user levels are defined according to the sales situation. These features make resource management on the public cloud more flexible and give the team’s resource management model more room for application.
The Alibaba tech team gradually rolled out trial resource optimizations and adjustments on the postpaid general groups of the MaxCompute public cloud. With user resource request volume remaining largely stable, these trials substantially reduced resource consumption and cost, and the overall utilization rate of clusters improved significantly.
Application: PAI Resource Management
PAI’s resource management is similar to the Group’s internal ODPS resource management. However, the procurement cycle for PAI’s GPU resources is longer and business needs change considerably. As a result, resources allocated under previously submitted budgets often go underused, causing a low overall resource utilization rate across clusters. A more flexible resource allocation strategy is therefore needed to improve resource utilization.
The allocation strategy has been running online for one month. Without affecting users’ execution of algorithm tasks on the platform, it has substantially reduced resource consumption and cost, and the overall utilization rate of clusters has improved significantly (as shown below).
By applying statistical modeling, machine learning, and overall optimization, the work outlined here helps the product meet business needs, keep cluster capacity at a stable level, balance resource allocation, and maximize cost savings. These techniques can be used to achieve more efficient resource utilization, more stable resource isolation, and more flexible resource allocation.
Moving forward, the Alibaba tech team continues to refine resource management, drawing a more reasonable distinction between users of different levels and tuning the user satisfaction model so that it better reflects the real user experience. The team is also extending the resource management model to a wider scope, making full use of the scale of the platform to smooth peaks and troughs, and further exploring ways to improve resource utilization.
In our final article of the series, “The DataOps Files V: Anomaly Detection”, we will explore the statistics and methods operations personnel rely on to detect outliers and direct their maintenance efforts.