This article is part 2 of a 5-part mini-series on DataOps at Alibaba. This installment looks at how Alibaba has been researching DataOps technologies to find the perfect model for forecasting future resource consumption, allowing you to better manage your capacity.
If your computer cluster is no longer big enough and you need to buy new equipment, there are a number of factors to consider. Computers often take considerable lengths of time to arrive, and at Alibaba, materials are generally prepared three months in advance and orders confirmed one month ahead of time. The size of the purchase also directly affects the stability of the cluster and the utilization of resources.
In the previous article, which detailed an algorithm for project migration optimization, we introduced linear programming as a means of migrating projects between clusters, a method that allows you to make full use of your current resources. In this article, the focus is on Alibaba’s Big Data SRE team, which is constantly exploring ways to improve resource utilization and reduce costs, notably through data mining. Doing so requires moving beyond adjusting current resources to accurately forecasting the Alibaba Group’s future resource consumption.
Until recently, the forecasting method used in MaxCompute relied on the average growth rate, adjusted manually by operational experts. This often resulted in errors and wasted resources, and the back-and-forth between the procurement and finance departments drove costs up further. It was therefore decided to integrate resource consumption data and establish a more scientific forecasting model that could effectively reduce errors.
Data — Difficulties and Challenges
On the way to optimization, the Alibaba team faced a number of challenges, centering on:
· Very little sample data
The team only started collecting sample data during the algorithm design phase between 2015 and 2017, with just over 700 samples available today.
· Long forecast times
Given the lead time between ordering computers and their arrival, the team has to forecast the average resource consumption for the next 30–60 days, 60–90 days, and 90–120 days. Of course, the longer the forecasting period, the greater the number of uncontrollable factors, and the harder it becomes to keep errors under control.
· Noisy data
The data acquisition link is long and error-prone, which results in data anomalies. Meanwhile, historical resource consumption is mixed with sudden surges caused by unexpected business growth, and these surges are difficult to strip out because no records of them were kept.
· Other influencing factors
MaxCompute clusters provide computing services for all of the Group’s business units (BUs), and the resource consumption of each BU is closely linked to its own business development. This consumption can be affected by numerous factors, and data for these factors is difficult to obtain. During sudden surges in business, forecasting future consumption rates becomes more difficult.
Common Forecasting Models
There are currently two major schools of thought in the industry when it comes to forecasting models — traditional time series models and machine learning models (which includes deep learning models).
The traditional time series method forecasts future values exclusively from the historical time series. Time series models mainly focus on the near future and tend to overlook the distant future. Of these models, the exponential smoothing and ARIMA models stand out in terms of forecasting accuracy.
Machine learning models forecast by constructing and selecting feature data that may affect the forecasted values, and then applying models such as linear regression and support vector machines. Constructing the feature data requires a real understanding of the scenario being forecasted, as the resulting model is complicated.
In recent years, a number of deep learning models that are based on neural networks have been applied, including RNN and its variant LSTM. Although deep learning models are not very interpretable, their forecasting accuracy is high. However, due to their complexity, deep learning models require a large number of training samples, or else over-fitting becomes an issue.
Designing an Optimization Algorithm
Data on Alibaba Group’s resource consumption
The following two figures show the daily consumption of computing resources at the Alibaba Group starting in February 2015, and the average resource consumption for each month in different years.
From these figures, it can be gauged that:
· Overall, resource consumption is rising, but not linearly.
· Resource consumption seems to fluctuate greatly by weekday, but resource consumption by week shows a certain regularity. Resource consumption is also greater on weekdays than on weekends.
· Monthly data shows certain seasonal discrepancies. For example, resource consumption during China’s Spring Festival and National Day holidays is considerably lower.
Bearing in mind the Group’s resource consumption and the need to procure new cluster equipment, the Alibaba team sought answers to the following questions.
· Should the monthly mean be used for forecasting?
Due to the lack of monthly data samples, it is not possible to effectively compare the performance of a test set on different models. The procurement time may also vary as the actual situation varies. Therefore, a more flexible forecasting model is needed.
· Should resource consumption for 60, 90, and 120 days in the future be forecasted?
Due to the large fluctuation in resource consumption from day to day, it is difficult to make an accurate forecast for a random day in the future. Instead, a machine learning forecasting model was used to forecast the average resource consumption for the next 30–60 days, 60–90 days, and 90–120 days. A time series model was used to perform single-point forecasts, and the results were averaged over each window.
· How are seasonal discrepancies dealt with?
For the seasonal terms, seasonal decomposition is performed on the data to extract the monthly seasonal index. Then, the monthly seasonality is removed from the source data and the models are established. The models’ forecasting results are finally added to the extracted seasonal terms.
· Should a machine learning model or a time series model be selected?
It is difficult for any single model to fully capture all of the data characteristics, so it was decided that machine learning models would be integrated with time series models to get the final forecast.
The team’s overall approach is as follows:
· Time series decomposition and seasonal adjustment
Time series decomposition is a common method of analyzing time series. Time series with seasonal factors can be decomposed into trend terms, seasonal terms, and random terms. Trend terms mainly capture long-term changes, while seasonal terms capture periodic changes, and random terms capture changes that cannot be interpreted by trending or seasonal effects.
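As a minimal sketch of the seasonal-adjustment step described above, the following computes a multiplicative monthly seasonal index, removes it from the series before modeling, and adds it back onto a forecast afterwards. Whether the team used a multiplicative or additive index is not stated in the article, so this form is an assumption.

```python
from collections import defaultdict

def monthly_seasonal_index(values, months):
    """Multiplicative seasonal index per calendar month:
    the ratio of each month's mean consumption to the overall mean."""
    overall_mean = sum(values) / len(values)
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    return {m: (sum(vs) / len(vs)) / overall_mean for m, vs in by_month.items()}

def deseasonalize(values, months, index):
    """Remove monthly seasonality so trend models see adjusted data."""
    return [v / index[m] for v, m in zip(values, months)]

def reseasonalize(values, months, index):
    """Add the extracted seasonal terms back onto a model's forecast."""
    return [v * index[m] for v, m in zip(values, months)]
```

Deseasonalizing and then reseasonalizing with the same index recovers the original series, which makes the round trip easy to sanity-check.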
· Machine learning model
- Basic flow: The machine learning model’s basic flow is as follows.
- Feature engineering
§ Feature construction
To construct features, relevant storage and memory data was added to the data source alongside the resource consumption data itself, and resource consumption during the project development phase was also taken into account. It was further noted that the fluctuations and growth rates of historical resource consumption, as well as the resource consumption of different task types, may affect future consumption. All of these factors were taken into account when extracting features.
§ Feature selection
Since many of the above features have a strong correlation, adding all of them to the model would cause redundancy, which affects the accuracy of the forecasting model. As such, statistics such as f_regression are used to filter the features, ending up with eight optimal features to be added to the model.
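A hedged sketch of this filtering step, assuming scikit-learn’s `SelectKBest` with `f_regression`; the synthetic data and the choice of `k` are illustrative only (the article’s pipeline settled on eight features).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200
# Hypothetical feature matrix: columns 0 and 1 drive the target,
# the remaining columns are pure noise.
X = rng.normal(size=(n, 10))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Keep the k features with the highest univariate F-scores
# (k=2 for this toy data; the article ends up with eight).
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
```

Because `f_regression` scores each feature independently, strongly correlated features get similar scores, which is exactly the redundancy the filtering is meant to reduce.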
· Algorithm selection
In terms of machine learning regression algorithms, various algorithms were tested, including linear regression, ridge regression, random forest, and support vector machine. Optimal parameters for each algorithm were then selected using five-fold cross-validation and GridSearch. The goodness of fit (R²) was finally used to evaluate the performance of each model on the test sets so as to find the best algorithm for this set of data. The Alibaba team found the performance of LinearSVM to be the best, with the R² of the three model test sets all above 80%.
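The parameter search described above might look roughly like the sketch below, using scikit-learn’s `GridSearchCV` over a `LinearSVR` on synthetic data; the parameter grid and data are assumptions, not the team’s actual configuration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVR

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))  # stand-in for the eight selected features
coef = rng.normal(size=8)
y = X @ coef + rng.normal(scale=0.1, size=300)

# Five-fold cross-validation over a small hypothetical grid, scored by R².
grid = GridSearchCV(
    LinearSVR(max_iter=10000),
    param_grid={"C": [0.1, 1.0, 10.0], "epsilon": [0.0, 0.1]},
    cv=5,
    scoring="r2",
)
grid.fit(X, y)
```

`grid.best_params_` then holds the winning combination, and `grid.best_score_` the cross-validated R² used to compare algorithms.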
· Time series model
For the time series models, forecasting involved only the daily resource consumption data (after data cleansing). Single-point forecasts were made first, and the results were then averaged over each forecast window.
- Exponential smoothing model
Exponential models are the most commonly used models for time series forecasts, but they perform well mainly in short-term forecasts. They include the single-exponent, double-exponent, and triple-exponent models.
§ The single-exponent model fits time series with only horizontal terms (no trend terms or seasonal terms). It forecasts the future based on a weighted average of the existing time series.
§ The double-exponent model (or Holt model) fits time series with horizontal and trend terms.
§ The triple-exponent model (or Holt-Winters model) fits time series with horizontal, trend, and seasonal terms.
Considering that weekly resource consumption has a certain regularity, the Alibaba team chose the Holt-Winters model to make its forecasts.
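As an illustration, a Holt-Winters fit with an additive trend and a weekly (period-7) additive seasonality might be sketched with statsmodels as follows; the toy series and the smoothing configuration are assumptions.

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Toy daily series with an upward trend and a weekly pattern,
# standing in for the cleansed consumption data.
rng = np.random.default_rng(2)
t = np.arange(28 * 4)
series = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 7) \
    + rng.normal(scale=1.0, size=t.size)

# Holt-Winters (triple exponential smoothing): additive trend,
# additive weekly seasonality.
model = ExponentialSmoothing(
    series, trend="add", seasonal="add", seasonal_periods=7
).fit()
point_forecasts = model.forecast(30)     # single-point daily forecasts
window_average = point_forecasts.mean()  # averaged over the window, as in the article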
- ARIMA model
Autoregressive integrated moving average — or ARIMA — is a relatively complex model. Its forecasted value is a linear function of the p most recent observed values and the q most recent forecast errors. Its general construction process is as follows:
· Model integration
Since different models have different characteristics and advantages, the Alibaba team hoped to reduce forecast errors by integrating the different models through a weighted average. To determine each model’s weight, the algorithm adjusts itself dynamically according to each model’s recent errors, so that the best-performing model receives a greater weight in the next forecast.
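One simple reading of this dynamic weighting, with weights inversely proportional to each model’s recent error (the exact scheme is not given in the article, so this is an assumption), can be sketched as:

```python
def fusion_weights(recent_errors):
    """Weight each model inversely to its recent error, so the model
    that performed best gets the largest weight in the next forecast."""
    inverse = [1.0 / e for e in recent_errors]
    total = sum(inverse)
    return [w / total for w in inverse]

def fused_forecast(forecasts, recent_errors):
    """Weighted average of the individual models' forecasts."""
    weights = fusion_weights(recent_errors)
    return sum(w * f for w, f in zip(weights, forecasts))
```

For example, a model whose recent error is half another’s receives twice its weight, and the weights always sum to one.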
To compare the performance of different models and eliminate the influence of dimensions, the Alibaba team used the coefficient of variation of the root-mean-square error — or CV RMSE — to measure the relative errors of each model.
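CV RMSE simply divides the root-mean-square error by the mean of the actual values, so errors on series of different scales become comparable:

```python
import math

def cv_rmse(actual, forecast):
    """Coefficient of variation of the RMSE: the RMSE normalized by
    the mean of the actual values, making errors scale-free."""
    n = len(actual)
    rmse = math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)
    return rmse / (sum(actual) / n)
```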
The average growth rate has long been the basic model for determining procurement solutions. For this reason, it was used as a control group for other models. This model calculates the average growth rate for the five months prior to a certain point in time, and then estimates the resource consumption for the next few months on the assumption that future consumption will grow at the same rate.
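One reading of this baseline (the exact windowing is an assumption) can be sketched as:

```python
def growth_rate_forecast(monthly, horizon):
    """Baseline model: average the month-over-month growth rates of the
    five months prior, then extrapolate forward at that constant rate."""
    last_six = monthly[-6:]  # six values give five month-over-month rates
    rates = [b / a - 1 for a, b in zip(last_six, last_six[1:])]
    avg_rate = sum(rates) / len(rates)
    forecasts, value = [], monthly[-1]
    for _ in range(horizon):
        value *= 1 + avg_rate
        forecasts.append(value)
    return forecasts
```

On a series that really does grow at a constant rate, this baseline is exact; its weakness is precisely the assumption that future consumption will keep growing at the recent average rate.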
Comparing different models
Each of the models was trained with samples from two time intervals, applying samples from July 2016 onwards as test sets, and using the previously established models to perform out-of-sample forecasts. This helped the Alibaba team evaluate each model’s forecasting capabilities. Meanwhile, for the test sets, data from the anomalous Spring Festival period was removed.
The following table uses the forecasting model for the next 30–60 days as an example to show the CV RMSE of each model’s test sets. The smaller the CV RMSE, the fewer the forecasting errors. The fused model was found to work best, followed by the ARIMA, Holt-Winters, and machine learning models, all of which were better than the average growth rate model.
Applying the Models
In the end, the fused model with the fewest forecasting errors was used to guide the procurement of MaxCompute equipment. The forecast result for the next 90–120 days was used to inform the manufacturers to prepare materials, the forecast result for the next 60–90 days to adjust the purchase volume, and the forecast result for the next 30–60 days as the basis for the official order placement. If you go to the official MaxCompute website, you can find graphs that display the resource consumption forecasts for each of the three model groups, as well as the daily converted purchase volumes.
After years of research in the field of DataOps, the Alibaba team has found that one of the most valuable practices in data O&M is project cost control and performance optimization. These are areas of concern for businesses of a certain size, and they often lie beyond the scope of the project code itself. This is where data, models, and algorithms come in. With its comprehensive understanding of statistical modeling, machine learning, and optimization, the Alibaba research team can help businesses better control costs, optimize their resource consumption, and improve overall performance.
In our next article, “The DataOps Files III: Data Synchronization”, we will examine how Alibaba has optimized data synchronization in order to improve the efficiency of tasks and lower entrance barriers for inexperienced users.