Enhancing Capacity Planning through Full-scale Stress Testing

The Alibaba Tech Team’s “nuclear weapon” in testing technology

Capacity Planning in PRINCIPLE

Capacity planning for Double 11 aims to answer two key questions: What is the projected traffic volume during the event, and how many machines will be needed to support that volume? Answering the first question is a straightforward case of using prediction algorithms and looking at historical data.
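As a naive first pass at the second question, the machine count can be estimated by dividing the projected peak by what one machine can safely handle. The sketch below is only an illustration: the function name, the per-machine throughput of 500 requests per second, and the 75% utilization target are hypothetical values chosen for the example, not figures from Alibaba. The 142,000 requests-per-second peak is the 2016 estimate quoted later in this article.

```python
import math

def machines_needed(peak_qps: float, per_machine_qps: float,
                    target_utilization: float = 0.75) -> int:
    """Naive estimate of machines required to serve peak_qps.

    per_machine_qps is the measured limit of a single machine;
    target_utilization (hypothetical default) keeps each machine
    below that limit to leave headroom for fluctuations.
    """
    if per_machine_qps <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("per_machine_qps and target_utilization must be positive")
    usable_qps = per_machine_qps * target_utilization
    return math.ceil(peak_qps / usable_qps)

# Hypothetical per-machine limit of 500 requests per second.
estimate = machines_needed(peak_qps=142_000, per_machine_qps=500)
```

An estimate like this treats every system in isolation and ignores the dependencies between systems, so it should be read as a starting point only.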

Capacity Planning in PRACTICE

To address this problem, the Alibaba tech team introduced “full-scale stress testing” as an additional stage in their capacity planning process. This key step simulates the same business scenario and traffic volume as Double 11 across the whole platform, painting the team a more realistic picture of capacity requirements. Since traffic can still fluctuate unexpectedly during Double 11, the tech team also developed traffic control mechanisms to mitigate the problems that arise at peak capacity.

Per-system Machine Capacity Estimation

Per-system machine stress testing can be achieved in four ways: request simulation, request replication, request forwarding, and load balancing adjustment. Each of these methods fits the needs of specific scenarios, but each also comes with certain drawbacks.

Simulation

Request simulation is relatively easy: simulated queries can be generated with open source or commercial tools such as Apache Bench, Webbench, http_load, Apache JMeter, and LoadRunner. It is best performed on unlaunched or low-traffic systems, because the differences between simulated and real requests can skew the stress test results, and the simulated requests can pollute backend data stores.
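To make the idea concrete, here is a minimal sketch of a request simulator using only the Python standard library. It is a toy stand-in for the tools named above, not a replacement for them; the function names `fetch` and `run_load`, the timeout, and the example URL in the comment are all assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str) -> float:
    """Send one GET request and return its latency in seconds."""
    start = time.perf_counter()
    with urlopen(url, timeout=5) as resp:
        resp.read()
    return time.perf_counter() - start

def run_load(send, total_requests: int, concurrency: int) -> dict:
    """Call send() total_requests times from `concurrency` threads.

    send must return a latency in seconds or raise on failure.
    """
    def timed(_):
        try:
            return send()
        except Exception:
            return None  # count any error as a failed request

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(total_requests)))

    latencies = sorted(r for r in results if r is not None)
    return {
        "ok": len(latencies),
        "failed": total_requests - len(latencies),
        "median_s": latencies[len(latencies) // 2] if latencies else None,
        "max_s": latencies[-1] if latencies else None,
    }

# Example against a hypothetical test endpoint (never a production store):
# stats = run_load(lambda: fetch("http://localhost:8080/"), 1000, 50)
```

Passing `send` as a callable keeps the load loop independent of the transport, so the same loop can drive a stubbed request in a dry run before it is pointed at a real test system.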

Replication

Query replication draws its samples from an actual operating environment, but it still runs the risk of data pollution. It also requires a specially earmarked machine to intercept and copy requests, which makes it suitable only for systems that receive fewer queries.

Redirect

Request forwarding, on the other hand, diverts queries from the other machines in a distributed system to a single machine. This raises the load on that machine without any hand-written queries and provides highly accurate test results with no data pollution; it is also the most commonly used method at Alibaba. Convenient as it may be, it does require a large volume of real queries, without which the precise bottleneck values cannot be determined.
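The dependence on query volume can be illustrated with a small sketch. It steps up the load forwarded to one machine until latency crosses a limit, and reports whether the bottleneck was actually reached. Everything here is hypothetical: the function name, the `measure_latency_ms` callback (which in practice would read real monitoring data), the step size, and the toy latency curve used in the example.

```python
def find_bottleneck_qps(measure_latency_ms, start_qps, step_qps,
                        available_qps, latency_limit_ms):
    """Raise the forwarded load step by step.

    Returns (last_good_qps, bottleneck_found). bottleneck_found is False
    when the available real traffic ran out before the limit was hit.
    """
    last_good = None
    qps = start_qps
    while qps <= available_qps:
        if measure_latency_ms(qps) > latency_limit_ms:
            return last_good, True
        last_good = qps
        qps += step_qps
    return last_good, False

def toy_latency_ms(qps):
    """Toy curve: flat 20 ms up to 600 QPS, then rising 1 ms per extra QPS."""
    return 20 if qps <= 600 else 20 + (qps - 600)

# Enough real traffic to forward: the bottleneck is located.
enough = find_bottleneck_qps(toy_latency_ms, 100, 100, 1000, 100)
# Too little real traffic: the search stops before the machine saturates.
too_little = find_bottleneck_qps(toy_latency_ms, 100, 100, 500, 100)
```

In the second call the machine never saturates, so the result only says that 500 requests per second is safe, not where the limit lies.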

Load balancing

With load balancing adjustment, the weight assigned to a designated machine on the load balancing device is raised so that more requests are routed to that machine. This produces accurate results and no data pollution but, as with request forwarding, it requires a large quantity of queries (within a distributed system) to be effective.
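The effect of a weight change can be worked out with simple proportions, assuming a weighted scheme in which each machine receives traffic in proportion to its weight. That assumption, the function names, and the machine names and numbers in the example are illustrative only and do not describe any particular load balancing device.

```python
def expected_qps(weights: dict, target: str, total_qps: float) -> float:
    """Traffic the target machine receives under proportional weighting."""
    return total_qps * weights[target] / sum(weights.values())

def weight_for_target_qps(weights: dict, target: str,
                          total_qps: float, target_qps: float) -> float:
    """Weight the target needs so that it receives target_qps,
    keeping the weights of all other machines unchanged."""
    if not 0 < target_qps < total_qps:
        raise ValueError("target_qps must be between 0 and total_qps")
    others = sum(w for name, w in weights.items() if name != target)
    return target_qps * others / (total_qps - target_qps)

# Four hypothetical machines sharing 1,000 requests per second equally.
weights = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
new_weight = weight_for_target_qps(weights, "a", total_qps=1000, target_qps=400)
weights["a"] = new_weight
```

The formula also shows why a large query volume is needed: the target machine can never receive more than the total traffic entering the cluster, however high its weight is set.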

Full-scale Stress Testing

The team has learnt the hard way that the rough calculations provided by per-system machine stress testing may be a good starting point, but do not guarantee performance during high-traffic events. Back in 2012, when the clock struck twelve on the eve of Double 11, many systems performed worse than anticipated. The actual volume of online users and transactions was much higher than expected, and the interdependence of the systems worsened the issue. Faced with error pages, many customers were ultimately forced to abandon their carts.

Traffic Control

Capacity planning, even when founded on a meticulous, precise business model, is predictive in nature and thus prone to imprecision. A recent example is the 2016 Double 11, where capacity had been prepared for an estimated peak of 142,000 requests per second, but actual traffic on the day exceeded that estimate by almost 13%, going over 160,000 requests per second. Pushed to the limits of operating capacity, the machines processed requests more slowly. The degraded user experience caused a loop of resubmitted search requests, which ultimately resulted in shutdown.
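The article does not say which algorithm Alibaba's traffic control uses. As one common way to cap the request rate a system accepts, the sketch below shows a token bucket limiter. It is an assumed, generic technique rather than the team's implementation, and the class name, parameters, and injectable clock are all choices made for this example.

```python
import time

class TokenBucket:
    """Admit at most rate_per_sec requests per second on average,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill tokens for the elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Usage: reject excess requests quickly instead of letting them queue up.
limiter = TokenBucket(rate_per_sec=10, capacity=5)
accepted = limiter.allow()
```

Rejecting excess requests early keeps the accepted ones within the capacity that was planned for, rather than letting every request slow down once the peak exceeds the estimate.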

Conclusion

The full-scale stress test is a watershed in backend readiness for Double 11 and, along with flow capacity control, is an integral part of Alibaba’s preparedness arsenal. Through its full-scale stress test, the team was able to stabilize the system’s response to high-volume traffic, plan resource expansion more effectively, and drastically shorten recovery periods.

Alibaba Tech

First-hand and in-depth information about Alibaba’s latest technology → Search “Alibaba Tech” on Facebook
