This article is part 5 of a 5-part mini-series on DataOps at Alibaba. This installment looks at how Alibaba is improving labor-intensive operations work throughout its ecosystem with enhanced anomaly detection methods.
In online systems, monitoring for data points that occur outside of expected patterns is essential to detecting fraud, intrusions, and other events that threaten networks. Known as anomaly detection, this practice has more recently become vital to the overall stability of operations, as the data generated in large online systems reaches enormous volumes.
At Alibaba Group, anomaly detection today extends far beyond mapping and alarm systems. As an engineering discipline that integrates data science, application engineering, process control, root cause measurement models, and machine learning, it encompasses the daily work of the operation and maintenance personnel responsible for handling emergencies and ensuring stability across the Alibaba ecosystem. To reduce their burden, improving available methods and strategies has become a key focus of Alibaba’s ongoing optimization work.
In this article, we look at the definitions surrounding anomalies and explore the methods supporting Alibaba’s increasingly automated monitoring and stability solutions.
Fundamentals of Anomaly Detection
In one definition, statistician D.M. Hawkins writes that an anomaly is “an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”
Within these parameters, anomalies may or may not actually reflect abnormal occurrences, making detection strategy a complex problem that is especially difficult to adapt into machine language and monitoring methods. Mainstream methods generally fall within the four categories of density-based, statistics-based, deviation-based, and distance-based detection. Across these, normal data is defined according to a known distribution, while anomalies are parts or points of an array that deviate significantly from this known distribution.
Because anomalies are relative values, detecting them is a matter of determining the monitoring indicator they are defined against. Essentially, this is a curve of data points, with time-based historical data as one dimension and variable criteria to be monitored as the other. Treating the former as the basis for observation, anomalies can be understood as occurrences which have not appeared in previous historical data. However, the causes of these variations may not always reflect true anomalies, as for example when a routine system restart unnecessarily triggers an alarm.
Therefore, monitoring systems must be subdivided for the sake of follow-up procedures; retrieving information from a data set for detection purposes is formally called anomaly detection, while using superset experience to apply human judgement to alarm instances is known as alarm subscription.
Challenges in Anomaly Detection
Many variations in data trends reflect normal patterns with logical explanations. Monitoring systems thus need to be able to account for ordinary fluctuations in order to prevent a high rate of false alarms, which in turn presents a range of challenges for strategy.
In the following graph of a single day’s activity, it might appear that the sharp decline indicated by the red arrow is an anomaly, considering the relative steadiness of changes to either side of it.
Seen on a timeline of two days, however, this decline appears more clearly as the daily occurrence it represents.
In statistical terms, this phenomenon is called seasonality, and forms the first problem monitoring systems must resolve. As a rule of thumb, a sound strategy should never raise an alarm around the same fixed time each day, since a change that recurs daily is part of the normal pattern.
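One common way to account for seasonality is to compare each point with the values seen at the same time of day on previous days, rather than with its immediate neighbors. The following is a minimal sketch of that idea, not Alibaba's method; the function name `seasonal_residual`, the three-day lookback, and the sample data are all assumptions made for illustration.

```python
from statistics import median

POINTS_PER_DAY = 1440  # assumption: one data point per minute


def seasonal_residual(series, index, days_back=3, period=POINTS_PER_DAY):
    """Return series[index] minus the median of the values observed at the
    same time of day on up to `days_back` previous days."""
    history = [
        series[index - k * period]
        for k in range(1, days_back + 1)
        if index - k * period >= 0
    ]
    if not history:
        raise ValueError("need at least one full period of history")
    return series[index] - median(history)


# Hypothetical data: a flat signal with a routine dip at minute 600 every day.
day = [100.0] * POINTS_PER_DAY
day[600] = 20.0
series = day * 3

# The routine dip on day three matches earlier days, so its residual is zero.
routine_dip = seasonal_residual(series, 2 * POINTS_PER_DAY + 600)

# The same drop at an unusual time of day stands out clearly.
series[2 * POINTS_PER_DAY + 900] = 20.0
unusual_dip = seasonal_residual(series, 2 * POINTS_PER_DAY + 900)
```

A threshold applied to the residual, rather than to the raw value, would then ignore the daily dip while still reacting to the unusual one.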
When the variability of a measured value differs significantly across points in time, a statistical phenomenon known as heteroscedasticity is at work. Taking the same chart discussed above as an example, the fluctuations among values found at nighttime (indicated by the red arrows) are much greater than those seen during daytime hours.
Heteroscedasticity impacts alarm strategy in that it requires strategies to deploy different approaches for different phases of the day. In this case, management is needed to prevent an excess of alarms requiring an overbearing amount of attention from technicians during nighttime operations.
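To make the idea of phase-specific strategies concrete, the sketch below computes a separate tolerance band for each hour of the day, so that noisy hours get wider bands than quiet ones. It is only an illustration under simple assumptions: the helper names `hourly_bands` and `is_anomalous`, the band width of three standard deviations, and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def hourly_bands(history, k=3.0):
    """history: iterable of (hour_of_day, value) pairs from past normal data.
    Returns {hour: (low, high)}, where each band is the hour's mean plus or
    minus k standard deviations, computed separately per hour."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    bands = {}
    for hour, values in by_hour.items():
        m, s = mean(values), stdev(values)  # stdev needs >= 2 values per hour
        bands[hour] = (m - k * s, m + k * s)
    return bands


def is_anomalous(bands, hour, value):
    """True if value falls outside the band for the given hour."""
    low, high = bands[hour]
    return value < low or value > high


# Hypothetical history: hour 3 (night) is noisy, hour 14 (day) is steady.
history = [(3, v) for v in (80, 120, 90, 110, 100)]
history += [(14, v) for v in (99, 101, 100, 100, 100)]
bands = hourly_bands(history)

night_alarm = is_anomalous(bands, 3, 110)   # within normal nighttime noise
day_alarm = is_anomalous(bands, 14, 110)    # far outside the daytime band
```

With these sample values, a reading of 110 is tolerated at night but flagged during the day, which is the behavior a single global threshold cannot provide.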
Just as daily cycles show variations at different periods, so do those spanning weeks or months. When a dataset simultaneously reflects two or more cyclical patterns, a phenomenon known as complex cyclicity is at work.
In the chart above, for example, a common pattern of highs and lows in daily activity is shown across three datasets from different platforms. However, the peak values indicated by the white line show significant variation within the course of a week, with the highest peak occurring on Wednesday and the lowest on Saturday.
Adding to the challenges of complex cyclicity, the number of days in a month ranges from 28 to 31, while even the precise number of days in a year averages to 365.25, rather than an integer.
Evaluating Monitoring Strategies
To assess the quality of a monitoring strategy, several indicators can be applied using a receiver operating characteristic (ROC) curve. These indicators are the false alarm rate, or probability that a normal data point is classified as an anomaly; the detection rate, or probability that an anomaly will be detected; and the anomaly rate, or proportion of anomalies to total points in a data set.
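Given a labeled data set, the three indicators can be computed directly from their definitions. The sketch below assumes two equal-length lists of booleans, one marking real anomalies and one marking raised alarms; the function name `strategy_rates` and the sample data are hypothetical.

```python
def strategy_rates(labels, alarms):
    """labels[i] is True if point i is a real anomaly;
    alarms[i] is True if the strategy raised an alarm for point i.
    Assumes the data contains at least one anomaly and one normal point."""
    if len(labels) != len(alarms):
        raise ValueError("labels and alarms must have the same length")
    detected = sum(1 for is_anom, alarm in zip(labels, alarms) if is_anom and alarm)
    false_alarms = sum(1 for is_anom, alarm in zip(labels, alarms) if not is_anom and alarm)
    anomalies = sum(1 for is_anom in labels if is_anom)
    normals = len(labels) - anomalies
    return {
        "false_alarm_rate": false_alarms / normals,  # normal points flagged
        "detection_rate": detected / anomalies,      # anomalies caught
        "anomaly_rate": anomalies / len(labels),     # share of anomalies
    }


# Hypothetical sample: 10 points, 2 real anomalies, 3 alarms raised.
labels = [False, False, True, False, False, False, True, False, False, False]
alarms = [False, True, True, False, False, True, False, False, False, False]
rates = strategy_rates(labels, alarms)
```

Here one of the two anomalies is caught and two of the eight normal points are flagged, giving a detection rate of 0.5 and a false alarm rate of 0.25.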
In the ROC curve, the detection rate (DR) is mapped against the false alarm rate (FAR), as shown in the following image:
The graph indicates that any increase in the detection rate is accompanied by an increase in the false alarm rate. More importantly, once the detection rate passes a certain threshold, each further increase in it leads to an exponential rise in the false alarm rate. This means that this detection strategy will be unable to maintain a low false alarm rate if it achieves a detection rate close to 1 (i.e., 100 percent), and the damage caused by false alarms will grow accordingly. Therefore, this model is unlikely to achieve a detection rate of 99 percent or higher.
While the above example represents data from a testing environment and is already problematic, monitoring complex business data such as transactions on Alibaba’s Taobao platform will yield even more challenging results.
The following problem can further illustrate these difficulties:
Suppose that data points are generated for every minute of the day, yielding 1440 points daily; the anomaly rate is .001, meaning an average of 1.4 points per day should be anomalies; the detection rate is .99, such that there is a 99 percent chance of detecting an anomaly when it occurs; and the false alarm rate is .05, such that the detection strategy has a five percent chance of classifying a normal data point as an anomaly. In light of the above, if an alarm has been activated, what should be the probability of an actual anomaly having occurred in this system?
The answer is that there is only about a two percent chance an actual anomaly has occurred. Roughly 5.1 percent of all points trigger an alarm while only 0.1 percent are anomalies, so alarms are about fifty times more frequent than anomalies, making this method highly problematic.
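The arithmetic behind this answer can be checked in a few lines. The sketch below uses only the figures stated in the problem; the variable names are chosen here for illustration.

```python
points_per_day = 1440      # one point per minute
anomaly_rate = 0.001       # share of points that are real anomalies
detection_rate = 0.99      # chance an anomaly triggers an alarm
false_alarm_rate = 0.05    # chance a normal point triggers an alarm

# Probability that a random point is an anomaly AND raises an alarm.
p_true_alarm = anomaly_rate * detection_rate
# Probability that a random point is normal AND raises an alarm.
p_false_alarm = (1 - anomaly_rate) * false_alarm_rate
# Overall probability that a random point raises an alarm.
p_alarm = p_true_alarm + p_false_alarm

# Probability that an alarm corresponds to a real anomaly (about 0.019).
p_anomaly_given_alarm = p_true_alarm / p_alarm

# Daily expectations: about 1.4 anomalies against roughly 73 alarms.
expected_anomalies_per_day = points_per_day * anomaly_rate
expected_alarms_per_day = points_per_day * p_alarm

print(f"P(anomaly | alarm) = {p_anomaly_given_alarm:.4f}")
print(f"anomalies/day = {expected_anomalies_per_day:.2f}, "
      f"alarms/day = {expected_alarms_per_day:.1f}")
```

Almost all of the daily alarms come from the five percent false alarm rate applied to the very large number of normal points.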
Going deeper: Bayesian conditional probability
One important method for establishing confidence in detection results is the Bayesian conditional probability formula, which is specifically applied to determine false positive deviations. It is calculated as follows:
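In its standard form, with A the event of interest and B the observed evidence, Bayes' theorem can be written as below. The second equality expands P(B) over the two cases "A" and "not A", which is the form used when only a detection rate and a false positive rate are known.

```latex
P(A \mid B)
  = \frac{P(B \mid A)\, P(A)}{P(B)}
  = \frac{P(B \mid A)\, P(A)}
         {P(B \mid A)\, P(A) + P(B \mid \neg A)\, P(\neg A)}
```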
A classic problem where this formula can be applied is the following hypothetical scenario:
Suppose that the incidence of a disease is known to be .001, such that it occurs in one out of every 1000 people. A known reagent can test whether a patient has this illness with an accuracy rate of .99, such that it is 99 percent accurate in cases where the patient is in fact suffering from the illness. The false positive rate, however, is .05, such that five percent of patients without the illness will test positive for it. If a patient’s test result is positive, what should be the probability he or she is ill?
With all other variables held constant, the prior probability (or incidence rate expected regardless of the test) is .001, referred to as P(A). Assuming a positive test, or “B” event, then the goal is to calculate P(A|B) — the posterior probability of the illness following the test. Put another way, this could be treated as the probability of an anomaly (A) in the event of an alarm (B). Using the Bayesian formula, the actual result in this case is about .02, as shown in the calculation below.
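Substituting the figures from the scenario, with P(A) = 0.001, P(B | A) = 0.99, and P(B | not A) = 0.05, the calculation can be worked through as:

```latex
P(A \mid B)
  = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999}
  = \frac{0.00099}{0.00099 + 0.04995}
  = \frac{0.00099}{0.05094}
  \approx 0.0194
```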
In short, the equation indicates that in this hypothetical scenario a patient should seek further confirmation before worrying about a positive test outcome.
Toward Automated Detection Solutions
The ultimate goal in anomaly detection is to increase the detection rate while reducing the false positive rate, and by doing so to achieve simplicity and even automation in monitoring processes. Ideally, this can enable automatic fault detection, automatic emergency response, and other enhanced system-level solutions.
The threshold criterion for pursuing automation strategies is that when an alarm is triggered, the chance of an anomaly having occurred must be greater than 50 percent. To meet this by accurately distinguishing false positives, Alibaba Group uses an indicator called the happiness index, defined as the probability that a reported alarm indicates a true anomaly.
While the detection rate and the false positive rate will vary with changes in alarm strategy, the actual number of anomalies is a constant. The following heat map compares the detection rate (on the lateral axis) with the false positive rate (on the vertical axis), with deeper shades indicating a higher position on the happiness index; to facilitate observation, the detection rate has been set to start from 0.9.
The heat map illustrates first of all that the false positive rate is more likely than the detection rate to affect the happiness index. Further, the magnitude by which the detection rate increases has little impact on the happiness index, as seen in the relatively constant coloration when moving along the horizontal axis from any point on the vertical axis. Finally, it shows that the happiness index can only reach the 50-percent threshold when the false positive rate has reached roughly .007 percent — in other words, seven points in every 100,000.
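The first two observations can be reproduced with a small sketch that computes the happiness index as a Bayesian posterior. Note the assumption: the article does not state the anomaly rate behind the heat map, so the value of 0.001 below is borrowed from the earlier worked problem, and the absolute numbers will therefore not match the heat map. The function name `happiness_index` is also chosen here for illustration.

```python
def happiness_index(detection_rate, false_positive_rate, anomaly_rate):
    """Probability that an alarm corresponds to a true anomaly."""
    true_alarms = detection_rate * anomaly_rate
    false_alarms = false_positive_rate * (1 - anomaly_rate)
    return true_alarms / (true_alarms + false_alarms)


ANOMALY_RATE = 0.001  # assumption: taken from the earlier example, not the heat map

# Each printed row fixes a false positive rate and sweeps the detection rate.
for fpr in (0.05, 0.01, 0.001):
    row = [round(happiness_index(dr, fpr, ANOMALY_RATE), 3)
           for dr in (0.90, 0.95, 0.99)]
    print(f"false positive rate {fpr}: {row}")

# Effect of raising the detection rate at a fixed false positive rate...
gain_from_dr = (happiness_index(0.99, 0.05, ANOMALY_RATE)
                - happiness_index(0.90, 0.05, ANOMALY_RATE))
# ...compared with lowering the false positive rate at a fixed detection rate.
gain_from_fpr = (happiness_index(0.90, 0.001, ANOMALY_RATE)
                 - happiness_index(0.90, 0.05, ANOMALY_RATE))
```

Within each printed row the values barely move as the detection rate rises from 0.90 to 0.99, while moving between rows changes the index dramatically, which is the pattern the heat map shows.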
A false positive rate of .007 percent means that if data is monitored once every minute (or 1440 times per day) then roughly ten days will pass between each incidence of a false alarm. For a massive data system like Alibaba’s, this marks an incredible achievement. For perspective, if the happiness index were to reach 99 percent, roughly 190 years would pass between each instance of a false alarm.
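The ten-day figure follows from simple arithmetic, sketched below. The function name `days_between_false_alarms` is hypothetical, and the calculation treats every monitored point as normal, which is a close approximation when anomalies are rare.

```python
def days_between_false_alarms(false_positive_rate, checks_per_day=1440):
    """Average number of days between false alarms, assuming each check is
    independent and practically all monitored points are normal."""
    expected_false_alarms_per_day = false_positive_rate * checks_per_day
    return 1 / expected_false_alarms_per_day


# .007 percent expressed as a probability is 0.00007.
days = days_between_false_alarms(0.00007)
print(f"about {days:.1f} days between false alarms")
```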
Looking Forward for Alibaba
In the ongoing challenge to balance the detection rate with the false positive rate, an ideal outcome requires that the former exceed 90 percent while the latter remains below five percent. Alibaba Group’s current requirement that the detection rate exceed 99 percent inherently results in a high false positive rate, creating a heavy burden for on-duty personnel.
To improve the happiness of its workers, the Group is currently looking beyond single-curve anomaly detection methods to deploy higher-level alarm aggregation, anomaly positioning, and other approaches that challenge traditional modes of thinking. As innovation continues, these efforts offer the best immediate hope of reducing the risk of false negatives while tackling the persistent pain of false positives.