Beyond Hystrix: How Alibaba’s Sentinel Watches Over Distributed Systems

With Sentinel, Alibaba has introduced a flow control system that compares favorably with Netflix’s Hystrix component.


This article is part of the Alibaba Open Source series.

In large distributed networks, one key focus for developers is ensuring that a failure at any one node will not generate cascading failures throughout the system. For Alibaba, the open source flow control system Sentinel is essential to meeting this requirement, managing the movement of data among its network nodes.

First launched by Alibaba’s middleware team, Sentinel primarily uses traffic as an entry point for enabling users to protect the stability of services in multiple dimensions, including flow control, circuit breaking and downgrading, and system load protection. As such, it differs significantly from the well-known open source component Netflix Hystrix, which Alibaba applied in its previous circuit breaking and downgrading library.

This article looks in detail at key similarities and differences between these two components, from their resource and execution models to statistics for real-time metrics.

Differences of Focus: Sentinel and Hystrix

According to its official introduction on GitHub, Hystrix is a library that lets users control interactions between distributed services by adding latency tolerance and fault tolerance logic. Hystrix does this by isolating points of access between services, stopping cascading failures across these services, and providing fallback options, all of which improve the overall resiliency of systems.

Put briefly, Hystrix’s focus is on providing a fault-tolerance mechanism based on isolation and circuit breaking, with which calls that time out or hit an open circuit fail quickly and can fall back to an alternative. By contrast, Sentinel’s key focus areas are diversified flow control, circuit breaking and downgrading, system load protection, and real-time monitoring with dashboard control.

The following sections explore comparisons between the two models in specific areas.

Hystrix’s resource model is built on the command pattern: the call to an external resource and its fallback logic are encapsulated in a command object (HystrixCommand / HystrixObservableCommand) whose underlying execution is based on RxJava. Each Command is created with a specified commandKey and groupKey (to distinguish resources) and a corresponding isolation strategy (Bulkhead Pattern or Semaphore Isolation Pattern). In the Bulkhead Pattern, users must configure the parameters of the thread pool (the thread pool name, capacity, queue timeout, and so on), after which the Command is executed in the specified thread pool according to the specified fault tolerance strategy. In the Semaphore Isolation Pattern, users must configure the maximum number of concurrent calls, and Hystrix limits concurrency to that number when the Command is executed.
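To make the command pattern with thread pool isolation concrete, here is a minimal sketch that uses only the JDK. It is not the Hystrix API: the class name ToyCommand, its execute method, and the timeout parameter are hypothetical and exist only to show how a call and its fallback can travel together in one object that runs on a dedicated pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Toy command object (hypothetical): only the shape of the pattern, not Hystrix itself.
final class ToyCommand<T> {
    private final Supplier<T> call;      // the protected call to an external resource
    private final Supplier<T> fallback;  // what to return when the call cannot complete

    ToyCommand(Supplier<T> call, Supplier<T> fallback) {
        this.call = call;
        this.fallback = fallback;
    }

    // Runs the call on the resource's own pool and waits at most timeoutMillis.
    T execute(ExecutorService pool, long timeoutMillis) {
        Callable<T> task = call::get;
        Future<T> future;
        try {
            future = pool.submit(task);
        } catch (RejectedExecutionException e) {
            return fallback.get();               // pool is saturated or shut down
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);                 // give up on a slow or failed call
            return fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt flag
            future.cancel(true);
            return fallback.get();
        }
    }
}
```

Giving each resource its own ExecutorService is what provides the isolation in this sketch; the extra thread hand-off on every call is also where the context-switching cost comes from.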

Compared with Hystrix, Sentinel’s design is far simpler. Sentinel’s resource definition is less coupled to rule configuration than a Hystrix Command, which relies heavily on isolation rules. The reason for this heavy dependence is that isolation rules directly affect the execution of the Command. At execution time, Hystrix parses the Command’s isolation rules to create the RxJava Scheduler and schedule execution on it. In thread pool mode, the Scheduler’s underlying thread pool is the configured thread pool, while in semaphore mode, the Scheduler simply runs the Command on the current thread. Sentinel does not specify an execution model, and does not care how the application is executed.

Sentinel’s principle is very simple: apply the flow control, downgrading, or load protection strategy to a resource according to the rules configured for that resource. Resource definitions and rule configurations are separate in Sentinel; the user first defines resources for the corresponding business logic through the Sentinel API (event tracking) and then configures the rules when needed.

Sentinel provides two ways to perform event tracking: try-catch mode (through SphU.entry(…)), by which the user performs exception handling/fallback in the catch block, and if-else mode (through SphO.entry(…)), by which the user performs exception handling/fallback when “false” is returned.

In its next version, Sentinel will introduce an annotation-based resource definition, and the exception handler function and fallback function will be specifiable through annotation parameters.

Currently, Sentinel offers diversified rules configuration methods. In addition to registering rules directly into the memory state through the loadRules API, users can also register various external data sources to provide dynamic rules. Users can dynamically change rule configurations based on current real-time situations in the system, upon which the data source will push the changes to Sentinel to take effect immediately.

Isolation is one of Hystrix’s core functions, for which Hystrix provides two isolation strategies: the Bulkhead Pattern and the Semaphore Isolation Pattern.

The Bulkhead Pattern is the more commonly recommended and used of these two strategies. It creates a separate thread pool for each resource, so that calls to different services run in different thread pools. When a pool’s queue backs up or its calls time out, those calls can fail quickly and fall back. The advantage of the Bulkhead Pattern is its relatively high degree of isolation: problems can be handled within one resource’s thread pool without affecting other resources. However, this comes at the cost of relatively high thread context switching overhead, which has a particularly large impact on low-latency calls.

Despite these advantages, the Bulkhead Pattern does not deliver many benefits in actual practice. The first reason is that too many thread pools greatly affect performance. For example, when Hystrix is used inside a servlet container like Tomcat, Tomcat’s own thread count is already high (possibly several tens of threads, or more than 100); adding the thread pools that Hystrix creates for each resource brings the total to several hundred threads, and context switching then incurs very high overhead. In addition, the relatively complete isolation of the Bulkhead Pattern lets Hystrix handle queuing and timeouts separately for each resource’s thread pool, but these are really problems for timeout-based circuit breaking and flow control to solve. If a component has timeout-based circuit breaking and flow control capabilities, the Bulkhead Pattern becomes less necessary.

Sentinel provides semaphore isolation through flow control in concurrent threads mode. This isolation is very lightweight: it limits the number of concurrent calls to a resource rather than explicitly creating a thread pool, so its overhead is small and it yields good outcomes. Combined with the response-time-based circuit breaking and downgrading mode, it can automatically downgrade an unstable resource when its average response time is relatively high, preventing slow calls from exhausting the concurrency quota and affecting the entire system. By contrast, Hystrix’s semaphore isolation is relatively simple and cannot automatically downgrade slow calls; it can only wait for the client itself to time out, meaning cascading blocking may still occur.
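The idea of limiting concurrent calls without a dedicated thread pool can be sketched with the JDK’s java.util.concurrent.Semaphore. This is an illustration only, not Sentinel’s or Hystrix’s implementation; the ConcurrencyGate class and its methods are hypothetical names.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Toy concurrency limiter (hypothetical): the protected call runs on the caller's thread.
final class ConcurrencyGate {
    private final Semaphore permits;

    ConcurrencyGate(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if a permit is free; otherwise returns the fallback at once.
    <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();      // over the concurrency quota: reject without waiting
        }
        try {
            return protectedCall.get();
        } finally {
            permits.release();          // always hand the permit back
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Because no thread hand-off is involved, the only per-call cost here is acquiring and releasing a permit. Note that the gate by itself does nothing about slow calls holding permits, which is the gap that response-time-based downgrading is meant to close.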

Sentinel’s and Hystrix’s respective circuit breaking and downgrading functions are both essentially based on the Circuit Breaker Pattern. Both support circuit breaking and downgrading based on failure ratio (or abnormality ratio), which automatically breaks the circuit when the number of calls reaches a certain volume and the failure ratio reaches a set threshold. In such cases, all calls to the resource are blocked and will not be tentatively restored until a specified time window has passed. As described above, Sentinel also supports circuit breaking and downgrading based on average response time, which automatically breaks the circuit when the service response time continues to run high, thus rejecting further requests and postponing recovery for a period of time. This can prevent situations where calls become very slow and then lead to cascading blocking.
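A failure-ratio breaker of the kind described here can be sketched in a few lines. The sketch is deliberately simplified and is not either library’s implementation: the class name, the explicit nowMillis parameter (used instead of a system clock so the behavior is easy to follow), and the choice to reset all counters when the time window ends are assumptions.

```java
// Toy failure-ratio circuit breaker (hypothetical names; counters are not windowed).
final class RatioCircuitBreaker {
    private final int minRequests;              // minimum call volume before the ratio is checked
    private final double failureRatioThreshold; // e.g. 0.5 means 50% failures
    private final long openMillis;              // how long the circuit stays open
    private int total;
    private int failed;
    private boolean open;
    private long openedAtMillis;

    RatioCircuitBreaker(int minRequests, double failureRatioThreshold, long openMillis) {
        this.minRequests = minRequests;
        this.failureRatioThreshold = failureRatioThreshold;
        this.openMillis = openMillis;
    }

    // Returns false while the circuit is open; closes it again once the window has passed.
    synchronized boolean allowRequest(long nowMillis) {
        if (!open) {
            return true;
        }
        if (nowMillis - openedAtMillis >= openMillis) {
            open = false;   // time window passed: restore and start counting afresh
            total = 0;
            failed = 0;
            return true;
        }
        return false;
    }

    // Records the outcome of a call that was allowed through.
    synchronized void record(boolean success, long nowMillis) {
        if (open) {
            return;         // outcomes arriving while open are ignored
        }
        total++;
        if (!success) {
            failed++;
        }
        if (total >= minRequests && (double) failed / total >= failureRatioThreshold) {
            open = true;
            openedAtMillis = nowMillis;
        }
    }
}
```

A response-time-based breaker would follow the same open/close logic but trip on average latency rather than on the failure ratio.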

Both Hystrix and Sentinel implement their real-time metric statistics with sliding windows. Prior to version 1.5, Hystrix used sliding windows implemented as circular arrays, with the statistics for each bucket updated using locks and CAS operations. Since version 1.5, Hystrix has reworked its real-time metric statistics, abstracting the metric statistics structure into a reactive stream. This makes metric information convenient for consumers to use. Meanwhile, its underlying layer has been transformed into an event-driven model based on RxJava. In this mode, when a service call succeeds, fails, or times out, the corresponding event is emitted; through a series of transformations and aggregations, the real-time metric statistics stream can be obtained and consumed by the Circuit Breaker or Dashboard.

Currently, Sentinel abstracts the Metric statistics interface, and the underlying layer can have different implementations. The current default implementation is based on LeapArray’s sliding window, and subsequent implementations such as reactive stream may be introduced as needed.
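For readers unfamiliar with circular-array sliding windows, the following sketch shows the basic mechanism: time is divided into fixed-length buckets, a bucket slot is reused once its time span has rolled out of the window, and the window total is the sum of the buckets that are still current. It is written in the spirit of such designs and is not Sentinel’s LeapArray code or Hystrix’s; the class name and the explicit nowMillis parameter (assumed non-negative) are illustrative choices.

```java
import java.util.Arrays;

// Toy sliding-window counter over a ring of time buckets (hypothetical names).
final class SlidingWindowCounter {
    private final long bucketMillis;
    private final long[] bucketStart; // start time of the span each slot currently holds
    private final long[] counts;

    SlidingWindowCounter(int bucketCount, long bucketMillis) {
        this.bucketMillis = bucketMillis;
        this.bucketStart = new long[bucketCount];
        this.counts = new long[bucketCount];
        Arrays.fill(bucketStart, -1L); // -1 marks a slot that has never been used
    }

    // Adds n events at time nowMillis, recycling the slot if it holds an old span.
    synchronized void add(long nowMillis, long n) {
        long start = nowMillis - nowMillis % bucketMillis;
        int idx = (int) ((nowMillis / bucketMillis) % counts.length);
        if (bucketStart[idx] != start) {
            bucketStart[idx] = start;
            counts[idx] = 0;
        }
        counts[idx] += n;
    }

    // Sums the buckets whose span still falls inside the window ending at nowMillis.
    synchronized long sum(long nowMillis) {
        long windowMillis = bucketMillis * counts.length;
        long sum = 0;
        for (int i = 0; i < counts.length; i++) {
            if (bucketStart[i] >= 0 && nowMillis - bucketStart[i] < windowMillis) {
                sum += counts[i];
            }
        }
        return sum;
    }
}
```

A production implementation would typically replace the synchronized methods with per-bucket atomic updates to reduce contention; the sketch keeps locking coarse for readability.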

Exploring Sentinel: Unique Features and Characteristics

As well as characteristics which overlap those of Hystrix, Sentinel features a number of traits that are unique to it alone, as detailed in the following sections.

As a full-feature, high-availability flow control component, Sentinel’s core “sentinel-core” does not have any extra dependencies. It is less than 200 KB after packaging and very lightweight. Developers can safely introduce sentinel-core without worrying about dependency issues. Meanwhile, Sentinel provides a variety of extension points that users can easily extend and seamlessly fit into Sentinel according to their needs.

The performance loss caused by introducing Sentinel is very small. It has a significant impact (around 5% to 10%) only when a single machine handles more than 250,000 QPS; the loss is almost negligible when the single-machine QPS is lower.

Sentinel can perform flow control on resource calls and adjust random requests to appropriate shapes based on different running metrics such as QPS, number of concurrent calls, and system load according to different call relationships.

Sentinel supports a variety of traffic shaping strategies that automatically adjust traffic to appropriate shapes when QPS is too high. The most commonly used of these are direct rejection mode, in which requests that exceed the threshold are rejected directly, and slow-start warm-up mode, which controls the rate at which traffic passes during a surge so that the passing flow increases gradually toward the upper limit over a set period, giving a cold system time to warm up rather than letting it crash.
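As a rough illustration of the warm-up idea only, the sketch below raises the allowed QPS linearly from a reduced "cold" level to the configured limit over the warm-up period. This linear ramp is a simplifying assumption and not Sentinel’s actual warm-up algorithm; the class name and the parameters coldFactor, warmUpMillis, and elapsedMillis are hypothetical.

```java
// Simplified linear warm-up ramp (an assumption, not Sentinel's algorithm).
final class WarmUpThreshold {
    private WarmUpThreshold() {
    }

    // Allowed QPS after elapsedMillis of sustained traffic, starting from maxQps / coldFactor.
    static double allowedQps(double maxQps, int coldFactor, long warmUpMillis, long elapsedMillis) {
        double coldQps = maxQps / coldFactor;
        if (elapsedMillis <= 0) {
            return coldQps;     // system is cold: admit only a fraction of the limit
        }
        if (elapsedMillis >= warmUpMillis) {
            return maxQps;      // fully warmed up
        }
        return coldQps + (maxQps - coldQps) * elapsedMillis / (double) warmUpMillis;
    }
}
```

In direct rejection mode there is no ramp at all: the threshold is simply the configured limit, and anything above it is rejected.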

[Figure: traffic shaping illustration]

As illustrated above, constant flow mode (implemented by the Leaky Bucket algorithm) strictly controls the time interval for request passage, while stacked requests are queued and requests exceeding the timeout period are rejected directly.
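The behavior described here, a fixed interval between passing requests plus a cap on how long a request may queue, can be sketched as follows. It is a minimal model of leaky-bucket pacing under stated assumptions, not Sentinel’s code: the class and method names are hypothetical, time is passed in explicitly, and the method reports how long the caller should wait instead of sleeping.

```java
// Toy constant-rate pacer with a queueing timeout (hypothetical names).
final class PacingLimiter {
    private final long intervalMillis;  // fixed gap between two passing requests
    private final long maxQueueMillis;  // longest time a request is allowed to wait
    private boolean started;
    private long lastPassMillis;        // scheduled pass time of the latest admitted request

    PacingLimiter(long intervalMillis, long maxQueueMillis) {
        this.intervalMillis = intervalMillis;
        this.maxQueueMillis = maxQueueMillis;
    }

    // Returns the wait in milliseconds before the request may pass, or -1 if it is rejected.
    synchronized long tryPass(long nowMillis) {
        long expected = lastPassMillis + intervalMillis;
        if (!started || expected <= nowMillis) {
            started = true;
            lastPassMillis = nowMillis;
            return 0;                   // the interval has already elapsed: pass immediately
        }
        long wait = expected - nowMillis;
        if (wait > maxQueueMillis) {
            return -1;                  // would queue longer than the timeout: reject
        }
        lastPassMillis = expected;      // reserve the next slot for this request
        return wait;
    }
}
```

With an interval of 100 ms (10 QPS) and a 150 ms queueing timeout, three requests arriving at the same instant would see the first pass at once, the second wait 100 ms, and the third be rejected because its slot is 200 ms away.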

[Figure: call-relationship-based traffic limiting]

As the above figure shows, Sentinel also supports call-relationship-based traffic limiting, including caller-based traffic limiting, callchain-ingress-based traffic limiting, and associated flow traffic limiting. Based on its powerful call chain statistics, Sentinel can provide accurate traffic limiting in different dimensions.

At present, Sentinel does not support asynchronous call chains effectively, but subsequent versions will address these needed improvements.

System load protection

Sentinel also provides protection at the level of the whole system. Its load protection algorithm draws on the idea of TCP BBR. When the system load is high, the system may crash and fail to respond if requests continue to be allowed to enter. In a clustered environment, network load balancing then forwards the traffic that this machine should have carried to another machine. If that machine is also close to its limit, the added traffic will cause it to crash as well, eventually making the entire cluster unavailable. In response to this situation, Sentinel provides a protection mechanism that balances the system’s ingress traffic against the system’s load, ensuring that the system handles as many requests as its capacity allows.
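One way to read the BBR-inspired idea is as a capacity estimate: the highest throughput recently observed multiplied by the lowest response time recently observed approximates how many requests the system can usefully hold in flight. The sketch below applies that estimate only once the load threshold is exceeded. It illustrates the principle under these assumptions and is not Sentinel’s implementation; the class name and all parameter names are hypothetical.

```java
// Toy load-based admission check (hypothetical names; illustrates the principle only).
final class LoadGuard {
    private LoadGuard() {
    }

    // Admits a request unless load is high AND in-flight requests exceed estimated capacity.
    static boolean admit(double systemLoad, double loadThreshold,
                         int inFlight, double maxPassQps, double minRtMillis) {
        if (systemLoad <= loadThreshold) {
            return true;    // load is fine: no system-level limiting
        }
        // Estimated capacity = peak throughput (per second) x best-case latency (in seconds).
        double capacity = maxPassQps * minRtMillis / 1000.0;
        return inFlight <= capacity;
    }
}
```

The point of the estimate is that ingress is throttled to what the machine has recently shown it can handle, rather than to a fixed number chosen in advance.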

Real-time monitoring and control panel

Sentinel provides HTTP APIs for obtaining real-time monitoring information, such as call chain statistics, cluster information, and rule information. Users who are using Spring Boot/Spring Cloud and Sentinel Spring Cloud Starter can easily obtain information about the runtime such as dynamic rules through its exposed Actuator Endpoint. In the future, Sentinel will also support a standardized metrics monitoring API that facilitates the integration of various monitoring systems and visualization systems such as Prometheus, Grafana, and more.

The Sentinel Dashboard provides functions such as machine discovery, configuration rules, viewing of real-time monitoring, and viewing of call chain information, making it easy for users to view and configure monitoring.

[Figure: Sentinel Dashboard]

Sentinel has been adapted for Servlet, Dubbo, Spring Boot/Spring Cloud, and gRPC. Users can easily make use of Sentinel’s high-availability traffic protection by introducing appropriate dependencies and making simple configurations. In the future, Sentinel will also adapt to other common frameworks and provide cluster traffic protection for Service Mesh.

Key Takeaways

Sentinel’s comparisons with Hystrix span a number of overlapping areas, as well as areas of emphasis that Hystrix does not address. The following chart offers an organized visual breakdown of the two components’ similarities and differences.

[Figure: comparison chart of Sentinel and Hystrix]

As part of its open source development, the Sentinel team welcomes discussion and questions from interested readers, who can become involved by visiting the Sentinel GitHub page.

(Original article by Su He宿何, GitHub ID @sczyh30)

First hand and in-depth information about Alibaba’s latest technology → Facebook: “Alibaba Tech”. Twitter: “AlibabaTech”.
