This article is part of the Academic Alibaba series and is taken from the paper entitled “TcpRT: Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time” by Wei Cao, Yusong Gao, Bingchen Lin, Xiaojie Feng, Yu Xie, Xiao Lou, and Peng Wang. The full paper can be read here.
Increasingly, companies are moving their data and applications from on-premises infrastructure to the cloud, as doing so makes it easier to scale resources, reduce operational costs, and accelerate the implementation and deployment of applications. As such, cloud databases, such as Amazon Relational Database Service (RDS), Microsoft Azure SQL Database, and Google Cloud SQL, are playing an increasingly important role for many enterprises.
This makes the reliability of these databases paramount to the stability of customers’ businesses. Any decline in the service quality of a cloud database could lead to a severe reduction in the performance of user applications, with inevitable knock-on effects for business continuity and user satisfaction. A mission-critical cloud database must demonstrate a smooth end-to-end experience if applications deployed on the cloud are to be run in a stable fashion.
Detecting decreases in performance in real time and finding the cause in a sophisticated network environment is a challenge for traditional database-as-a-service (DBaaS) platforms. Alibaba’s own multi-tenant DBaaS platform, Alibaba Cloud ApsaraDB for RDS, faced this issue too, prompting the Alibaba tech team to develop a new real-time analytics and diagnostics infrastructure called TcpRT.
Distributed DBaaS Platforms: A Double-Edged Sword
Some commercial cloud databases are powered by multi-tenant DBaaS platforms, which typically adopt a distributed architecture. Vendors generally prefer this architecture because it helps with multi-tenant management, scalability, and availability. However, it can also complicate the process of locating the root cause of any decrease in performance.
Many factors influence the performance of a database in a complex network topology, including packet loss, TCP Incast, OS kernel faults, and slow disks. As a result, troubleshooting can prove difficult. Being able to quickly diagnose and fix service quality problems in a database is critical for cloud database vendors, as it reduces mean time to recovery (MTTR) and improves recovery time objective (RTO).
TcpRT introduces a novel method of collecting tracing data from database instances in a non-intrusive manner and uses that data to detect anomalies in real time.
Many useful metrics can be calculated from the collected data, including the end-to-end latency of a SQL request and the time spent in each stage. This information is usually hard to acquire outside the OS kernel, but TcpRT works inside the kernel and is transparent to the running DBaaS processes. It also has little impact on database performance: with TcpRT enabled, the increase in latency and the decrease in throughput are both within 1%.
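As a rough illustration of how per-stage times and end-to-end latency could be derived from kernel-level timestamps, consider the minimal Python sketch below. The four timestamp fields and the three stage names (upload, process, download) are assumptions made for this example; the article does not list the exact events TcpRT records, and the real collection happens inside the kernel rather than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTrace:
    """Timestamps (microseconds) for one request/response exchange.

    All field names are hypothetical and chosen only for illustration.
    """
    first_request_byte_us: int    # first packet of the SQL request arrives
    last_request_byte_us: int     # last packet of the SQL request arrives
    first_response_byte_us: int   # database starts sending the response
    last_response_acked_us: int   # client acknowledges the final response packet


def stage_latencies(trace: RequestTrace) -> dict:
    """Split one exchange into three assumed stages and sum them."""
    upload = trace.last_request_byte_us - trace.first_request_byte_us
    process = trace.first_response_byte_us - trace.last_request_byte_us
    download = trace.last_response_acked_us - trace.first_response_byte_us
    return {
        "upload_us": upload,
        "process_us": process,
        "download_us": download,
        "end_to_end_us": upload + process + download,
    }


example = RequestTrace(
    first_request_byte_us=1_000,
    last_request_byte_us=1_200,
    first_response_byte_us=4_200,
    last_response_acked_us=4_700,
)
breakdown = stage_latencies(example)
```

Splitting latency this way is what makes diagnosis possible: a long processing stage points at the database host, whereas long upload or download stages point at the network path.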
Designed to be as efficient as possible with minimal performance overhead, the TcpRT system collects database TCP trace events transparently and applies scalable stream computing to process tens of millions of TCP traces every second. It can collect over 20 million raw traces every second and process over 10 billion aggregated results at the backend every day. To reduce the amount of data sent to the backend, raw trace data is first aggregated; the aggregated results are then processed, grouped, and analyzed on a stream computing platform.
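The idea of aggregating raw traces before shipping them to the backend can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout, the one-second window, the instance identifiers, and the choice of count/sum/max as summary statistics are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def aggregate_traces(traces, window_s=1):
    """Collapse raw traces into one summary per (instance, time window).

    `traces` is an iterable of (timestamp_s, instance_id, latency_us)
    tuples; this layout is an assumption made for the example.
    """
    buckets = defaultdict(lambda: {"count": 0, "sum_us": 0, "max_us": 0})
    for timestamp_s, instance_id, latency_us in traces:
        # Align the timestamp to the start of its window.
        window_start = int(timestamp_s // window_s) * window_s
        bucket = buckets[(instance_id, window_start)]
        bucket["count"] += 1
        bucket["sum_us"] += latency_us
        bucket["max_us"] = max(bucket["max_us"], latency_us)
    return dict(buckets)


raw = [
    (0.1, "db-1", 100),
    (0.9, "db-1", 300),
    (1.2, "db-1", 50),
    (0.5, "db-2", 70),
]
summary = aggregate_traces(raw)
```

Here four raw traces shrink to three summaries. With thousands of requests per instance per second, the same grouping would cut the volume sent to the backend far more sharply, while still allowing averages (sum divided by count) to be recomputed downstream.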
The system also uses a self-adjustable Cauchy distribution model to detect anomalous events in the database services automatically. Typically, enterprises employ expert database administrators to manually set the thresholds of some of the more important metrics. This has several drawbacks, however. First, thresholds are specific to applications and may need to be changed over time. Second, Alibaba’s DBaaS platform contains hundreds of thousands of database instances, making it essentially impossible to configure each of their thresholds manually. For this reason, TcpRT uses self-adjustable thresholds that are trained on the historical performance data of each instance.
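To make the idea of a per-instance, data-driven threshold concrete, here is a minimal Python sketch of anomaly detection with a Cauchy distribution. It is a generic textbook approach, not the model described in the paper: the location is estimated as the median and the scale as half the interquartile range (the quartiles of a Cauchy distribution lie at location ± scale), and a value is flagged when its upper-tail probability falls below an assumed cutoff `alpha`. The function names, the sample data, and the default `alpha` are all hypothetical.

```python
import math
import statistics


def fit_cauchy(history):
    """Estimate (location, scale) from historical metric values."""
    q1, median, q3 = statistics.quantiles(history, n=4)
    scale = (q3 - q1) / 2
    # Guard against a zero scale when the history is nearly constant.
    return median, max(scale, 1e-9)


def upper_tail_probability(value, location, scale):
    """P(X > value) for a Cauchy(location, scale) random variable."""
    return 0.5 - math.atan((value - location) / scale) / math.pi


def is_anomalous(value, history, alpha=0.01):
    """Flag values that are improbably high given this instance's history."""
    location, scale = fit_cauchy(history)
    return upper_tail_probability(value, location, scale) < alpha


# Hypothetical latency samples (milliseconds) for one database instance.
history_ms = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
spike_flagged = is_anomalous(100, history_ms)
normal_flagged = is_anomalous(11, history_ms)
```

Because the parameters are refit from each instance's own history, no administrator has to pick a threshold by hand. Refitting on a sliding window of recent data is one simple way such a threshold could adjust itself as the workload changes.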
TcpRT has been deployed successfully in production systems at Alibaba Cloud for the last three years. The paper presents several case studies that detail how TcpRT has been and can be used to resolve users’ problems in their production systems.