Several Important Database Development Trends
Li Feifei, Vice President and Senior Researcher at Alibaba Group, President of Database Products Business Unit
In the early 1980s, database systems gradually moved to center stage in information technology (IT). Since 2000, big data has emerged and come to play a prominent role in the industry, and around 2010 attitudes towards cloud computing shifted markedly as its popularity grew. Information technology is flourishing today, and new trends continue to emerge.
In this article, I would like to share some of my thoughts on these trends, along with what we have achieved.
Three intersecting periods in information technology history
The past 40 years have witnessed the successive rise of three technologies: database systems, big data, and cloud computing.
The database management system is one of the three basic software systems in the computer field, and it rose to prominence in the 1980s. Early relational database management systems (RDBMS) are represented by Oracle Database, which was a huge commercial success. Subsequently, several open-source RDBMSs such as MySQL and PostgreSQL were developed.
In the 1990s, the widespread adoption of RDBMS produced large amounts of data. Analyzing this structured data placed greater demands on database systems, which led to the development of several analytical database management systems during that time.
The first decade of the 21st century (2000 to 2010) saw big data technology take center stage. There were two reasons behind its birth. First, big data itself emerged: with the rapid growth of internet companies such as Google, enormous amounts of data were being generated every day. Second, the methods used to acquire, process, and analyze data changed. Traditional transactional workloads, even the simplest bank transfer, have strict requirements on isolation, consistency, and durability. Big data workloads are different, because a single record has no special impact on the final result of an analysis. This application scenario is completely different from that of the traditional online transactional relational database. Big data arrived at the right time and the right stage: Google published three renowned papers, on the Google File System (2003), MapReduce (2004), and Bigtable (2006), which laid the cornerstone of the current big data technology ecosystem.
After 2010, cloud computing became an increasingly popular trend. The essence of cloud computing is to pool resources efficiently by using distributed technology and to deploy applications transparently and centrally. Viewed together with the development of databases and big data, a data management system is essentially a full-link process that covers data production, processing, consumption, and storage.
Cloud computing has had an immense impact on data processing systems in two ways: first, cloud-native technologies are being applied in depth within data processing systems; second, traditional relational databases and the traditional big data ecosystem are rapidly converging.
The industry trend is to pool and decouple resources and to build the next generation of data processing systems on cloud-native and distributed technology. For example, Alibaba Cloud databases can stably and consistently support the Double 11 Shopping Festival because we have continually analyzed and implemented these concepts.
Take Double 11 as an example. The first graph shows the transaction peak at midnight on Double 11 over the years. The most recent peak, at midnight on Double 11 in 2020, was 580,000 transactions per second. Because every transaction also involves order splitting, the database system has to process millions of transactions per second.
The second graph shows how the system load changes at midnight. The load surged 145-fold within one second. Relying on traditional technology alone, without cloud-native technology, it would be impossible to meet the requirements for high concurrency, elasticity, and high availability.
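The order-splitting estimate above can be sketched as simple arithmetic. Only the 580,000 figure comes from the article; the fan-out factors below are hypothetical, since the text does not state how many database transactions a single order produces.

```python
# Rough estimate of database load at the Double 11 peak.
PEAK_ORDERS_PER_SECOND = 580_000  # figure quoted in the article


def database_tps(orders_per_second: int, fanout: int) -> int:
    """Database transactions per second if each order splits into `fanout` transactions.

    `fanout` is a hypothetical parameter, not a number from the article.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    return orders_per_second * fanout


if __name__ == "__main__":
    for fanout in (2, 5, 10):  # assumed values, for illustration only
        tps = database_tps(PEAK_ORDERS_PER_SECOND, fanout)
        print(f"fanout={fanout}: {tps:,} database transactions per second")
```

Even a modest assumed fan-out of 2 already pushes the database load past one million transactions per second.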
Several important trends
From an architectural perspective, the changes in the database system are shown in the following figure:
On the left is the traditional von Neumann architecture, on the right is the distributed architecture, and in the center is the cloud-native architecture, which makes extensive use of distributed technology. The elasticity and high availability brought about by resource pooling are evident.
These are the three different architectures today, with the following trends:
· Big data and database integration
· Cloud native and distributed technology integration
· Multi-modal data processing
· Software and hardware integration: for example, use of high-speed networks to improve data processing system performance and efficiency
· Security and reliability: for example, ensuring data immutability
These trends are reflected in the core technologies of Alibaba Cloud's databases:
1- Cloud native relational database PolarDB
Each data block is stored on three physical nodes, and the complexity of distribution is hidden from the application. For example, distributed queries against sharded tables are completely transparent to the application, so for any single read or write the system behaves like a centrally deployed database even though it is built on distributed technology.
PolarDB is designed around a decoupled compute and storage architecture, so compute nodes can be scaled out or storage expanded within minutes. Its performance has also been heavily optimized, and it is highly compatible with the existing ecosystem: it is 100% compatible with MySQL and PostgreSQL, and highly compatible with Oracle.
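To illustrate what "transparent to the application" means for sharded data, here is a minimal sketch of hash-based shard routing. It is not PolarDB's implementation; the `ShardedTable` class and its methods are hypothetical names chosen for this example, and each in-memory dictionary merely stands in for a physical shard.

```python
import hashlib


class ShardedTable:
    """Toy sharded key-value table; callers never see which shard holds a key."""

    def __init__(self, shard_count: int) -> None:
        if shard_count < 1:
            raise ValueError("shard_count must be at least 1")
        # Each dict stands in for one physical shard (an assumption of this sketch).
        self._shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key: str) -> dict:
        # A stable hash ensures the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._shards)
        return self._shards[index]

    def put(self, key: str, value: object) -> None:
        self._shard_for(key)[key] = value

    def get(self, key: str, default=None):
        return self._shard_for(key).get(key, default)

    def scan(self) -> dict:
        """Scatter-gather: read every shard and merge the results."""
        merged = {}
        for shard in self._shards:
            merged.update(shard)
        return merged
```

The caller uses `put`, `get`, and `scan` exactly as it would on a single table; the routing decision stays inside the class.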
It also has a great competitive advantage among commercial databases in terms of cost performance. In actual customer cases, by replacing the existing Oracle database with the Oracle-compatible PolarDB version, the overall cost can be reduced to less than one third of the original cost with the same performance.
In addition to the cloud-native architecture, there is also a distributed version, PolarDB-X. In PolarDB-X, each partition is implemented as a three-node group. The three nodes run a consistency protocol to keep their data consistent, and they can be deployed across availability zones (AZs) in the same city.
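The article does not name the protocol the three nodes use. As a hedged illustration only, the sketch below assumes a majority-quorum rule of the kind found in Paxos- and Raft-style protocols: a write counts as committed once two of the three nodes accept it. The `Replica` class and `replicate` function are hypothetical, and leader election, log repair, and rollback are deliberately omitted.

```python
class Replica:
    """One node of a three-node group; it may be unavailable."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self.log = []  # entries this replica has accepted

    def append(self, entry: str) -> bool:
        """Accept the entry if the node is up; report whether it was accepted."""
        if not self.available:
            return False
        self.log.append(entry)
        return True


def replicate(replicas, entry: str) -> bool:
    """Return True if a strict majority of replicas accepted the entry.

    Note: an entry that fails to reach a majority still sits in the logs of
    the nodes that accepted it; a real protocol would reconcile that later.
    """
    acks = sum(1 for replica in replicas if replica.append(entry))
    return acks > len(replicas) // 2
```

With three replicas, the group tolerates the loss of one node (for example, one availability zone) and still commits writes; losing two nodes blocks commits.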
2- Integrated design is the core concept of the next-generation data analysis system
The next-generation data analysis system combines cloud-native and distributed technologies: distributed technology sits at the top and cloud-native technology at the bottom. Each partition enjoys the elasticity and high availability brought by cloud-native technology, while the scale-out capacity of distributed technology resolves bottlenecks caused by high concurrency.
3- Cloud-native data warehouse AnalyticDB
The cloud-native data warehouse is essentially a cloud-native architecture that features storage pooling, computing pooling, and decoupled compute and storage. It can be used to achieve flexible and lightweight deployment of massive amounts of storage.
These technologies integrate offline and online data processing and analysis, and integrate databases with big data. Just as in a physical warehouse, every item is stored under a defined category. A data warehouse is therefore best suited to scenarios where the data format is standardized, the business type is relatively fixed, and high cost performance is required.
This is some of the work we have done on the cloud-native data warehouse. Using this architecture, we developed AnalyticDB (ADB), which supports Taobao and Tmall's demands for interactive online analysis of real-time transaction data, as well as the integration of complex offline ETL with online analysis.
4- Data lake
The data at “the bottom of the data lake” is miscellaneous, but “the surface of the lake” is unified. Unlike a data warehouse, a data lake has multi-source and heterogeneous storage, and only a unified interface is needed to analyze and process the data.
We have developed DLA, a cloud-native serverless data lake solution that performs unified computation and analysis over multi-source, heterogeneous data and is built on object storage. With cloud-native serverless technology, we gain elasticity and high availability at very low cost while meeting security requirements.
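The idea of a miscellaneous "bottom" with a unified "surface" can be sketched in a few lines. This is not DLA's interface; the `DataLake` class and the two reader functions are hypothetical, and inline strings stand in for files that would normally live in object storage.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]


def read_csv(text: str) -> Iterator[Row]:
    """Yield rows from CSV text that starts with a header line."""
    yield from csv.DictReader(io.StringIO(text))


def read_json_lines(text: str) -> Iterator[Row]:
    """Yield rows from newline-delimited JSON, coercing values to strings."""
    for line in text.splitlines():
        if line.strip():
            yield {key: str(value) for key, value in json.loads(line).items()}


class DataLake:
    """Registers heterogeneous sources and exposes a single query method."""

    def __init__(self) -> None:
        self._sources: List[Callable[[], Iterable[Row]]] = []

    def register(self, reader: Callable[[str], Iterable[Row]], raw: str) -> None:
        # Store a thunk so each query re-reads the source in its native format.
        self._sources.append(lambda: reader(raw))

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        """Apply one predicate across every registered source."""
        return [row for source in self._sources for row in source() if predicate(row)]


if __name__ == "__main__":
    lake = DataLake()
    lake.register(read_csv, "user,amount\nalice,30\nbob,5\n")
    lake.register(read_json_lines, '{"user": "carol", "amount": 42}\n')
    print(lake.query(lambda row: int(row["amount"]) > 10))
```

The data stays in its original format; only the readers know about CSV or JSON, while the caller sees one `query` method.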
5- Multi-modal, intelligent, safe, and reliable
Anomaly detection and security diagnosis are implemented at the management and control layer. We use Kubernetes (K8s), the open-source container orchestration system, to manage multi-source, heterogeneous resources and to build an intelligent operations and maintenance platform.
We have developed a fully encrypted database in which data does not need to be decrypted after it enters the kernel. By using secure hardware technology, we keep the data encrypted and protected throughout processing.
Beyond structured data, the diversification of data services has produced multi-modal data such as text, time series, images, graphs, and other unstructured data. To support multi-modal data processing, we have designed and developed Lindorm, a multi-modal database based on cloud-native architecture, and Tair, a cloud-native in-memory database.
Finally, there are ecosystem tools that have been developed for transmission, backup, and management. DTS is used for end-to-end data synchronization, DBS (Database Backup) is used for multi-cloud and multi-end logical and physical backup for databases, DMS is used for enterprise-level DevOps, and ADAM is used for evaluation and migration of applications developed using traditional databases and data warehouses.
The pandemic has brought immense changes to every industry and every aspect of our lives. Traditional offline businesses and online businesses are rapidly converging, and the boundary between them is increasingly blurred. The challenge is that business peaks and valleys are changing ever more drastically. This change is an inevitable result of the pandemic, and digital transformation is equally inevitable.
In this context, the cloud-native database ApsaraDB for PolarDB and the cloud-native data warehouse AnalyticDB have not only supported Double 11 but have also served all industries during the pandemic, especially industries such as online education and games, where the line between online and offline is increasingly blurred.