A Deep-Dive on PolarDB Parallel Scanning Technology

1. Background

As data grows, SQL execution tends to take longer. This imposes higher requirements on the optimization capability and execution modes of databases. As the promise of the cloud expands to the development of databases, traditional users are keen to migrate their business to the cloud to make use of the elasticity and scalability of the cloud. However, with rapid business expansion, and despite the dynamic resource scale-outs enabled by the cloud, SQL execution remains slow and fails to meet expectations. In many cases, the resource scale-outs are not fully utilized. Traditional commercial database services, such as Oracle and SQL Server, support parallel query engines to maximize the use of system resources and accelerate SQL execution.

2. How Parallel Queries Work

A query similar to online analytical processing (OLAP) involves a significant volume of data. This volume exceeds the memory capacity of the database. Most of the data may not be cached to the buffer of the database. Instead, the data must be dynamically loaded to the buffer when the query is executed. This causes a large number of I/O operations, which are time-consuming. Therefore, to accelerate SQL execution, the first consideration is how to accelerate I/O operations.

3. Select Tables for Parallel Scanning

The selection of tables for parallel scanning is the foundation of parallel execution plans. You must calculate and compare the costs of parallel scanning of different tables and then select the tables suitable for parallel scanning. This is the first step of iterating parallel execution plans. The calculated costs of parallel scanning may help you find a better JOIN order, especially when a large number of tables are to be joined, and thus, more iteration space is needed. You can retain the original JOIN order to make the optimization process more efficient. In addition, the access methods of the tables to be joined may vary. Access methods include full-table scanning, index scanning, and index range scanning. These access methods also affect the final costs of parallel scanning.

SELECT c.c_name, sum(o.o_totalprice) as s
FROM customer c, orders o
WHERE c.c_custkey = o.o_custkey
AND o_orderdate >= '1996-01-01'
AND o_orderdate <= '1996-12-31'
GROUP BY c.c_name

4. Join Multiple Tables Selected for Parallel Scanning

You can select multiple tables for parallel scanning. For example, if your database contains two or more fact tables, you can select more than one table for parallel scanning. The following code shows an example:

SELECT o.o_custkey, sum(l.l_extendedprice) as s
FROM orders o, lineitem l
WHERE o.o_custkey = l.l_orderkey
GROUP BY o.o_custkey
  • In the second solution, after the shared hash table named Build is created, each thread reads a shard from the Probe table and then performs a hash join. The Probe table does not need to be sharded based on a hash key, but the shards read by threads cannot overlap.

5. Parallel Execution of Complex Operators in Statistical Analysis

GROUP BY operations are essential for statistical analysis. In particular, if a JOIN operation generates a large amount of data, the following GROUP BY operation on the results of the JOIN operation is the most time-consuming in SQL execution. Therefore, the parallel execution of GROUP BY operations is also a top priority for parallel query engines.

6. Linear Acceleration of Parallel Query Engines for TPC-H Queries

The following figure shows the effects of a parallel query engine on accelerating TPC-H queries. All TPC-H queries were accelerated, and overall, the queries were executed nearly 13 times faster. In all, the execution of 70% of the queries is over eight times faster. Queries in Q6 and Q12 were executed over 32 times faster.

7. Summary

Databases are the core of application systems, and optimizers are the core of databases. The success or failure of a database service depends on the quality and performance of its optimizer, and the creation of a brand new optimizer is a huge challenge to a development team. The stability of the optimizer is a great obstacle, not to mention its technical complexity. Therefore, even traditional commercial database providers can only improve existing optimizers and enhance the support for parallel execution in the pursuit of building a mature parallelization optimizer. The same is true for Alibaba Cloud PolarDB. In the design and development of the parallel query engine, Alibaba Cloud fully utilizes its accumulation of optimizer technologies and implementation base and continues to make improvements. With these efforts, Alibaba Cloud provides an optimizer technology solution with constant iterations that ensure the stable performance of the improved optimizer and pave the way for technological innovation.

About Us

PolarDB is a cloud-native distributed relational database service developed by Alibaba Cloud. In 2020, PolarDB was named a Leader by Gartner and won the first prize during the Science and Technology Award of the Chinese Institute of Electronics. PolarDB is based on a cloud-native distributed database architecture and provides the capabilities of large-scale online transaction processing and parallel processing of complex queries. PolarDB is a leader in the field of cloud-native distributed databases and has been recognized by the market. PolarDB demonstrates best practices within Alibaba Group. During the 2020 Tmall Double 11, PolarDB provided full support and a processing capability of 140 million transactions per second during traffic peaks, which hit a record high.

First-hand & in-depth information about Alibaba's tech innovation in Artificial Intelligence, Big Data & Computer Engineering. Follow us on Facebook!