Computational Storage: Data Compression and Database Computing Pushdown

By Zhongzhe Xiong, ScaleFlux

“The chips are down for Moore’s law”, an article published in Nature on February 9, 2016, reported that the upcoming International Technology Roadmap for Semiconductors (ITRS) would no longer pursue Moore’s Law, breaking the golden rule that had dominated the chip industry for 50 years.

Figure 1

In its narrow sense, Moore’s Law refers to Gordon Moore’s observation that the number of transistors on a microchip doubles every 18 to 24 months even as the cost of computing is halved; it relates price to integration level. However, the Moore’s Law we discuss here encompasses three important rules: Moore’s Law itself, Dennard Scaling, and Pollack’s Rule.

· Dennard Scaling states that as transistor density increased, power consumption per transistor would drop, so that the power per mm² of silicon would remain near constant. Because the computational capability of a mm² of silicon was increasing with each new generation of process technology, computers would become more energy efficient.

· Pollack’s Rule states that with the same process, microprocessor performance increase is roughly proportional to the square root of the increase in the number of transistors on a microchip.

In practice, Dennard scaling began to slow significantly in 2007 and had faded to almost nothing by 2012, as shown in the following figure.

Figure 2

Therefore, the “end of Moore’s Law” means the end of the user value triangle (price, integration, and performance) built on top of these three rules.

John L. Hennessy and David A. Patterson, winners of the 2017 Turing Award, proposed using heterogeneous processors to optimize design time and cost in their article “A New Golden Age for Computer Architecture”:

“As the focus of innovation in architecture shifts from the general-purpose CPU to domain-specific and heterogeneous processors, we will need to achieve major breakthroughs in design time and cost.”

Heterogeneous computing combines compute units with different instruction set architectures (reduced instruction set computing, RISC, and complex instruction set computing, CISC) and different system architectures. Tasks are assigned to the most suitable compute units (including CPUs, GPUs, and FPGAs) so that each unit plays to its strengths.

Computational storage can be understood simply as adding a compute unit (such as an FPGA) to an existing storage medium (such as an NVMe SSD). The compute unit accelerates computing tasks that are directly related to storage, offloading CPU-intensive work. However, for stateful applications, computational storage cannot deliver its full value without insight into the important problems, an understanding of the trade-offs of existing solutions, and innovative designs. Dr. Yale Patt, a leading authority on computer architecture, proposed that the role of an architect is to look backward (to the past), look forward (to the future), look up (to the problem), and look down (to the device). The same applies in the storage field.

First, let us look up (to the problem). Storage implements the transfer of information over time. The perennial complaint about storage is that it is “not fast enough, not large enough”. The emergence of SSDs greatly improved storage performance (IOPS and latency), but their ever-decreasing price still cannot keep up with the explosive growth of data. The nature of SSDs means that capacity affects not only cost but also performance. Unlike memory and HDDs, SSDs cannot overwrite data in place: a block must be erased before its pages can be written again. When an SSD runs low on clean space because of fragmentation, valid pages must be read out of candidate blocks and rewritten to previously erased blocks so that the candidates can be erased. This process is called garbage collection (GC) and leads to write amplification. The latency of a single erase operation is several times that of a write operation, and the latency of a write operation is dozens of times that of a read operation. In mixed read/write scenarios, GC can therefore cause latency jitter and hurt performance. To reduce the GC frequency, SSDs use optimized GC algorithms such as the greedy reclaiming policy, as shown in the following figure.

In addition, over-provisioning (OP) is used. The OP of enterprise-grade SSDs is typically up to 28%, and that of consumer-grade SSDs is typically around 7%. Research from IBM indicates that the write amplification factor is between 3.5 and 4.5 when the remaining space of an SSD is 10%, and can be as low as 0.2 when the remaining space is 30%, as shown in the following figure.
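The link between over-provisioning and write amplification can be made concrete with a back-of-the-envelope model (a toy model, not the IBM methodology cited above): assume a greedy GC that reclaims a victim block while some fraction of its pages are still valid. The victim fractions paired with the OP levels below are illustrative assumptions only.

```python
def write_amplification(valid_fraction: float) -> float:
    """WA for a greedy GC that reclaims a block while `valid_fraction`
    of its pages are still valid: freeing (1 - valid_fraction) pages of
    space costs `valid_fraction` pages of internal rewrites, so
    WA = 1 / (1 - valid_fraction)."""
    if not 0.0 <= valid_fraction < 1.0:
        raise ValueError("valid_fraction must be in [0, 1)")
    return 1.0 / (1.0 - valid_fraction)

# More over-provisioning lets GC pick emptier victim blocks, so the
# valid fraction at reclaim time drops and write amplification falls.
for op, valid in [(0.07, 0.80), (0.28, 0.50)]:
    print(f"OP={op:.0%}, victim {valid:.0%} valid -> WA = "
          f"{write_amplification(valid):.1f}")
```

The model shows the shape of the curve: write amplification grows hyperbolically as victim blocks get fuller, which is why the marginal OP percentage points matter so much.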

Therefore, a reduction in the amount of written data not only saves space but also improves performance. A variety of compression algorithms for different scenarios are available in the industry, such as zstd, zlib, brotli, quicklz, lzo1x, lz4, lzf, and snappy. Existing solutions can be classified into soft compression (CPU-based) and hard compression (FPGA-based).
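The ratio-versus-speed trade-off these algorithms span can be demonstrated with zlib alone, whose compression levels stand in for the spectrum from lz4-style speed to zstd/zlib-style density (synthetic sample data, illustrative only):

```python
import time
import zlib

# Synthetic, highly repetitive log-style data.
data = b"user_id,event,ts\n" + b"42,click,1591234567\n" * 50_000

# Higher levels trade CPU time for a smaller output.
for level in (1, 6, 9):
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"level={level}: ratio={len(data) / len(packed):.1f}x, "
          f"time={elapsed_ms:.1f} ms")
```

On real workloads the right choice depends on whether the bottleneck is CPU cycles or storage/transfer bytes, which is exactly the tension the rest of this section explores.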

In the American comedy series Silicon Valley, Richard, a young computer genius, invents the “middle-out” compression algorithm that goes beyond the theoretical limit and founds the Pied Piper company on it.

As shown in the preceding figure, the CPU provides the computing power required for compression and decompression. Sacrificing CPU resources for storage space can cause the following problems:

· CPU preemption: Compression occupies a large share of CPU resources and competes with applications for them.

· Bandwidth preemption caused by data copying: Frequent, large data copies (DRAM ↔ L3 cache ↔ L2 cache ↔ L1 cache ↔ registers) are performed between the DRAM and the CPU. They preempt the server’s PCIe and memory bandwidth and can cause CPU cache misses, hurting computing efficiency.

· Unstable latency: Because of CPU and bandwidth preemption, clock interrupts and task scheduling in the OS add latency uncertainty when the OS load is high, which is unacceptable for I/O-intensive services.

As shown in the preceding figure, a dedicated FPGA provides the computing power required for compression and decompression and releases CPU resources, but the CPU offload is not complete. Even with DMA, the FPGA cannot achieve true zero-copy: frequent, large data copies still occur between the DRAM and the FPGA and consume a large share of the server’s bandwidth. Furthermore, adding an FPGA to the data path inevitably increases I/O latency, especially for the small data blocks (such as 4 KB, 8 KB, and 16 KB) typical of databases and high-speed block storage systems.

Computational storage provides the following solutions to address these issues:

· CPU-Offload: FPGAs perform the compression and decompression calculations, offloading the CPU. When we look down (to the device), FPGAs offer low latency and suit compute-intensive tasks such as matrix operations, compression, and asymmetric encryption. First, cache and DRAM are integrated on the chip, minimizing interaction with the CPU and avoiding OS process scheduling and inter-process interference, which makes latency predictable. FPGAs are also designed around a custom multiple instruction, multiple data (MIMD) architecture; pipeline parallelism and data parallelism further reduce latency. Figures 5 and 6 show the query latency when FPGAs are applied to search ranking on Bing: FPGAs achieve a lower average latency.

Figure 5 & Figure 6

· Zero-Copy: Computational storage uses built-in FPGAs to perform compression and decompression tasks inside the drive without changing the original data path. This method is known as in-situ processing. It avoids additional data copies, which is why computational storage is also called near-storage computing.

The following figure shows the process of computational storage.

Now let us look forward (to the future). International Data Corporation (IDC) estimates that the total data volume across the globe will reach 175 ZB by 2025. Even with the help of compression technologies, the capacity of storage media is growing ever more slowly relative to the data volume.

A simple example shows the contrast. How long does it take for 32 PCIe 3.0 lanes, 32 PCIe 4.0 lanes, and 32 PCIe 5.0 lanes to transmit 1 PB of data from the storage drives to the DRAM? How long does it take if the data stored on the storage array is transmitted over a 100 Gbit/s network? The following figure presents the results.
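The arithmetic behind that figure can be sketched as follows. The per-lane figures are the commonly quoted usable bandwidths after encoding overhead (roughly 0.985 GB/s for PCIe 3.0, 1.969 GB/s for PCIe 4.0, and 3.938 GB/s for PCIe 5.0); treat the results as order-of-magnitude estimates.

```python
PETABYTE = 10**15  # 1 PB in bytes (decimal)

def transfer_hours(total_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Ideal (contention-free) time to move total_bytes at the given rate."""
    return total_bytes / bandwidth_bytes_per_s / 3600

# Approximate usable per-lane bandwidth, in bytes/s, after encoding overhead.
lane_bw = {"PCIe 3.0": 0.985e9, "PCIe 4.0": 1.969e9, "PCIe 5.0": 3.938e9}

for gen, bw in lane_bw.items():
    print(f"{gen} x32: {transfer_hours(PETABYTE, 32 * bw):.1f} h")

# A 100 Gbit/s network carries at most 12.5 GB/s.
print(f"100 Gbit/s network: {transfer_hours(PETABYTE, 100e9 / 8):.1f} h")
```

Even under these ideal assumptions, moving 1 PB takes hours over the fastest host interconnects, which is the core argument for moving less data rather than moving it faster.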

Finally, let us look backward (to the past). In a modern processor system, the high-speed CPU cache sits at the top of the memory hierarchy, followed by the DRAM and the storage drives. The CPU cache is typically made up of three levels: L1, L2, and L3. Exploiting locality of reference, the CPU checks each cache level in turn before accessing the DRAM. If a hit occurs in the CPU cache, the DRAM is not accessed at all, which significantly reduces access latency. The following figure shows the access path.

In an online analytical processing (OLAP) scenario, where jobs run at a low frequency and the data of different jobs is not closely related, the existing cache system becomes extremely inefficient or even counterproductive: hot data may be swapped out, leading to cache misses and a sharp drop in application performance. The database industry offers a variety of solutions. The following examples show how an Oracle database handles this:

· Shorten the data transmission path: By default, data is loaded into the buffer cache maintained by the database. Starting with Oracle 11g, scans of large tables (larger than 2% of the buffer cache by default) use direct path reads. This lets the data bypass the buffer cache and avoids the drop in cache hit ratio caused by hot data being swapped out.

· Reduce the amount of transmitted data: Oracle Exadata Smart Scan significantly reduces the amount of data transmitted between the storage node and database node by pushing down or offloading most SQL operations to the storage node.

Data grows endlessly, but hardware capacity eventually reaches a limit. Therefore, reducing the amount of transmitted data is the better long-term solution.

Alibaba Cloud has also worked closely with ScaleFlux to reduce the amount of transmitted data in PolarDB. The solution is described in the paper “POLARDB Meets Computational Storage: Efficiently Support Analytical Workloads in Cloud-Native Relational Database” and was presented at FAST 2020 by Prof. Zhang Tong, Chief Scientist of ScaleFlux.

What does pushdown mean? The Index Condition Pushdown (ICP) feature of MySQL can help beginners gain a better understanding of pushdown.

When ICP is not enabled, the database searches the storage engine for data based on the first index condition column and extracts entire rows to the database instance layer. The instance layer then filters the rows against the remaining conditions in the WHERE clause. The following figure shows the process.

Figure 8

When ICP is enabled, if the conditions in the WHERE clause contain both search columns and filter columns and a multi-column index covers them, the database instance layer pushes the filter columns down to the storage engine layer. The storage engine then filters out the rows that do not meet the conditions and reads and returns only the required data. This reduces the volume of transmitted data and the number of table accesses by row ID, which in most cases significantly improves query efficiency. The following figure shows the process.

Figure 9
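The two paths can be mimicked in a toy simulation (hypothetical data; the composite index on (zipcode, lastname) and the row counts illustrate where rows are filtered, not real MySQL internals):

```python
rows = [
    {"zipcode": "95054", "lastname": name, "address": f"{i} Main St"}
    for i, name in enumerate(["Etrunia", "Smith", "Petrunia", "Jones"])
] + [{"zipcode": "10001", "lastname": "Petrunia", "address": "5 Elm St"}]

def matches(row):
    """The extra WHERE condition on the second index column."""
    return "etrunia" in row["lastname"].lower()

def storage_engine_scan(table, zipcode, icp_filter=None):
    """Index range scan on the first index column; with ICP, the pushed-down
    condition is checked inside the engine before the full-row fetch."""
    fetched = []
    for row in table:
        if row["zipcode"] != zipcode:            # index search condition
            continue
        if icp_filter and not icp_filter(row):   # pushed-down condition
            continue
        fetched.append(row)                      # full-row fetch to the server
    return fetched

# Without ICP: the engine returns every row in the zipcode; the server filters.
no_icp = storage_engine_scan(rows, "95054")
result = [r for r in no_icp if matches(r)]

# With ICP: the lastname condition is evaluated inside the storage engine.
with_icp = storage_engine_scan(rows, "95054", icp_filter=matches)

print(len(no_icp), "rows fetched without ICP,", len(with_icp), "with ICP")
assert result == with_icp  # same answer, fewer rows moved
```

The query result is identical either way; what changes is how many rows cross the engine/server boundary, which is the whole point of the pushdown.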

Although MySQL ICP pushes the filtering function down from the MySQL Server layer to the storage engine layer, it still consumes host CPU resources, so it is not yet a complete pushdown. To further reduce CPU consumption, step 4, as shown in Figure 9, can be pushed down to computational storage for the following reasons:

· Significant benefits: Where ICP moves the filter from the instance layer to the storage engine layer, computational storage moves it all the way into the drive. This follows the near-storage computing principle and yields greater benefits.

· Low costs: From the perspective of call relationships, computational storage has minimal impact on the database instance layer; most changes can be completed at the storage engine layer, which reduces the costs of modification and verification.

· FPGA-friendliness: The pushed-down operations, such as filtering and aggregation, parallelize easily, and FPGAs excel at such compute-intensive tasks, just as they do at compression and encryption.

The following figure shows the re-designed process.

Figure 10
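To make the payoff of the re-designed process concrete, here is a hypothetical sketch (the ComputationalDrive class and its read_where interface are invented for illustration, not a real ScaleFlux API) that counts the bytes crossing a simulated PCIe boundary with host-side versus device-side filtering:

```python
RECORD = 128  # bytes per fixed-size record, assumed for illustration

class ComputationalDrive:
    def __init__(self, records):
        self.records = records

    def read_all(self):
        """Host-side filtering: every record crosses the boundary."""
        return list(self.records), len(self.records) * RECORD

    def read_where(self, predicate):
        """Device-side filtering: only matching records cross the boundary."""
        hits = [r for r in self.records if predicate(r)]
        return hits, len(hits) * RECORD

drive = ComputationalDrive(
    [{"order_id": i, "amount": i * 7 % 100} for i in range(10_000)]
)

def pred(row):
    return row["amount"] > 90

# Path 1: ship everything, filter on the host CPU.
host_rows, host_bytes = drive.read_all()
host_rows = [r for r in host_rows if pred(r)]

# Path 2: push the predicate into the drive.
dev_rows, dev_bytes = drive.read_where(pred)

assert host_rows == dev_rows  # identical results
print(f"host filter: {host_bytes} B moved; device filter: {dev_bytes} B moved")
```

With a selective predicate, the bytes moved drop in proportion to selectivity, which relieves exactly the PCIe and memory bandwidth pressure described earlier.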

An in-depth discussion of the details is beyond the scope of this article, but the following section gives some examples.

In compression and decompression scenarios, it is easy to maximize either the compression ratio or the performance in isolation. For persistent services, however, low latency and a high compression ratio are both necessary. Computational storage meets both requirements, implementing data compression to reduce storage costs while providing stable I/O latency. Building a commercial product raises many questions: how to select FPGAs based on resources and power consumption, how to implement compression algorithms in an FPGA-friendly way, how to provide predictable latency for data with varying compression ratios, how to keep the user experience smooth and transparent to applications, and how to implement the variable-length mapping between logical block addresses (LBAs) and physical block addresses (PBAs).
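The last point, variable-length LBA-to-PBA mapping, can be sketched minimally: once logical blocks compress to different sizes, the flash translation layer must record where each one landed and how long it is. The class name and layout below are illustrative assumptions, not ScaleFlux's actual design.

```python
import zlib

LOGICAL_BLOCK = 4096  # fixed logical block size seen by the host

class CompressingFTL:
    """Toy FTL: each 4 KB logical write is compressed and appended to a
    log; the L2P table maps LBA -> (physical offset, compressed length)."""

    def __init__(self):
        self.flash = bytearray()
        self.l2p = {}

    def write(self, lba, data):
        assert len(data) == LOGICAL_BLOCK
        packed = zlib.compress(data)
        self.l2p[lba] = (len(self.flash), len(packed))
        self.flash += packed

    def read(self, lba):
        off, length = self.l2p[lba]
        return zlib.decompress(bytes(self.flash[off:off + length]))

ftl = CompressingFTL()
ftl.write(0, b"A" * LOGICAL_BLOCK)       # highly compressible block
ftl.write(1, bytes(range(256)) * 16)     # patterned, still compressible
assert ftl.read(0) == b"A" * LOGICAL_BLOCK
print("physical bytes used:", len(ftl.flash), "for", 2 * LOGICAL_BLOCK, "logical")
```

Even this toy shows why the mapping is the hard part: physical placement is no longer a fixed multiple of the LBA, so the table itself must be persisted, garbage-collected, and kept fast.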

In computing pushdown scenarios, the considerations include how to identify the underlying computational storage drive (CSD) and expose the pushdown interface, how to transmit pushed-down conditions to the hardware, how to optimize the internal logic of the device (streaming and parallel data filtering), how to adapt the format of stored data to streaming, and how to adjust existing monitoring policies when the file system is bypassed.
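As one illustration of "transmitting pushed-down conditions to the hardware", the sketch below packs a simple comparison predicate into a fixed binary layout, such as might ride in a vendor-specific NVMe command payload. The opcode values and field layout are entirely invented for illustration.

```python
import struct

# Invented opcodes for a minimal comparison predicate.
OPS = {"EQ": 0, "LT": 1, "GT": 2}
OPS_REV = {v: k for k, v in OPS.items()}

def encode_predicate(column_offset, op, value):
    """Pack (offset-in-record, comparison op, 64-bit signed operand) into
    a fixed 16-byte little-endian structure that firmware could parse."""
    return struct.pack("<IIq", column_offset, OPS[op], value)

def decode_predicate(blob):
    """Inverse of encode_predicate, as the device side would do."""
    off, op_code, value = struct.unpack("<IIq", blob)
    return off, OPS_REV[op_code], value

blob = encode_predicate(8, "GT", 90)   # e.g. "amount (at offset 8) > 90"
assert decode_predicate(blob) == (8, "GT", 90)
print(len(blob), "bytes:", blob.hex())
```

A fixed, versioned layout like this is what lets the host and the device firmware evolve independently; real designs would also need richer types, conjunctions, and string operands.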

From dedicated computing to the general-purpose computing championed by Intel, and on to today’s heterogeneous computing, history seems to be repeating itself, each time with new meaning. Innovations in computer science are not always the eureka moments of geniuses; more often they are solutions found from a different perspective, based on a deep understanding of the existing software and hardware systems. Computational storage will bring far-reaching impacts and changes to stateful applications, especially databases.

What we have before us are some breathtaking opportunities disguised as insoluble problems.

Author: Zhongzhe Xiong, ScaleFlux

Founded in 2014, ScaleFlux is a leader in deploying complex computing and solid-state storage solutions at scale. Computational storage is the foundation of the modern data-driven architecture that provides low-latency, scalable, and agile capabilities for compute- and I/O-intensive applications.

References:

· A New Golden Age for Computer Architecture

· A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

· A Cloud-Scale Acceleration Architecture

· A Brief Introduction to Cache Memory

· Locality of Reference (Wikipedia)

· Index Condition Pushdown Optimization

· Write Amplification Analysis in Flash-Based Solid State Drives

· Write Amplification

Source: ScaleFlux

Alibaba Tech

First-hand and in-depth information about Alibaba’s latest technology → Facebook: “Alibaba Tech”. Twitter: “AlibabaTech”.
