Aiming for Enterprise-Level Cloud-Native, Alibaba Cloud CNFS: Solving the Container Persistent Storage Dilemma

Driven by the cloud-native trend, applications are increasingly moving toward the containerized model. Meanwhile, Kubernetes is evolving into a new infrastructure for the cloud-native era.

Forrester believes that enterprises and organizations worldwide will run their applications in containerized production environments by 2022. Two trends stand out in how containers and Kubernetes are used today. First, cloud-based Kubernetes has become the preferred choice of enterprises when moving their businesses to the cloud and containerizing their applications. Second, containers are being used for a wider range of workloads, from stateless applications to core enterprise applications and data intelligence applications. More and more enterprises use containers to deploy complex, stateful production applications that require high computing performance, such as web services, content libraries, databases, and even DevOps, AI, and big data applications.

How can we orchestrate and store data for large numbers of containers in the cloud-native era? And how can we improve the performance and stability of container storage?

Evolution of storage capabilities driven by application containerization

Application containerization calls for new types of storage systems with the following features.

1. High storage density

2. Elasticity

3. Data isolation

What kind of storage capabilities does an enterprise need in a containerized environment?

1. Compatibility with applications

2. Extreme elasticity

3. Data sharing

4. Security and reliability

5. Cost reduction

Alibaba Cloud Container Network File System (CNFS)

CNFS optimizes container storage for elasticity and scalability, performance, accessibility, observability, data protection, and declarative management. It stands out in the following areas:

  1. Storage types: CNFS supports both file storage and object storage. It currently supports Alibaba Cloud’s NAS, CPFS, and OSS cloud solutions.
  2. CNFS supports declarative lifecycle management compatible with Kubernetes, allowing for all-in-one management of containers and storage resources.
  3. CNFS supports online and automatic scaling of PVs (persistent volumes) to improve the elastic scalability of containers.
  4. CNFS supports better interaction with Kubernetes to protect data, with features such as PV snapshots, recycle bin, deletion protection, and data encryption.
  5. CNFS supports application-consistent snapshots, automatic analysis of application configurations and storage dependencies, and one-click backup and recovery at the application level.
  6. CNFS supports PV-level monitoring.
  7. CNFS provides better access control and improves the permission security of shared file systems, with features such as directory-level quotas and ACLs.
  8. CNFS optimizes small-file reads and writes on file storage, with latency in the millisecond range.
  9. CNFS reduces storage costs by providing infrequent-access (low-frequency) storage media and policies for moving data onto them.
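
Declarative management (item 2) means that storage is requested through Kubernetes objects instead of being provisioned by hand. The article does not show CNFS's own resource definitions, so the following minimal sketch sticks to the standard Kubernetes PersistentVolumeClaim fields. The helper name build_shared_pvc, the claim name, and the storage class name example-cnfs-nas-sc are hypothetical and used only for illustration.

```python
import json


def build_shared_pvc(name: str, storage_class: str, size_gi: int) -> dict:
    """Return a standard Kubernetes PersistentVolumeClaim manifest as a dict.

    ReadWriteMany is requested because a shared file system is meant to be
    mounted by many pods at the same time.
    """
    if size_gi <= 0:
        raise ValueError("size_gi must be positive")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            # Hypothetical class name; a real cluster defines its own.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }


if __name__ == "__main__":
    manifest = build_shared_pvc("shared-data", "example-cnfs-nas-sc", 20)
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

Pods then reference the claim by name, and the cluster's storage driver is responsible for binding it to an actual volume.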

Typical scenarios and best practices

1. Extremely elastic container scenarios

In highly elastic scenarios, pods must be able to mount and unmount storage PVs flexibly. PV mounting has to keep up with fast container startup, and a burst of starting containers may produce a high file I/O load. As the volume of persistent data grows, storage costs and pressure on the storage system also surge. Given this, we recommend the ACK (Alibaba Cloud Container Service for Kubernetes) + CNFS + NAS combo for the following advantages:

· The built-in file storage class enables the launch of several thousand containers within a short period and the mounting of file storage PVs within milliseconds.

· The built-in NAS provides shared reads and writes across a massive number of containers, so applications can be containerized quickly with high data availability.

· With low latency and small-file optimization, the solution enables data reads and writes within microseconds to satisfy file storage performance requirements under highly concurrent container access.

· The solution supports file storage lifecycle management and automatic hot and cold data classification to reduce storage costs.
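
The hot and cold data classification in the last point can be pictured as an age-based rule. The article does not describe the actual NAS lifecycle policy, so the following is only a sketch under stated assumptions: files are classified by time since last access, and the 30-day threshold, the FileInfo type, and the function name are all hypothetical.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class FileInfo:
    path: str
    last_access_ts: float  # Unix timestamp of the last access


def classify_by_access_age(files, now_ts, cold_after_days=30):
    """Split files into (hot, cold) path lists by days since last access.

    A file is cold when it has not been accessed for at least
    cold_after_days days (an assumed threshold, not a documented default).
    """
    hot, cold = [], []
    for f in files:
        age_days = (now_ts - f.last_access_ts) / SECONDS_PER_DAY
        if age_days >= cold_after_days:
            cold.append(f.path)
        else:
            hot.append(f.path)
    return hot, cold
```

Cold files would be candidates for cheaper storage media, while hot files stay on the faster tier.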

2. AI container scenarios

· AI data flows are complicated and may be restricted by data I/O bottlenecks in the storage system.

· AI training and inference require high-performance computing and storage.

· In AI computing power collaboration, cloud and IDC resources/applications require unified scheduling.

We recommend the ACK clusters + CNFS + NAS/CPFS combo for the following advantages:

· The optimized NAS read and write performance improves shared storage performance to meet the needs of AI scenarios. This solution supports access to massive numbers of small files and faster AI training and inference.

· Support for containerized computing clusters, such as GPU cloud servers and X-Dragon bare-metal servers, ensures very high throughput and IOPS. CPFS also supports hybrid on-cloud and off-cloud deployment.

· ACK can manage Kubernetes clusters built in IDCs, forming a unified resource pool on and off the cloud. This enables unified scheduling of heterogeneous resources and applications and makes the most of the computing power of large-scale cloud infrastructure.

3. Genomic computing scenarios

· Mining data from a large set of samples requires massive computing and storage resources. In addition, the data volume grows rapidly, storage costs are high, and management is challenging.

· Massive data volumes need to be quickly and securely distributed to multiple locations in China for shared access from multiple IDCs.

· Batch processing of samples takes a long time and requires high computing performance with significant request peaks and troughs, which makes planning difficult.

Given these characteristics of genomic computing scenarios, we recommend the ACK + AGS + CNFS + File Storage NAS + OSS combo for the following advantages:

· CNFS’s built-in file storage class enables a fast, inexpensive, and precise genomic computing container environment.

· CNFS supports OSS PVs for storing raw data output by sequencers, assembled data, and analysis result data for distribution, archiving, and delivery. This way, a large number of users can upload and download data simultaneously, improving data delivery efficiency. CNFS also provides massive storage capacity and lifecycle management capabilities to archive and store cold data at a lower cost.

· AGS supports GPU-accelerated computing of hot data for genomic computing, offering performance 100x better than traditional modes for significantly faster and cheaper gene sequencing.

Besides the typical scenarios above, CNFS offers integrated container and storage solutions for many other applications. For more information, see https://help.aliyun.com/document_detail/264678.html.

Case Studies: CNFS and File Storage for a Modernized Enterprise Application

Video service

BAIJIAYUN is a leading Chinese comprehensive video service provider. During the COVID-19 pandemic, its traffic increased dozens of times over, and it had to scale out rapidly in a way that was imperceptible to users. Additionally, BAIJIAYUN’s service scenarios feature massive read and write volumes, and its computing clusters were scaled out horizontally to four. During recording and transcoding, its previous storage system created an I/O bottleneck, which significantly limited BAIJIAYUN’s ability to handle high traffic volumes and high concurrency.

applications and allow data access immediately after scaling. Ultimately, BAIJIAYUN chose Alibaba Cloud’s container service ACK and file storage service NAS, allowing it to optimize its container cluster architecture and successfully expand its capacity ten times within three days.

NAS supports automatic on-demand scaling with ACK. As a result, several thousand containers can be launched quickly, matching the elasticity of containerized applications. NAS provides standard APIs that are compatible with powerful transcoding software and can be mounted easily on video editing workstations. BAIJIAYUN has exceptionally high requirements for Kubernetes cluster performance. Thanks to the high-performance NAS service, the solution ensured a high throughput of up to 10 GB, which removed the I/O bottleneck while meeting the high traffic volume and concurrency requirements of BAIJIAYUN’s scenarios. As a result, the company’s live-streaming and recording services were launched smoothly during the pandemic.

Autonomous driving

The second case involves a well-known customer in the automotive industry. As a leading Chinese smart car manufacturer, the customer is also an emerging tech firm at the forefront of internet + AI integration. Its products are powered by multiple AI technologies and services, such as speech assistants and autonomous driving.

The company’s challenge was the hundreds of millions of small images (100 KB each and more than 100 terabytes in total) used as training resources for its autonomous driving system. During training, the GPU must repeatedly and randomly access these images, which requires the file system to provide high file access IOPS to accelerate the training process. However, as the storage system grew, its stability and performance failed to keep pace with its size. In addition, other issues, including high costs and complex operations, maintenance, and management, arose as storage resources increased.

support for the customer’s HPC platform for smart driving, speeding up the random access to small files during training by 60%. Data is stored at multiple data nodes in the cluster to support simultaneous access from multiple clients and parallel extension. Moreover, NAS supports multi-level storage data flows, which significantly streamlines autonomous driving data acquisition, transmission, and storage processes.

Genomic computing

The last case study is from the genomic computing sector. The customer is a world-leading life science institution focusing on cutting-edge science. Challenges: The customer’s data grew fast, and its existing storage system could not meet its capacity and linear expansion requirements, creating an I/O bottleneck in its genomic computing system. The large volume of sample data resulted in increased storage costs and complex management.

With Apsara File Storage NAS (NAS) mounted on a containerized cluster, the solution enabled high performance in genomic data computing and data analysis using shared storage. NAS stored the raw data output by sequencers, the assembled data, and the intermediate data generated during processing, ensuring low latency and high IOPS for the containerized storage. Storage throughput increased from 1 GB/s to 10 GB/s, with end-to-end data processing, including data migration to the cloud and result distribution from the cloud, taking less than 12 hours.
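
To get a feel for what the move from 1 GB/s to 10 GB/s means, the following back-of-the-envelope sketch computes how long one full pass over a dataset takes at a given sustained throughput. The 100 TB dataset size is a hypothetical figure chosen for illustration and is not taken from the case study; the calculation also assumes ideal sustained throughput and decimal units (1 TB = 1000 GB).

```python
def transfer_hours(dataset_tb: float, throughput_gb_per_s: float) -> float:
    """Hours needed to read or write dataset_tb at a sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput_gb_per_s must be positive")
    dataset_gb = dataset_tb * 1000  # decimal units: 1 TB = 1000 GB
    return dataset_gb / throughput_gb_per_s / 3600


if __name__ == "__main__":
    dataset_tb = 100  # hypothetical dataset size
    for rate in (1, 10):
        hours = transfer_hours(dataset_tb, rate)
        print(f"{dataset_tb} TB at {rate} GB/s: {hours:.1f} h")
```

Under these assumptions, a single pass drops from roughly 28 hours to under 3 hours, which shows why a tenfold throughput gain matters for a pipeline that has to finish within half a day.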

NAS provides elastic bandwidth with high throughput. NAS divides the capacity as needed and allocates appropriate bandwidth to ensure service elasticity at a lower total cost of ownership (TCO) based on the service scale. NAS ensures efficient genomic computing at a lower cost with a uniform process and uniform scheduling of on-cloud and off-cloud heterogeneous computing resources.

Alibaba Tech

First hand and in-depth information about Alibaba’s latest technology → Facebook: “Alibaba Tech”. Twitter: “AlibabaTech”.