Aiming for Corporate Level Cloud-Native, Alibaba Cloud CNFS: Solving the Container Persistent Storage Dilemma

Aug 31, 2021

Driven by the cloud-native trend, applications are increasingly moving toward the containerized model. Meanwhile, Kubernetes is evolving into a new infrastructure for the cloud-native era.

Forrester believes that enterprises and organizations worldwide will run their applications in containerized production environments by 2022. Two general phenomena can be observed through today’s container and Kubernetes applications. First, cloud-based Kubernetes has become the preferred choice of enterprises when moving their businesses to the cloud and containerizing their applications. Additionally, users apply containers in new ways, including stateless applications, core enterprise applications, and data intelligence applications. More and more enterprises use containers to deploy complex, stateful production applications that require high computing performance, such as web services, content libraries, databases, and even DevOps, AI, and big data applications.

How do we find solutions to implement data orchestration and storage for large numbers of containers in the cloud-native era? How shall we improve the performance and stability of containerized storage?

Evolution of storage capacity driven by application containerization

Today’s computing and application landscape is undergoing tremendous change as our infrastructure evolves from physical servers to virtual servers, containerized environments such as Kubernetes, and even serverless computing models. The most significant difference is that applications, which in the past would exclusively occupy a CPU and memory partition on a virtual server, are now delivered as function-based services under the serverless model.

The resulting technical system calls for new types of storage systems with the following features.

1. High storage density

In the virtual machine era, one virtual machine had one independent storage space to store all the specific application data. However, the Kubernetes serverless environment uses a shared storage space. Every container needs to access a vast resource pool, resulting in high storage density and higher storage capacity requirements.

2. Elasticity

Once a physical or virtual server is created, its storage media are generally used in relatively stable cycles. However, in today’s containerized environment, the needs of front-end computing services change dramatically and unpredictably. They may need dozens of servers at one time and then several hundred servers soon after, which requires highly elastic storage resources.

3. Data isolation

In a Kubernetes serverless environment, it is hard to guarantee the exclusivity of memory or storage resources, because storage and computing resources, including the operating system and some underlying dependency packages, are all shared in a containerized environment. We need to implement secure isolation at the infrastructure level and data isolation at the upper, application level through well-structured security policies and solutions, which is a considerable change and challenge.

What kind of storage capabilities does an enterprise need in a containerized environment?

Block storage, file storage, and object storage are standard containerized storage solutions. What kind of file storage capabilities does an enterprise need in a containerized environment?

1. Compatibility with applications

Enterprises usually cannot transform their existing application models quickly, and in many scenarios those applications rely on shared storage or distributed storage clusters. This makes storage service compatibility with applications important. We must ensure data consistency between containerized environments and non-containerized environments to minimize or eliminate the work required to adapt applications.

2. Extreme elasticity

A significant feature of containerized deployment is the elasticity that responds to the rapid changes in resource needs during business peaks and troughs. As the computing resources at the upper level change elastically, the storage resources at the lower level are also required to respond quickly. They must not take a long time to sync underlying data.

3. Data sharing

The datasets used by big data and high-performance computing (HPC) applications are enormous, often in the terabytes and sometimes in the hundreds of terabytes. If this massive amount of data cannot be shared and must instead be copied and transferred in an elastic containerized environment, costs increase and efficiency is lost.
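In Kubernetes terms, sharing instead of copying usually means that many pods mount the same PersistentVolumeClaim with the ReadWriteMany access mode. The following is a minimal sketch that builds such a claim and a Deployment that mounts it, using only standard Kubernetes fields. The storage class name `example-shared-nas`, the object names, the image, and the sizes are hypothetical placeholders, not values from CNFS or ACK.

```python
import json


def shared_pvc(name, storage_class, size):
    """Build a PersistentVolumeClaim that many pods can mount at once."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods on different nodes read and write the volume.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,  # hypothetical class name
            "resources": {"requests": {"storage": size}},
        },
    }


def deployment_using(pvc_name, replicas, image, mount_path):
    """Build a Deployment whose replicas all mount the same claim."""
    labels = {"app": "dataset-reader"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "dataset-reader"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "reader",
                        "image": image,
                        "command": ["sleep", "3600"],
                        "volumeMounts": [
                            {"name": "dataset", "mountPath": mount_path}
                        ],
                    }],
                    "volumes": [{
                        "name": "dataset",
                        "persistentVolumeClaim": {"claimName": pvc_name},
                    }],
                },
            },
        },
    }


pvc = shared_pvc("shared-dataset", "example-shared-nas", "1Ti")
deploy = deployment_using("shared-dataset", replicas=50,
                          image="busybox:1.36", mount_path="/data")
manifest = {"apiVersion": "v1", "kind": "List", "items": [pvc, deploy]}
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON as well as YAML. All 50 replicas then see one copy of the dataset under `/data` instead of each pulling its own.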

4. Security and reliability

However abstract the environment, and regardless of whether the underlying infrastructure is physical servers, virtual servers, Kubernetes containers, or a serverless environment, the most fundamental requirement of a business is security. There can be no data contamination between applications. The storage systems must be built on top of the sharing layer to ensure data security.

5. Cost reduction

Enterprises are constantly working to lower their costs in all application scenarios. Cost control remains a concern even in the most critical application scenarios of an enterprise. As a service grows and changes, its data storage requirements may multiply rapidly. Enterprises must support the rapid growth of data storage capacity while optimizing storage costs, which is a great challenge.

Alibaba Cloud Container Network File System (CNFS)

To leverage the advantages and meet the challenges of container file storage, Alibaba Cloud has released Container Network File System (CNFS), which is built into Alibaba Cloud Container Service for Kubernetes (ACK). CNFS abstracts Apsara File Storage NAS (NAS) into a Kubernetes Custom Resource Definition (CRD) for the independent management of O&M operations, including creation, deletion, description, mounting, monitoring, and scaling. This makes it easier for users to store files from containers, improves performance and data security, and ensures inter-container consistency through declarative management.

CNFS thoroughly optimizes containerized storage in terms of elasticity and scalability, performance optimization, accessibility, observability, data protection, and declarative mechanisms. This product outperforms its peers in the following areas:

Storage types: CNFS supports both file storage and object storage. It currently supports Alibaba Cloud’s NAS, CPFS, and OSS cloud solutions.
CNFS supports declarative lifecycle management compatible with Kubernetes, allowing for all-in-one management of containers and storage resources.
CNFS supports online and automatic scaling of PVs (persistent volumes) to improve the elastic scalability of containers.
CNFS supports better interaction with Kubernetes to protect data, with features such as PV snapshots, recycle bin, deletion protection, and data encryption.
CNFS supports application consistency snapshots, automatic analysis of application configurations and storage dependencies, and one-click backup and recovery at the application level.
CNFS supports PV-level monitoring.
CNFS provides better access control and improves the permission security of shared file systems, with features such as directory-level Quota and ACL.
CNFS optimizes small-file read and write performance in file storage, with latency at the millisecond level.
CNFS reduces storage costs by providing low-frequency media and conversion strategies.
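
To make the idea of online, automatic PV scaling concrete: in stock Kubernetes, a volume is expanded by raising `spec.resources.requests.storage` on its PersistentVolumeClaim, provided the StorageClass sets `allowVolumeExpansion: true`. The sketch below shows one possible threshold-based policy that decides when to expand and builds the corresponding patch body. The threshold, growth step, and size cap are hypothetical parameters chosen for illustration; this is not a description of how CNFS itself decides to scale.

```python
import json
import math


def next_capacity_gib(capacity_gib, used_gib, threshold=0.8, step=0.5,
                      max_gib=10240):
    """Return the new requested size in GiB, or None if no expansion is needed.

    threshold, step, and max_gib are illustrative policy knobs (assumptions).
    """
    if capacity_gib <= 0:
        raise ValueError("capacity must be positive")
    if used_gib / capacity_gib < threshold:
        return None  # still below the usage threshold
    grown = math.ceil(capacity_gib * (1 + step))
    new_size = min(grown, max_gib)
    # Kubernetes volumes can only grow, so skip no-op or shrinking requests.
    return new_size if new_size > capacity_gib else None


def expansion_patch(new_gib):
    """Merge-patch body that raises the storage request of a PVC."""
    return {"spec": {"resources": {"requests": {"storage": f"{new_gib}Gi"}}}}


new_size = next_capacity_gib(capacity_gib=100, used_gib=85)
if new_size is not None:
    print(json.dumps(expansion_patch(new_size)))
```

With the example values, the claim is 85% full, so the policy requests 150 GiB. The printed body could be passed to `kubectl patch pvc <name> --type merge -p '<body>'`, where `<name>` stands for the claim to expand.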

Typical scenarios and best practices

1. Extremely elastic container scenarios

Many applications face sudden spikes in request volumes. These scenarios require a quick expansion of the number of containers and highly elastic resources. Container storage provides the general elasticity and rapid scalability needed to meet these requirements. Typical applications of this type include media, entertainment, live-streaming, web services, content management, financial services, gaming, continuous integration, machine learning, and HPC.

For such applications, pods must be able to mount and unmount storage PVs flexibly. Mounting a storage PV requires a quick boot of the container and may produce a high file I/O load. As the massive amount of persistent storage data proliferates, the storage costs and stress also surge. Given this, we recommend the ACK + CNFS + NAS combo for the following advantages:

· The built-in file storage class enables the launch of several thousand containers within a short period and the mounting of file storage PVs within milliseconds.

· The built-in NAS provides shared reads and writes for a massive number of containers, so applications can be containerized quickly with high data availability.

· With the low latency and small file optimization, the solution enables data reads and writes within microseconds to satisfy file storage performance requirements in the event of highly concurrent container access.

· The solution supports file storage lifecycle management and automatic hot and cold data classification to reduce storage costs.
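
As a rough illustration of hot and cold data classification, the sketch below splits files into two tiers by last access time. The 30-day cutoff, the file paths, and the two-tier model are assumptions made for the example; the actual lifecycle rules of NAS and CNFS are configured in the product and are not reproduced here.

```python
from datetime import datetime, timedelta


def classify_by_last_access(last_access, now, cold_after_days=30):
    """Split file paths into 'hot' and 'cold' lists by last access time.

    last_access maps a path to the datetime it was last read.
    cold_after_days is an illustrative threshold, not a product default.
    """
    cutoff = now - timedelta(days=cold_after_days)
    tiers = {"hot": [], "cold": []}
    for path, accessed in sorted(last_access.items()):
        tier = "cold" if accessed < cutoff else "hot"
        tiers[tier].append(path)
    return tiers


now = datetime(2021, 8, 31)
files = {
    "/data/live/stream-001.ts": datetime(2021, 8, 30),
    "/data/archive/stream-2020.ts": datetime(2020, 12, 1),
    "/data/live/index.m3u8": datetime(2021, 8, 31),
}
tiers = classify_by_last_access(files, now)
print(tiers)
```

Files in the cold list are candidates for lower-cost, low-frequency storage media, while the hot list stays on the performance tier.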

2. AI container scenarios

More and more AI services have moved their training and inference processes to containers. The massive cloud infrastructure capacity and its integration with IDCs also enable more flexible computing power scheduling for AI applications. When an AI service performs training and inference on the cloud, the application generates massive datasets. In the autonomous driving field, for example, some datasets can reach 10 PB or even exceed 100 PB. Therefore, we need to ensure that AI applications can be efficiently trained on such large datasets, which poses the following challenges:

· AI data flows are complicated and may be restricted by data I/O bottlenecks in the storage system.

· AI training and inference require high-performance computing and storage.

· In AI computing power collaboration, cloud and IDC resources/applications require unified scheduling.

We recommend the ACK clusters + CNFS + NAS/CPFS combo for the following advantages:

· The optimized NAS read and write performance improves shared storage performance to meet the needs of AI scenarios. This solution supports access to a massive number of small files and faster AI training and inference.

· Computing clusters adapted to the containerized environment such as GPU cloud drives and bare-metal servers (X-Dragon) ensure super-high throughput and IOPS. CPFS also supports on-cloud/off-cloud hybrid deployment.

· ACK clusters support IDC-built Kubernetes clusters managed via ACK to form a uniform resource pool on and off the cloud. It enables uniform scheduling of heterogeneous resources/applications to maximize the computing advantages of massive cloud-based infrastructure.

3. Genomic computing scenarios

Today, genetic testing technology is rapidly advancing. Already, many hospitals use this technology to treat complex diseases more quickly and accurately. The data size of a genetic sample from one person is already enormous, usually dozens of gigabytes. Moreover, the targeted genetic analysis generally requires samples from hundreds of thousands or millions of people, resulting in complex challenges for container storage solutions:

· Mining data from a large set of samples requires massive computing and storage resources. In addition, the data size snowballs, storage costs are high, and management is challenging.

· Massive data volumes need to be quickly and securely distributed to multiple locations in China for shared access from multiple IDCs.

· Batch processing of samples takes a long time and requires high computing performance with significant request peaks and troughs, which makes planning difficult.
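
A quick back-of-the-envelope calculation shows why these volumes are hard to manage. The sketch below assumes 30 GB per sample, an illustrative value within the "dozens of gigabytes" range mentioned above, and uses decimal units.

```python
def cohort_size_pb(sample_gb, samples):
    """Total raw data in petabytes (decimal units: 1 PB = 1,000,000 GB)."""
    return sample_gb * samples / 1_000_000


# 30 GB per sample is an assumed figure for illustration only.
for samples in (100_000, 1_000_000):
    total = cohort_size_pb(30, samples)
    print(f"{samples:>9,} samples x 30 GB = {total:g} PB")
```

Under this assumption, a cohort of one hundred thousand people produces about 3 PB of raw data and a cohort of one million about 30 PB, before any intermediate or result data is counted.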

Because of the features of genetic computing scenarios, we recommend the ACK + AGS + CNFS + File Storage NAS + OSS combo for the following advantages:

· The built-in file storage class of CNFS enables a fast, inexpensive, and precise genetic computing container environment.

· CNFS supports OSS PVs for saving lower computer data, post-mounting data, and analysis result data for distribution, archiving, and delivery. This way, a large number of users can upload and download data simultaneously, which improves data delivery efficiency. At the same time, CNFS provides massive storage capacity and lifecycle management capabilities to archive and store cold data at a lower cost.

· AGS supports GPU accelerated computing of hot data for genetic computation, offering performance 100x better than traditional modes for significantly faster and cheaper gene sequencing.

Besides the typical scenarios above, CNFS offers optimized container and storage integrated solutions for services in many other applications. For more information, see https://help.aliyun.com/document_detail/264678.html.

Case Studies: CNFS and File Storage for a Modernized Enterprise Application

Leveraging its deep integration with CNFS, Apsara File Storage NAS (NAS) has become the best containerized storage solution. Several case studies are described below to help you better understand how to leverage Alibaba Cloud ACK and the file storage service to build a modernized enterprise application.

Video service

BAIJIAYUN is a leading Chinese comprehensive video service provider. During the COVID-19 pandemic, its traffic increased dozens of times over, and it had to expand rapidly in a way that was imperceptible to users. Additionally, BAIJIAYUN’s service scenarios feature massive read and write volumes, and its computing clusters were horizontally expanded to four. During the recording transcoding process, its previous storage system created an I/O bottleneck, which significantly limited BAIJIAYUN’s ability to deal with high traffic volumes and high concurrency.

applications and allow data access immediately after scaling. Ultimately, BAIJIAYUN chose Alibaba Cloud’s container service ACK and file storage service NAS, allowing it to optimize its container cluster architecture and successfully expand its capacity ten times within three days.

NAS supports automated scaling on demand with ACK. As a result, several thousand containers can be launched quickly, which matches the elasticity that containerized applications need. NAS provides standard APIs that are compatible with powerful transcoding software, so video editing workstations can mount it easily. BAIJIAYUN has exceptionally high requirements for Kubernetes cluster performance. Thanks to the high-performance NAS service, the solution delivered a throughput of up to 10 GB/s, which removed the I/O bottleneck while meeting the high traffic volume and concurrency requirements of BAIJIAYUN’s scenarios. As a result, the company’s live-streaming and recording services were launched smoothly during the pandemic.

Autonomous driving

The second case is a well-known customer engaged in the auto industry. As a leading Chinese smart car manufacturer, the customer is also an emerging tech firm at the forefront of internet + AI integration. Its products are powered by multiple AI technologies and services, such as speech assistants and autonomous driving.

The company’s challenge was the hundreds of millions of small images (100 KB each and more than 100 terabytes in total) used as training resources for its autonomous driving system. During training, the GPU must repeatedly and randomly access these images, which requires the file system to provide high file access IOPS to accelerate the training process. However, as the storage system grew, its stability and performance failed to keep pace with its size. In addition, other issues, including high costs and complex operations, maintenance, and management, arose as storage resources increased.

support for the customer’s HPC platform for smart driving, speeding up the random access to small files during training by 60%. Data is stored at multiple data nodes in the cluster to support simultaneous access from multiple clients and parallel extension. Moreover, NAS supports multi-level storage data flows, which significantly streamlines autonomous driving data acquisition, transmission, and storage processes.
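
To see why small-file workloads are bound by IOPS instead of raw bandwidth, consider the relationship between throughput, file size, and read rate. The sketch below uses the 100 KB image size from this case. The target throughput of 1,000 MB/s and the count of 300 million files are hypothetical values chosen only to make the arithmetic concrete.

```python
def reads_per_second(throughput_mb_s, file_kb):
    """File reads per second needed to sustain a throughput (decimal units)."""
    return throughput_mb_s * 1000 / file_kb


def epoch_hours(num_files, reads_per_s):
    """Hours needed to read every file once at a given read rate."""
    return num_files / reads_per_s / 3600


# Assumed target: 1,000 MB/s over 100 KB images.
rate = reads_per_second(throughput_mb_s=1000, file_kb=100)
# Assumed dataset: 300 million images, read once per training epoch.
hours = epoch_hours(num_files=300_000_000, reads_per_s=rate)
print(f"{rate:,.0f} reads/s, {hours:.1f} hours per epoch")
```

Under these assumptions, the file system has to serve 10,000 random reads per second to deliver 1,000 MB/s, and one pass over the dataset still takes more than eight hours, which is why faster random access to small files translates directly into shorter training time.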

Genomic computing

The last case study is from the genomic computing sector. The customer is a world-leading life science institution focusing on cutting-edge science. Challenges: The customer’s data grew fast, and its existing storage system could not meet its capacity and linear expansion requirements, which created an I/O bottleneck in its genomic computing system. The large sample data size resulted in increased storage costs and complex management.

With Apsara File Storage NAS (NAS) mounted on a containerized cluster, the solution enabled high performance in genomic data computing and data analysis using shared storage. The storage held the lower computer data, the post-mounting data, and the intermediate data generated during processing, while ensuring low latency and high IOPS for the containerized storage. Storage performance increased from 1 GB/s to 10 GB/s, and end-to-end data processing, including data migration to the cloud and result distribution from the cloud, took less than 12 hours.

NAS provides elastic bandwidth with high throughput. NAS divides the capacity as needed and allocates appropriate bandwidth to ensure service elasticity at a lower total cost of ownership (TCO) based on the service scale. NAS ensures efficient genomic computing at a lower cost with a uniform process and uniform scheduling of on-cloud and off-cloud heterogeneous computing resources.
