The ZEEKR App System’s Cloud-Native Architecture Transformation Practice
Preface
New energy vehicles have become a key pillar of the recovery of China’s automobile market. As the new energy vehicle market has grown rapidly, many new brands have emerged. ZEEKR is a high-end electric car brand under the Geely Holding Group. In April 2021, ZEEKR released its first high-end intelligent electric car model, the ZEEKR 001, which was well received by the market. As of December 2022, cumulative deliveries of the 001 had exceeded 70,000 units, and the model ranked first for three consecutive months in the segment of luxury pure electric models priced above 300,000 RMB.
In addition to providing customers with excellent products, ZEEKR establishes connections with users through the ZEEKR app. The ZEEKR app has launched an online community, subscription travel, online shopping, ZEEKR Life, and other innovative modules to achieve full lifecycle management of ZEEKR products and full coverage of the user journey. The app serves users at different stages: those who want to learn about models, those interested in purchasing and using cars, and those who want to share their experiences and get quick solutions to after-sales problems.
“I didn’t know much about ZEEKR before. The ZEEKR app is very helpful to me. I think it is good. At the same time, I learned about the car I want to buy through the software. I have known ZEEKR for a year. I cannot only learn about ZEEKR cars but also use points in the app to change products. I hope ZEEKR can launch more practical products.”
This comment comes from an Apple App Store user review. The ZEEKR app helps car owners control their cars intelligently anytime and anywhere, and it lets them purchase goods and join community activities as part of the travel experience. Users have vehicle information at their fingertips and can travel with greater convenience.
The Practice of Cloud-Native Architecture Exploration
Cloud-Native Technology Development
With the rapid development of ZEEKR’s digital business, the IT behind it is constantly updated and iterated. ZEEKR attaches great importance to customer experience and regards system stability, efficient iteration of business functions, and rapid problem diagnosis and resolution as the cornerstones of its core competitiveness. Liu Hao (Vice President of ZEEKR) said, “In order to respond to user needs quickly (such as shortening the manufacturing cycle of a car and upgrading the car operating system conveniently and smoothly), companies need to innovate from products to user experience to business models. However, the experience in the development of consumer Internet and traditional industries is not enough to meet the high requirements of industrial Internet for cost, efficiency, and quality. Cloud-native is a deterministic technological development trend, which can effectively promote industrial development and drive enterprises to actively innovate. ZEEKR will continue to invest and empower cloud-native capabilities in a wider range of business areas (such as research, production, supply, marketing, and the three electric systems: battery, motor, and electronic control) within the company.”
These business areas fit well with the core capabilities of a cloud-native architecture. In migrating its systems to the cloud, ZEEKR has used the cloud-native technology stack to drive the technical upgrade of each business line and to speed up its digital and intelligent transformation. ZEEKR has followed two principles in technology selection:
The first principle is to embrace mainstream open-source technical standards. This ensures the maturity of technical solutions, makes it easier to obtain technical resources and best practices from the developer community, and helps the enterprise recruit technical talent. In addition, such a strategy avoids lock-in to closed technology systems and specific cloud vendors. Software localization and keeping the technology independent and controllable are also considerations.
The second principle is to make full use of the cloud. Non-functional requirements (such as stability guarantee, underlying technology implementation, technical component maintenance, and auto-scaling) are handed over to cloud vendors as much as possible so the technical team can devote more energy to business innovation.
These two principles are not contradictory. On the contrary, they integrate well, and together they form an architecture selection standard that any enterprise using cloud computing can learn from. Kubernetes is a typical example of an open-source technical standard. The Kubernetes products provided by Alibaba Cloud lower users’ construction costs and integrate better with cloud computing resources. At the same time, users can still use these cloud products through the standard protocols and APIs of open-source Kubernetes, which is the best embodiment of how the two selection principles fit together.
Business Containerization
Under the cloud-native trend, Kubernetes has undoubtedly become the infrastructure of the new generation of cloud IT architecture for enterprises. Since 2021, ZEEKR has started a microservice and containerization transformation plan to migrate the base of IT systems from virtual machines to Kubernetes.
In terms of the selection of the Kubernetes platform, based on the two principles of technology selection, ZEEKR chose Alibaba Cloud ACK. Based on the reliable and stable IaaS platform of Alibaba Cloud, ACK packages more than 30 cloud products to form a new interface for automated O&M and cloud platform interaction. This improves the elasticity and automated O&M capabilities of enterprise business systems.
Thanks to the ease of use and integration capabilities of ACK, the containerization transformation of the ZEEKR IT system went more smoothly than expected. For each business system, the migration from virtual machines to Kubernetes is only a change of the underlying runtime platform and does not involve much transformation cost. When the ZEEKR Technical Team encountered difficult problems during the transformation, it could promptly obtain best practice guidance from Alibaba Cloud on cluster planning, platform operation and maintenance, application adaptation, security protection, observability, and other aspects. This sped up the containerization transformation.
Systems such as the ZEEKR app and SCRM now run 100% on Kubernetes. Compared with traditional VM-based deployment, containerization has helped ZEEKR improve resource utilization by 20% and O&M efficiency by 50%. In September 2022, ZEEKR passed the CAICT maturity assessment for cloud-native technology architecture. In the process of containerization, the ZEEKR Technical Team also mastered the management of ultra-large-scale Kubernetes clusters and promoted the adoption of more new cloud-native technologies.
Unified Microservice Model
The unification of the microservice model took place at the same time as the containerization transformation. Before this, multiple technology stacks coexisted across ZEEKR’s business units, and communication between them was complex. Project handovers often required huge effort, which hindered the progress of digital transformation. Therefore, unifying the microservices model was imperative.
It took ZEEKR over two years to complete this arduous task. Although it required a lot of effort, the benefits are immediate and lasting. Internal teams and third-party ISVs now share a unified technical framework standard. With every team on the same technology stack, research and development efficiency has doubled.
The choice of a microservices model shapes the IT strategy for many years. High openness, high maturity, and high popularity are all indispensable when selecting one. Considering that ZEEKR uses Java as its main development language, Spring Cloud Alibaba was the best choice for the microservice framework.
Spring Cloud Alibaba is dedicated to providing an all-in-one solution for microservice development. It contains the components needed to develop distributed applications, so developers can easily use them through the Spring Cloud programming model. Some of these components are integrated into the code as SDKs, and some run independently as middleware. For the latter, managed cloud products can be chosen to reduce the workload of developers. For example, Alibaba Cloud Microservice Engine (MSE) provides an out-of-the-box, enhanced Nacos registry and configuration center as well as a cloud-native gateway.
Stability and Efficiency Issues Become More Prominent
As expected, after the launch of the ZEEKR app, the number of registered car owners grew explosively, and the usage scenarios kept expanding. In this process, the user experience of the app became more important. Ensuring the stability and agility of the app while the user base grows at high speed, and maintaining the efficiency of microservice development, became challenges for the R&D Team.
Poor Business Continuity without Capacity Planning
Remote car control, online maps, the 3C mall, and other core app services have strict requirements for business continuity. They need to be online 24/7. The app faces high concurrency and heavy traffic, especially during peak-season sales activities, new model releases, and sudden hot events. In these situations, exceptions often occurred (such as functional failures, page failures, excessive delays, and even complete inaccessibility of the app), which had a serious impact on user experience.
Slow Feature Version Iteration
As user scenarios multiply, more features are waiting to be released, and the required iteration frequency keeps rising. However, because the app server side lacked end-to-end canary release capability, developers could only release new versions at midnight (during off-peak hours) to protect business stability. This was a heavy burden on the team, and a lossless release capability was needed.
Technical Architecture Lacks Overall Design
When the company was first established, the overall design of the technical architecture was not considered enough because the priority was to launch the app quickly. This led to many problems, such as high coupling between services, long system links, inconsistent technical implementation standards, and unsuitable choices of cloud products. For example, an investigation found that the request link of one core interface was too long, which resulted in frequent latency jitter and affected user experience.
The R&D Team realized it would face more challenges as the business developed. During rapid business growth, the team must keep the existing business stable, iterate new features quickly, and make sure that development efficiency does not fall as the business grows, especially since recruitment cannot keep pace with the business. In summary, the key to rapid iteration of the app is to solve the problems of stability and efficiency.
- Stability: As the number of users increases, the stability of the system becomes more important. Users encountering abnormal errors at a specific time, a specific function continuously reporting errors, or the system being completely unavailable for a period all damage the product’s reputation among users. A complete outage may even become a hot topic on social media.
- Efficiency: As the number of users increases, requirements grow and business scenarios become more complex. Internal testing can no longer cover all scenarios, so more investment in testing is needed. Although there are more functional requirements, iteration must become faster because there are already many competitors in the market, and speed is one of the keys to competition. Development, testing, and release all need to move faster.
To address these issues, the R&D Team optimizes and tunes microservice systems from traffic ingress to microservices and then from a global perspective based on the business architecture. It conducts in-depth microservice exploration around cost, stability, and efficiency.
Business Link Ingress Upgrade
The gateway architecture in the ZEEKR system was not unified, and each gateway played its own specific role. We can see from the figure that there are many gateways (such as the traffic gateway, API gateway, and microservice gateway), which provide security (WAF), API management, traffic distribution, and other functions. A request link that passes through multiple gateways increases both cost and stability risk.
At this point, MSE cloud-native gateways came to the attention of the R&D Team. A cloud-native gateway combines the traffic gateway (Kubernetes Ingress and Nginx) and the microservice gateway (Spring Cloud Gateway and Zuul) into one, which reduces resource costs by 50%, shortens request time, and reduces O&M complexity.
It is common to use WAF to protect a north-south public network gateway against abnormal traffic. As the Internet environment becomes more complex, users’ demands for protection continue to increase. The common practice is to route traffic to the WAF security gateway, filter it, forward it to the traffic gateway, and then pass it on to the microservice gateway. After upgrading to the cloud-native gateway, the question is whether the security capability for ingress traffic is still available.
The cloud-native gateway connects directly to Alibaba Cloud WAF through its built-in WAF module. This way, a user’s request passes through only the cloud-native gateway and still receives WAF protection, which reduces the O&M complexity of the gateway, as shown in the following figure:
As the ingress of link traffic, the gateway has security capabilities and undertakes the management of ingress traffic/capacity and high availability.
Exploration of the High Availability of Microservices
Use Lossless Online and Offline to Improve the Stability of Microservice
The ZEEKR app uses a microservices model. During business releases and auto-scaling, the request failure rate used to rise and pods restarted repeatedly. To solve this problem, the team uses MSE capabilities to bring services online and offline without loss: adaptive waiting and active notification when an application goes offline, and readiness probes and service preheating when it comes online. This effectively avoids traffic loss during releases and reduces the risk of service access failures. In addition, MSE traffic protection capabilities are applied to core business scenarios (such as interface throttling and degradation, MQ peak shaving and valley filling, and throttling of slow database SQL) to improve the overall stability of the service.
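Service preheating can be pictured as a routing weight that ramps up with instance uptime, while instances that are not ready or are draining receive no traffic. The following is a minimal sketch under stated assumptions, not the actual MSE algorithm: the linear ramp, the 120-second warm-up window, and the names `warmup_weight` and `pick_instance` are all hypothetical.

```python
import random


def warmup_weight(uptime_s: float, warmup_s: float = 120.0, full_weight: int = 100) -> int:
    """Routing weight for an instance, ramping linearly during warm-up (assumed policy)."""
    if uptime_s >= warmup_s:
        return full_weight
    # Never return 0 so that a ready instance can still receive a trickle of traffic.
    return max(1, int(full_weight * uptime_s / warmup_s))


def pick_instance(instances, rng=random):
    """Pick one instance by weight.

    Each instance is a dict with hypothetical keys:
    'name', 'uptime_s', 'ready' (readiness probe passed), 'draining' (going offline).
    """
    candidates = [i for i in instances if i["ready"] and not i["draining"]]
    if not candidates:
        return None
    weights = [warmup_weight(i["uptime_s"]) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With these assumptions, an instance that has been up for 60 seconds receives half the weight of a fully warmed instance, and an instance marked as draining is skipped entirely, which mirrors the idea of notifying callers before an application goes offline.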
Use Horizontal Splitting to Improve Business Auto Scaling Capabilities
With the rapid development of the business, insufficient capacity under the original architecture of the ZEEKR app became more prominent. The system could not scale out quickly in the face of new car releases, sales activities, and sudden hot spots. A large number of core business databases were placed on the same database instance, so a problem on that instance could affect all of them. The Alibaba Cloud Service Team recommended using PolarDB-X to separate the business databases one by one and to split large business tables, which solves the problem of oversized single tables and improves elastic scaling at the database level. In addition, to address insufficient microservice elasticity, container elasticity solutions (such as multi-zone node auto-scaling, HPA, and CronHPA) are provided to improve the ability of core services to cope with traffic bursts.
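To make the HPA part concrete, the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces only that rule; the function name and the replica bounds in the example are hypothetical and are not ZEEKR's actual settings.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA scaling rule, clamped to [min, max]."""
    ratio = current_metric / target_metric
    # Within tolerance, the autoscaler leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas running at 90% average CPU against a 60% target would be scaled to 6 replicas. CronHPA complements this by changing replica counts on a schedule, which suits predictable peaks such as planned sales activities.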
Traffic Protection and Fault Tolerance
Imagine that some downstream service providers hit performance bottlenecks during peak business hours and affect the business. The ZEEKR App Team encountered exactly this problem. During an architecture migration, unexpected slow calls slowed down the system and caused jitter in overall stability. To avoid such problems, a circuit breaker rule is configured on the consumers of non-critical services. When the proportion of slow calls or failed calls within a period reaches a threshold, the circuit breaker is triggered automatically, and a mock result is returned directly for a subsequent period. This ensures that the caller is not dragged down by unstable services, gives the unstable downstream services some breathing time, and keeps the entire business link running normally.
Unexpected things happen often, so how can the system stay highly available and work as well as possible under uncertain conditions? The ZEEKR App Team first approached microservice stability management at the level of the whole app to avoid overall downtime. Then, it sorted out the core services and interfaces, identified their upstream and downstream dependencies, decoupled and reworked the strong dependencies, and determined reasonable parameters for the core services based on monitoring and observability data. After that, it repeatedly configured throttling and degradation rules, ran drills, and optimized the services, summarized practical rules for each scenario, and formulated appropriate technical specifications.
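The circuit breaker rule described above can be sketched as follows. This is an illustrative model only, not the API of MSE or of any specific library: the class name, thresholds, and window sizes are hypothetical, failed calls and slow calls are counted together for brevity, and the half-open probing state that production breakers usually have is left out.

```python
import time
from collections import deque


class SlowCallCircuitBreaker:
    """Opens when the share of slow or failed calls in a time window is too high."""

    def __init__(self, slow_threshold_s=1.0, bad_ratio=0.5, min_calls=5,
                 window_s=10.0, open_s=5.0, clock=time.monotonic):
        self.slow_threshold_s = slow_threshold_s  # calls slower than this count as bad
        self.bad_ratio = bad_ratio                # trip when bad/total reaches this ratio
        self.min_calls = min_calls                # do not trip on too few samples
        self.window_s = window_s                  # statistics window length
        self.open_s = open_s                      # how long to return the mock result
        self._clock = clock
        self._calls = deque()                     # (timestamp, is_bad)
        self._open_until = 0.0

    def call(self, func, fallback):
        """Run func, or return fallback() directly while the breaker is open."""
        if self._clock() < self._open_until:
            return fallback()
        start = self._clock()
        try:
            result = func()
            bad = (self._clock() - start) > self.slow_threshold_s
        except Exception:
            result = fallback()
            bad = True
        self._record(bad)
        return result

    def _record(self, bad):
        now = self._clock()
        self._calls.append((now, bad))
        # Drop samples that fell out of the statistics window.
        while self._calls and self._calls[0][0] < now - self.window_s:
            self._calls.popleft()
        total = len(self._calls)
        bad_count = sum(1 for _, is_bad in self._calls if is_bad)
        if total >= self.min_calls and bad_count / total >= self.bad_ratio:
            self._open_until = now + self.open_s
            self._calls.clear()
```

A consumer would wrap a non-critical downstream call, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical function and the empty list plays the role of the mock result.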
Development and Testing Efficiency Improvement: Online Service Testing
After ZEEKR began to deploy, release, and test on the cloud, they encountered the following problems:
- After the application is deployed, is the application healthy? When there is a problem online, how can they quickly initiate a request and reproduce it?
- How can they quickly verify whether the historical functions are normal before the service is launched?
- Before the large version is launched, what is the impact of the modified content on the performance, and will the service pressure be too great after the volume is increased?
The R&D environment, test environment, staging environment, and production environment are deployed in different VPCs to achieve security isolation. A self-built test tool therefore has to solve the problem of network connectivity between the tool and each environment. The IT staff of the enterprise only want a simple test tool, but they first have to deal with the complex cloud network topology that comes with cloud migration. In addition, for the test tool to be usable from the office network, it must be reachable from that network, which raises network security concerns.
Cloud service testing and stress testing are designed to solve this problem. The elastic computing capability of Function Compute (FC) addresses both network connectivity on the cloud and resource utilization. Using the content provided by the service contract, the service test function automatically fills in the test parameters, so users only need to modify the values to initiate a test. Service tests can be chained together as prompted to achieve automated regression and stress testing.
End-to-End Governance
End-to-End Canary Release, Enabling Release at Any Time during the Day
As sales of ZEEKR cars grow, the number of registered users and daily active users is rising rapidly, and more business scenarios and new functions need to be supported, with an average upgrade frequency of one minor version every two or three days and one major version every half month. To avoid affecting the daytime business peak, each release could only be carried out in the early morning during off-peak hours. If all R&D and O&M personnel take part in a night release, their work efficiency suffers the next day. If fewer people take part, mitigation measures may come too late when problems occur, and responsibility for faults is hard to assign.
The Alibaba Cloud Service Team helped the ZEEKR Team develop and implement an end-to-end canary (grayscale) release solution. A canary version is deployed first and verified with a share of traffic selected by traffic ratio or customer characteristics. After verification is completed, the release proceeds and traffic is switched to the new version. This meets the requirement to release small versions at any time during the day. For scenarios where multiple microservices on the core business link must be released together, the MSE cloud-native gateway and traffic grayscale tagging are used to implement multi-service end-to-end grayscale, which can extend to the CDN, gateway, MQ, configuration, and database. As such, multi-service daytime releases are possible without changing any service code. The services can be verified by gradually increasing the traffic, and if a problem occurs, the traffic can be cut back in time, which reduces the stability risk of releasing during the day. In addition, the transformed Apsara DevOps pipeline helps automate releases of the core business and improves deployment efficiency.
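The two decisions in an end-to-end grayscale release, tagging a request at the gateway and then choosing a version of each service along the call chain, can be sketched as follows. This is a simplified illustration rather than the MSE implementation: the `x-lane` header, the lane names, and both function names are hypothetical.

```python
import hashlib


def choose_lane(user_id: str, headers: dict, canary_percent: int,
                canary_users=frozenset()) -> str:
    """Decide at the gateway whether a request is tagged 'gray' or 'base'."""
    # A request that already carries the tag keeps it (tag propagation).
    if headers.get("x-lane") == "gray":
        return "gray"
    # Routing by customer characteristics: an explicit allowlist of users.
    if user_id in canary_users:
        return "gray"
    # Routing by traffic ratio: a stable hash keeps each user in the same lane.
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return "gray" if bucket < canary_percent else "base"


def pick_version(service: str, lane: str, deployed: dict) -> str:
    """Pick the version of a service for a tagged request.

    deployed maps a service name to the set of lanes it is deployed in.
    Services without a gray version fall back to the base version.
    """
    if lane in deployed.get(service, set()):
        return lane
    return "base"
```

Raising `canary_percent` step by step corresponds to gradual traffic amplification, and setting it back to 0 cuts all untagged traffic back to the base version.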
Development Environment Isolation
The iteration of microservices has a lot of dependencies. Business developers cannot complete the development locally. They must use a complete set of environments to perform development and joint debugging. There are dozens of applications in the ZEEKR app system. If each development environment maintains a microservice environment provided by a complete set of app systems, it will consume a lot of manpower and resource costs.
Ideally, logical isolation of the development environment works as follows. Following the design concept of git branches, a stable baseline environment is retained, and developers on each branch can quickly bring up the feature environment they need through logical environment isolation. Only one complete baseline environment has to be maintained. When a feature development environment is added, only the applications changed by that feature are deployed separately, instead of deploying a complete set of microservice applications and supporting facilities in each feature environment. The baseline environment contains all microservice applications and other facilities (such as service registries, domain names, SLB, and gateways), while the feature environment contains only the applications modified by the feature. As such, the cost of maintaining n feature environments for a system of m applications changes from multiplication to addition, from n×m to n+m. Adding a feature environment then costs almost nothing, so multiple feature environments can be created safely. The ZEEKR Team uses the end-to-end grayscale solution in microservice governance to implement traffic swimlanes. This way, the team can quickly build an isolated development environment, improve R&D efficiency, and reduce costs.
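The n×m versus n+m comparison can be checked with a few lines of arithmetic, together with the fallback rule that lets a feature environment borrow every unchanged application from the baseline. The numbers and names below are hypothetical, and the n+m figure holds when each feature changes roughly one application.

```python
def full_copy_cost(num_features: int, num_apps: int) -> int:
    """Deployments needed when every feature environment copies all apps (n x m)."""
    return num_features * num_apps


def swimlane_cost(num_apps: int, changed_apps_per_feature: list) -> int:
    """Deployments needed with one baseline plus only the changed apps per feature."""
    return num_apps + sum(changed_apps_per_feature)


def resolve_chain(chain: list, feature: str, feature_apps: dict) -> list:
    """For each app in a call chain, use the feature copy if one exists, else baseline."""
    deployed = feature_apps.get(feature, set())
    return [(app, feature if app in deployed else "baseline") for app in chain]
```

With 40 applications and 5 feature environments, full copies would need 200 deployments, while the swimlane approach needs 40 plus the handful of changed applications. A call such as `resolve_chain(["gateway", "order", "pay"], "feat-a", {"feat-a": {"order"}})` routes only `order` to the feature environment and everything else to the baseline.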
End-to-End Stress Testing and Tuning
To find out the actual concurrent capacity that the app can carry, it is necessary to perform multiple rounds of end-to-end stress testing and tuning for the core service interfaces. System capacity evaluation, optimization, and protection can be summarized in four steps: stress test, observe, throttle, and scale out. A highly available system can only be built through practice. The ZEEKR Team surveyed the service capability of the app through stress testing to evaluate whether its performance was acceptable. If the performance is unacceptable, the system needs to be scaled out and optimized. If the performance is as expected, throttling rules must be configured to prevent service failure when traffic exceeds the expected level.
Throughout the stress testing drill, the team repeatedly stress tests, observes, throttles, and scales out while continuously collecting feedback and making adjustments, which establishes a process that ensures the high availability of the business system. The end-to-end stress test lets everyone know the performance and capacity of the app system and strengthens confidence in upgrading the whole production system to the cloud-native architecture.
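The decision after each round of stress testing, and the throttling rule it may lead to, can be sketched as below. Both parts are illustrative assumptions: the 80% headroom, the function `plan_after_stress_test`, and the token bucket limiter are not taken from ZEEKR's configuration, and a token bucket is only one common way to implement a QPS limit.

```python
import time


def plan_after_stress_test(measured_qps: int, expected_peak_qps: int,
                           headroom_percent: int = 80):
    """Return the next action after a stress test round (assumed policy)."""
    if measured_qps < expected_peak_qps:
        # Capacity is below the expected peak: scale out and optimize first.
        return ("scale_out_and_optimize", None)
    # Capacity is sufficient: protect it with a limit below the measured maximum.
    return ("throttle", measured_qps * headroom_percent // 100)


class TokenBucket:
    """Simple QPS limiter: allows bursts up to `burst`, refills at `rate_per_s`."""

    def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic):
        self.rate = float(rate_per_s)
        self.capacity = float(burst)
        self._clock = clock
        self._tokens = float(burst)
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may pass, False if it should be throttled."""
        now = self._clock()
        # Refill tokens for the time elapsed since the last call.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

Under these assumptions, an interface measured at 1,500 QPS against an expected peak of 1,000 QPS would be throttled at 1,200 QPS, while one measured at 800 QPS would go back for scaling out and optimization before any limit is set.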
Outlook
The ZEEKR app has explored the upgrade to a cloud-native architecture, improved the stability and agility of the C-side business system, and provided solid technical support for higher sales targets. This is just the beginning of the exploration. As the cloud-native architecture deepens, the availability of the business will continue to increase, offering car owners a better travel experience and more fun.