To System Architects: How to Design a System Better

A personal course and guide on how to better design a system from an experienced Alibaba system architect.


As a system architect (also spelled systems architect) at Alibaba, I have always felt that teaching system design is far more difficult than teaching Java programming skills. In fact, I have long felt that a class on system design can easily become one where we discuss nothing but theory and end up with disappointing results.

I’ve often heard people say that anyone, even without much experience in system design, could easily teach how to design a system, but this is far from the truth. Still, because of this saying, I always felt a bit discouraged about being a trainer in this field. Some situations over this past year finally gave me the courage to run a private training program internally, which turned out to be popular. I think that I was the one who got the most out of this training program: I got to organize my ideas, learn a lot from my interaction with trainees, and better abstract and summarize methodologies of system design. Given all the great feedback and input I received, it is fair to say that the system design courses were jointly created by me and my trainees.

I set the following goals for system design training:

  • Master the conceptual framework and understand the routine of system design. System design — also referred to as systems design or systems architecture — is not simply about drawing the boxes of a structure chart. You must follow certain procedures to better design the system.
  • Broaden your knowledge base. In systems design, you must consider things comprehensively to better balance the trade-offs. Therefore, it is important to broaden your knowledge by means of system design training.

To achieve these goals, I will analyze the following issues before giving the class:

  • What is the conceptual framework that I want to present?
  • How can I enable trainees to better master and use the conceptual framework without dwelling on theories alone?

In the past, I did not think carefully about the conceptual framework of system design. In fact, many system design templates are a type of conceptual framework, but they are difficult to apply if you do not understand them. In this post, I hope to help you avoid the mistakes I made in the past.

Routine of System Design

Purpose of system design > goals of system design > goal-centered core design > design principles formulated based on core design > detailed design of each subsystem and module

Purpose of System Design

Goals of System Design

Goal-centered Core Design

Design Principles Formulated Based on Core Design

Detailed Design of Each Subsystem and Module

At its core, my approach to helping you master and apply the conceptual framework of system design is to share my past errors and practical experience as I elaborate on each step. I think that only architects with a lot of practical experience are capable of providing training courses in system design.

In the application part of the overall process, the method adopted here is to let everyone elaborate on his or her own system by following the same conceptual framework, to align with each other through interaction, and to make this practice a habit.

After a brief description of each step, I would like to elaborate on the first three steps.

Step-by-step Elaboration on System Design

Purpose of System Construction

System design is the process of designing a new system, or significantly transforming or upgrading the architecture of an existing one, to serve a definite purpose. By analyzing the purpose of system construction, you can avoid problems from the get-go. System construction should address the challenges or problems at the service layer or system user layer, rather than meet your own personal requirements. Through purpose analysis, you can also ensure that the purpose is achieved in subsequent steps throughout the system design process.

By analyzing the purpose of system construction, you can readily identify the overall pattern and the depth of the system. These two concepts sound abstract but are actually rather down-to-earth: they show the scope of impact of the work you have done. The impact may extend to your team, your department, the business unit (BU), the business group (BG), a cross-BG business module, the entire group, or even society at large. The norm here is to seek the truth behind the facts. That is, I do not recommend that you indulge in talk about theories while being at a loss for what to do in system design. Rather, I recommend that you get your hands dirty in the actual design work.

Looking back on my early experience in developing the High-speed Service Framework (HSF), I conclude that my lack of a clear purpose of system construction led me to make several serious errors. A typical error occurred during my effort to restructure HSF for a dynamic system architecture. If I had analyzed the purpose and intent of my effort at that time, I would have found that my work came solely from my own personal wish to delve into the technology, rather than from a need to solve the service challenges or problems facing HSF users. This is exactly the kind of problem that purpose analysis is meant to catch at the very start of the process. Many engineers make the same error when they initiate major system restructuring solely to meet technical demands. I used to be under the mentorship of an Alibaba senior executive who advised me to always clarify the purpose and cause of a job before engaging in it. This helped me realize that I can get rid of many complex technical details by clearly understanding the motivation before starting the work.

Due to my experience in HSF and Alibaba HBase development, I was able to better grasp the purposes of my subsequent work in developing Alibaba Cloud’s Container Service, as well as scheduling services and active geo-redundancy applications. At the same time, I was also able to orient my work toward the need to solve the service challenges of Alibaba.

In most cases, the need for system design is proposed by someone other than the system designer. The architect should therefore be able to turn that need into a design decision, that is, to determine whether to build a new system or to reconstruct or upgrade the old one based on the demand or request. The architect also needs to understand the purpose of system construction. Only then can the architect explain the motivation of system design to the technical team so that they understand its value and meaning.

In my opinion, we need to clearly analyze the purpose of system construction before conducting system design to ensure that system construction is of value and meaning and that the goals of system design can be achieved.

Goals of System Construction

Let me provide two examples:

  • In 2011, my team started to build a containerized system to deal with expected increasing server costs, with the goal of supporting the same service volume by using half of the existing servers.
  • In 2013, my team started to build an active geo-redundancy system to improve the disaster recovery capability of services. Later, we found more benefits of active geo-redundancy that were not planned in the construction purpose of initial system design. For example, active geo-redundancy enables the elastic leverage of cloud resources and accelerates the evolution of infrastructure technology. In the initial design stage, we set the goals of deploying services in multiple locations in China (with the distance between locations being more than 1 km) to carry traffic and switching traffic from location A to B within 30s.

With clear and measurable goals of system construction, we can:

  • Orient system design toward the goals in a targeted manner and avoid deviations.

Measurable goals also mean that you should build a system to monitor the effect of system construction. This is the most important piece of the puzzle of system design, but it is easily ignored. For example, we displayed the number of servers supporting services in containerized clusters and the volume of supported services, and compared these two figures with those of uncontainerized clusters. We also built a control system for active geo-redundancy to show the system deployment status and traffic switching status. Only by building a system to monitor whether the goals of system construction are met can we ensure that the constructed system satisfies our initial motivation. Otherwise, we may run counter to the purpose of system construction after the system is implemented. Therefore, the monitoring system must be built during system construction.

The process of formulating goals based on a purpose is not theoretically complex but is easily ignored, resulting in problems during system design. The key is to formulate measurable goals and build a system to monitor the progress towards achieving the goals.
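As a minimal sketch of what "measurable" can look like in practice, the two goals mentioned above (half the servers for the same service volume, and traffic switching within 30 seconds) can be written as checks that a monitoring system evaluates. The class, field, and function names here are hypothetical and chosen only for illustration; they do not describe the actual monitoring system.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Hypothetical snapshot of one cluster; field names are illustrative."""
    servers: int
    service_volume: float  # e.g. requests handled per day


def servers_per_unit_volume(stats: ClusterStats) -> float:
    """Servers needed to carry one unit of service volume."""
    return stats.servers / stats.service_volume


def meets_halving_goal(baseline: ClusterStats, containerized: ClusterStats) -> bool:
    """True if the containerized cluster needs at most half the servers
    per unit of service volume compared with the uncontainerized baseline."""
    return servers_per_unit_volume(containerized) <= 0.5 * servers_per_unit_volume(baseline)


def meets_switch_goal(switch_seconds: float, limit_seconds: float = 30.0) -> bool:
    """True if a measured traffic switch finished within the goal."""
    return switch_seconds <= limit_seconds


# Hypothetical figures, for illustration only.
baseline = ClusterStats(servers=1000, service_volume=1_000_000)
containerized = ClusterStats(servers=450, service_volume=1_000_000)
print(meets_halving_goal(baseline, containerized))
print(meets_switch_goal(25.0))
```

The point of the sketch is that each goal maps to a number the system can report continuously, so deviations show up during construction rather than after launch.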

Core Issues Related to Goal Achievement

I would like to talk about this topic based on my experience.

We had a clear purpose for HSF development from the start, with the measurable goal of supporting hundreds of millions of service calls per day. However, due to our limited technical skills at that time, the core issues that we refined differed greatly from the actual situation, which led us to constantly restructure and modify HSF. Therefore, I will never believe that a person with poor technical skills can be a good architect, because it takes solid technical skills for an architect to draw the boxes of a structure chart in a rational way.

The process of drawing may appear simple to people who only pay attention to the boxes. During the initial HSF design, we identified the core issue of how to implement a user-friendly and service-defined Remote Procedure Call (RPC) framework, but we ignored the issues of how to support hundreds of millions of interactive calls and what problems (such as complex troubleshooting) would occur in service R&D after servitization. For example, the issue of intermediate load balancing was identified only after HSF was launched and led to a redesign of the HSF structure. An architect with broader knowledge would have identified this issue from the start.

In hindsight, to achieve the goals of HSF design, we needed to solve the following core issues:

  • A user-friendly RPC framework with support for hundreds of millions of service interactions
  • Inter-service software load balancing
  • Troubleshooting of service interaction

In hindsight, we formulated a clear purpose and related goals and properly refined issues in the T4 (containerization) phase, during which we solved the core issue of how to run 20 applications on a server. Most of the problems that occurred during T4 were related to the design scheme for the core issue. I will discuss this in the next topic about system design based on core issues.

We also formulated a clear purpose and related goals when developing active geo-redundancy. Although we tried to simultaneously carry traffic in multiple cities in China and to switch traffic within dozens of seconds, we could not eliminate the network latency that results from the physical distance present in active geo-redundancy applications. To enable dynamic traffic switching in active geo-redundancy applications, we need to solve the following core issues:

  • How could we split traffic and implement the entire request processing process within the local region?
  • How could we ensure data consistency after active geo-redundancy is implemented?

The conceptual framework of system design had matured over our recent years of work on unified scheduling, so we had a clear purpose and goals in this area. In light of the actual situation we encountered, we needed to solve the following core issues to implement unified scheduling:

  • How could we implement an online service resource scheduling system to meet all kinds of resource demands?
  • How could we expand the unified resource pool as much as possible to solve pool-related problems, such as resource competition, resource theft, and different resource specifications?
  • How could we interconnect the online service scheduling system and the offline task scheduling system?
  • How could we solve resource competition when online services and offline tasks are deployed in hybrid mode?

The preceding cases indicate that technical skills are required to solve the core issues when measurable goals are mapped to the technical level. For projects and products of the engineering type, engineering experience is also essential in solving these core issues and is a direct measurement of an architect’s competence.


Design for Solving Core Issues

I will talk about how to design to solve core issues based on my cases. My previous experience reveals that I made quite a few errors and encountered many complex trade-offs in system design, but my experience also allowed me to gradually understand the capabilities that a competent architect should have.

HSF Design

The first core issue to solve in the initial stage of HSF design was to build an easy-to-use RPC framework able to support hundreds of millions of service calls per day.

I brushed away the issue of ease of use and thus made an error in the initial version, but fortunately it didn’t cost much to correct. In this version, to publish a Spring bean as an HSF service or to call an HSF service, I needed to write a file describing the service to publish or call, and put the file in a JBoss directory. Although this method does not seem to intrude into the coding process, it raises a series of complex issues, such as where to write the file and how to automatically put the completed file in the directory during deployment. In the second version, this method was changed so that services are published and called by using a Spring bean. Although the modified method makes the service code depend on HSF, it standardizes the maintenance and deployment processes. Therefore, to design properly, we need to consider not only how to implement a system but also how the system will be used and how it will be run and maintained.

The second error that I made in HSF development is related to the RPC framework able to support hundreds of millions of service calls per day. This error taught me the biggest lesson in my coding career and even totally changed my technology selection style in subsequent design projects. Before engaging in HSF development, I had never built a system with more than 1 million visits per day and had no clue how a system with hundreds of millions of visits per day would be any different. JBoss Remoting was selected as the communications framework in the initial version of HSF simply because we used JBoss as the web container.

However, this version experienced a serious fault when a major system went online, causing the response of the entire website to slow down greatly. We could not identify the cause after a whole day of troubleshooting and had to roll back, but we were sure that the fault had been caused by the HSF launch. We identified the cause one week after the rollback. JBoss Remoting specified a default timeout period of 60s for remote calls, while the backend system was slow when processing some services. Together, these caused the shared processing thread pool to fill up and thus slowed down the website response.

This error made me realize that, to design a system with a high access volume, I need to be clear about the processing mechanism of the system, because minor problems may escalate uncontrollably and cause faults under heavy access. Therefore, a system with a high access volume has high requirements for controllability, which makes such systems different from systems with average access volumes. Controllability does not mean that you have to write all of the code yourself, but that you must be clear about the logic of any open-source code you use. To correct this error, I wrote a dedicated Mina-based communications framework for HSF to process the connection method and thread pool in a special way. I have followed the principle of technical controllability during my subsequent projects of HSF transformation and other technical transformations.
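The arithmetic behind this kind of fault can be sketched with Little's law: the average number of threads held equals the call rate multiplied by the time each call holds a thread. A caller holds its thread until the backend answers or the timeout fires, whichever comes first. The sketch below is only an illustration under assumed numbers; the call rate, backend latency, and pool size are hypothetical, and only the 60s default timeout comes from the account above.

```python
def threads_held(calls_per_second: float, hold_seconds: float) -> float:
    """Little's law: average busy threads = arrival rate * time each call holds a thread."""
    return calls_per_second * hold_seconds


def pool_exhausted(calls_per_second: float, backend_latency: float,
                   timeout: float, pool_size: int) -> bool:
    """A call holds its thread until the backend answers or the timeout fires,
    whichever comes first."""
    hold = min(backend_latency, timeout)
    return threads_held(calls_per_second, hold) >= pool_size


# Hypothetical load: 10 calls/s to a backend that takes 90s to respond,
# served from a shared pool of 200 threads.
print(pool_exhausted(10, backend_latency=90, timeout=60, pool_size=200))  # 600 threads needed
print(pool_exhausted(10, backend_latency=90, timeout=3, pool_size=200))   # 30 threads needed
```

Under these assumed numbers, a 60s timeout lets a modest stream of slow calls occupy far more threads than the shared pool has, while a short timeout keeps the same stream well within it. This is why knowing the defaults of a borrowed framework matters at high access volumes.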

When I talked about core issues previously, I mentioned that the refinement of core issues during HSF design was problematic, causing extensive rework on load balancing and post-servitization troubleshooting. These errors can be avoided and are no longer made by developers of service-oriented frameworks.

Load balancing was implemented by hardware load balancers in earlier versions, which caused several problems. One problem was that the virtual IP address of the service to call had to be configured, for example, by using a central configuration server. Another problem was that HSF uses persistent connections, which caused further complications when the virtual IP address was used to connect to a backend cluster. For example, when the backend cluster was restarted for a release, the distribution of connections could become highly uneven, thus causing faults.
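The uneven distribution can be illustrated with a small simulation. This is a simplified sketch, not a model of the actual HSF deployment: it assumes the load balancer spreads new connections round-robin, that clients reconnect immediately when their backend goes down, and that clients on healthy backends never reconnect. All names are hypothetical.

```python
from collections import Counter


def connect_all(clients, backends):
    """Round-robin initial assignment, standing in for a VIP that spreads new connections."""
    return {c: backends[i % len(backends)] for i, c in enumerate(clients)}


def restart_backend(assignment, restarted, backends):
    """Clients on the restarted backend reconnect right away to the backends that
    are still up; every other client keeps its persistent connection."""
    alive = [b for b in backends if b != restarted]
    new_assignment = dict(assignment)
    moved = [c for c, b in assignment.items() if b == restarted]
    for i, c in enumerate(moved):
        new_assignment[c] = alive[i % len(alive)]
    return new_assignment


clients = [f"client-{i}" for i in range(12)]
backends = ["a", "b", "c"]

before = connect_all(clients, backends)
after = restart_backend(before, "a", backends)

print(Counter(before.values()))  # 4 connections on each backend
print(Counter(after.values()))   # 6 on "b", 6 on "c", none on "a"
```

Once backend "a" comes back, it holds no connections, because nothing forces the existing persistent connections to move. Restarting the backends one after another during a release makes the skew worse.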

In addition to the preceding two problems, another problem led us to transform HSF: the hardware load balancers in use at the time had reached their maximum traffic capacity. This was bound to happen sooner or later and would crash the website. To remove this high risk and solve the preceding two problems, we decided to thoroughly transform HSF by designing software-based service registration, discovery, and addressing, which is the typical structure of a service framework system.
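The general shape of software-based registration, discovery, and addressing can be sketched in a few lines. This is an in-memory illustration only, with hypothetical class names, service names, and addresses; it is not HSF's design, and a real registry would also need health checks, change notifications, and a networked store.

```python
import itertools
from collections import defaultdict


class ServiceRegistry:
    """In-memory stand-in for a registry that maps service names to provider addresses."""

    def __init__(self):
        self._providers = defaultdict(list)

    def register(self, service, address):
        if address not in self._providers[service]:
            self._providers[service].append(address)

    def deregister(self, service, address):
        if address in self._providers[service]:
            self._providers[service].remove(address)

    def lookup(self, service):
        return list(self._providers[service])


class RoundRobinClient:
    """Picks a provider on the caller side, so no hardware load balancer sits in the path."""

    def __init__(self, registry, service):
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def pick(self):
        providers = self._registry.lookup(self._service)
        if not providers:
            raise LookupError(f"no provider registered for {self._service}")
        return providers[next(self._counter) % len(providers)]


registry = ServiceRegistry()
registry.register("order-service", "10.0.0.1:12200")
registry.register("order-service", "10.0.0.2:12200")

client = RoundRobinClient(registry, "order-service")
print([client.pick() for _ in range(3)])

registry.deregister("order-service", "10.0.0.1:12200")
print(client.pick())
```

Because the caller chooses the target itself, traffic no longer passes through a single device with a fixed capacity, and a restarted provider starts receiving calls again as soon as it re-registers.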

In hindsight, the defective load balancing function resulted from our failure to consider comprehensively what a system with a high access volume requires.

Our initial design did not take post-servitization troubleshooting into account. As a result, we had to invest a lot of manpower in fixing problems, and with low efficiency. To improve troubleshooting, we drew on Google’s Dapper, but the implementation took a lot of time.
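The basic idea behind distributed tracing is that every call made on behalf of one user request carries the same trace identifier, so that logs from different services can be stitched back together. The sketch below shows only that idea, under assumed names and an assumed dotted span-numbering scheme; it is not Dapper's actual data model or the implementation we built.

```python
import uuid


def new_trace_context():
    """Created once at the entry point of a request (e.g. the web tier)."""
    return {"trace_id": uuid.uuid4().hex, "span_id": "0"}


def child_context(parent, index):
    """Passed along with each downstream call: same trace id, and a span id
    that records the caller's position in the call tree."""
    return {
        "trace_id": parent["trace_id"],
        "span_id": f"{parent['span_id']}.{index}",
    }


root = new_trace_context()
first_call = child_context(root, 1)         # first downstream call from the entry point
nested_call = child_context(first_call, 1)  # a call made by that downstream service

print(root["span_id"], first_call["span_id"], nested_call["span_id"])  # 0 0.1 0.1.1
```

With such a context attached to every request, searching logs by trace identifier returns the whole call chain, which is what makes troubleshooting across many services tractable.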

There were other HSF design problems as well. For example, the earliest communications protocol did not carry a version number, which complicated compatibility handling during subsequent upgrades. A tougher problem was multi-language support.
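To show why a version field helps, here is a minimal sketch of a framed message whose header starts with a version byte. The layout (one version byte followed by a four-byte body length) is an assumption made up for this example and is not the HSF wire format.

```python
import struct

# Assumed header layout for illustration: 1-byte version, 4-byte big-endian body length.
HEADER = struct.Struct("!BI")
SUPPORTED_VERSIONS = {1, 2}


def encode(version: int, body: bytes) -> bytes:
    """Prefix the body with the version and its length."""
    return HEADER.pack(version, len(body)) + body


def decode(frame: bytes):
    """Read the version first, so the receiver can choose how to parse the rest
    or reject a frame it does not understand."""
    version, length = HEADER.unpack_from(frame)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported protocol version: {version}")
    body = frame[HEADER.size:HEADER.size + length]
    return version, body


print(decode(encode(1, b"hello")))  # (1, b'hello')
```

Without the version byte, a receiver has no reliable way to tell an old frame from a new one, so every protocol change has to be handled by guessing from the payload.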

HSF was the first core system with a high access volume that I had ever designed. I made numerous errors due to my insufficient technical skills at that time, causing countless revisions, faults, and repairs, but I also improved greatly. In retrospect, I am grateful to my supervisor, who gave me great tolerance and support. My experience in HSF design made me realize that, to solve the core issues of design, an architect must have deep technical skills for technology selection, extensive knowledge of design schemes, and comprehensive consideration of development, deployment, operation, and O&M.

T4 Design

I did not have many difficulties in refining the core issues of T4, but made stark errors in the design that aimed to solve the core issues.

To run more application processes on a server than on a virtual machine, we isolated processes by means of hacks. Although applications still worked, it was hard to enumerate every case that the hacks had to cover, and this became a problem once the approach had launched on a small scale and had an established user base. We solved this problem only after we found LXC.

We also encountered many similar problems in the T4 phase, such as how to control disk space limits, which were handled by using the image method at first. However, the image method was not friendly in the case of disk space overselling, so we resorted to the dir quota method, which took us more than a month to implement because we had to write .cp files for online applications.

The errors I made in HSF design were mostly due to my lack of deep technical skills, while the errors I made in T4 design were mostly due to my narrow technical perspective at that time. I believe that a competent architect should have an open perspective in the technical field of interest, with sufficient knowledge in engineering and academics in this field, so as to select proper technologies to achieve the purpose and goals of design under certain constraints. I once wrote an article about how to expand our technical perspective.

Design of Active Geo-redundancy

In retrospect, my design of active geo-redundancy was more of a process of making choices based on my previous experience, with a controlled error rate. Therefore, I will talk about how to make trade-offs under constraints to solve the core issues of active geo-redundancy design.

The core issues of active geo-redundancy design are request closure and data consistency. To solve them, we referred to some engineering cases but found that our situation was quite different.

Here I will present some of the choices that had to be made when designing active geo-redundancy, so that you can think about and discuss them. I will not go into the reasoning behind my own selections.

  • What are the rules for traffic or data splitting? Should splitting be based on buyers, sellers, or products?
  • What is the relationship between the traffic offloading rules and the database and table sharding rules? Is the relationship loose coupling or strong binding?
  • Should we synchronize partial or full data?
  • At what plane (such as CAP) should we ensure data consistency?
  • Should we deploy active geo-redundancy in double- or triple-region mode and how should we determine regional distribution?
  • How long should it take to implement active geo-redundancy: one year, two years, or three years?
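To make the first choice concrete, here is a minimal sketch of one of the listed options: splitting traffic by buyer. It is shown only as an illustration of what a splitting rule looks like, not as the rule that was actually selected. The region names and the hashing scheme are hypothetical.

```python
import hashlib

# Hypothetical region names, for illustration only.
REGIONS = ["region-a", "region-b"]


def region_for_buyer(buyer_id: str, regions=REGIONS) -> str:
    """Map a buyer to one region with a stable hash, so that every request from
    the same buyer is processed in the same region."""
    digest = hashlib.sha256(buyer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[bucket]


for buyer in ["buyer-1", "buyer-2", "buyer-3"]:
    print(buyer, "->", region_for_buyer(buyer))
```

A rule like this keeps a buyer's requests inside one region, which is the request closure mentioned above. It also shows why the other choices are coupled to it: data keyed by seller or product does not follow the buyer, and a modulo-based mapping moves many buyers when the number of regions changes.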

If you are interested in active geo-redundancy design, you can search for more information on the Internet.

In the future, I will elaborate on my work over the past few years in designing unified scheduling systems and cloud-based systems.

Summary of the Capabilities of a Competent Architect

  • Understand service challenges and map service challenges to technical challenges, or abstract technologies.
  • Possess the required knowledge and give comprehensive consideration to the entire process from development, deployment, and operation to maintenance.
  • Possess the capability to select technology, deep technical skills, and an open technical perspective.
  • Make trade-offs under constraints based on well-reasoned principles.

With all of this said, I think that “architect” is not a generic title. It takes a lot of effort, long-term practice, and an accumulation of experience before someone can become a qualified architect, especially an engineering architect.

For me, system design is the most complex topic to give training in. I am grateful to the colleagues who attended my system design training courses and helped me with my writing and discussion on system design. Through all of this work, I’ve realized that system design is actually a rather down-to-earth job whose skills can be trained and learned over time.

(Original article by Lin Hao林昊)

Alibaba Tech
