NetCraft: a Master Platform for Global Network Maintenance and Upgrades
This article is part of the Academic Alibaba series and is taken from the paper entitled “Automatic Life Cycle Management of Network Configurations” by Hongqiang Harry Liu, Xin Wu, Wei Zhou, Weiguo Chen, Tao Wang, Hui Xu, Lei Zhou, Qing Ma, and Ming Zhang, accepted by SIGCOMM 2018. The full paper can be read here.
If a network were a body, the process of managing the configurations of network devices could be likened to a combination of the nervous system and the immune system.
Like the nervous system, it aims to ensure each device functions properly and integrates seamlessly with other devices. Like the immune system, it seeks to maintain proper functioning over time, keeping the network secure in response to new threats.
In a nutshell, this means ensuring proper network configurations and managing the life cycles of those configurations.
Proper device configurations keep a network’s reliability, cost efficiency, and security in balance. Currently, highly specialized network operators have to manage configurations and their life cycles manually to maintain a smooth operation.
This is a major task for two key reasons. Firstly, configurations are frequently updated to mitigate incidents, adapt to network changes, and accommodate new application requirements. Secondly, frequent checkups and diagnoses are required to mitigate the very real risk of configuration updates jeopardizing the network as a whole.
This human involvement slows down the network management process, but full automation has not been possible because there is no unified network model that abstracts network configurations.
Researchers at Alibaba, however, have taken a step towards this once-impossible goal. Alibaba’s tech team has created NetCraft, a framework that can expressively encode all parts and protocols in a given network. The generation, update, transition, and diagnosis of configurations are performed automatically by software. Network operators simply use this framework to implement changes or network updates.
Creating a unified network model presents a set of unique challenges. A model must describe all parts of network configurations. To achieve automation, the model must be able to be freely translated into configurations or be constructed from existing configurations. The model must also be able to describe fine-grained operations in network configurations, and it must easily deactivate, activate, or undo any configuration module in order to perform network updates and smooth transitions.
Complicating things further, the model must not depend on any cooperation from device vendors: standardization among vendors is usually slow, and waiting for it would slow the model down as well.
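To make the translation requirement more concrete, here is a minimal Python sketch of a model part that can be rendered into configuration text and rebuilt from that text. It is not taken from the paper: the `Interface` class, the `to_config` and `from_config` helpers, and the configuration syntax are all hypothetical and exist only to illustrate the round trip.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    """Hypothetical model part describing a single interface."""
    name: str
    description: str


def to_config(iface: Interface) -> str:
    """Translate the model part into configuration text (made-up syntax)."""
    return f"interface {iface.name}\n description {iface.description}"


# Pattern matching exactly the text that to_config() produces.
_PATTERN = re.compile(
    r"interface (?P<name>\S+)\n description (?P<description>.+)"
)


def from_config(text: str) -> Interface:
    """Rebuild the model part from existing configuration text."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("unrecognized configuration block")
    return Interface(**match.groupdict())


original = Interface(name="eth0", description="uplink to core")
# Going model -> configuration -> model should lose nothing.
assert from_config(to_config(original)) == original
```

A real model would have to cover far more than one interface block, but the property being checked is the same: whatever the model emits, it must also be able to read back.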
To address these challenges, NetCraft distinguishes between different network layers and protocols. It then breaks down configurations into reusable modular templates and associates each one with a part (node, interface, edge, or property) of the network model. This supports interoperability, since each of these parts can be easily converted to and from a configuration module. During network updates, the scope of each configuration module is limited to simple operations, and each module also has several shadow modules for deactivating, activating, or reversing its effects on the network.
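As a rough illustration of this design, the Python sketch below binds a configuration module to one part of the model and gives it shadow templates for deactivating, activating, and undoing its effect. The class names, the action names, and the configuration syntax are assumptions made for this example; the article does not describe NetCraft's actual data structures at this level of detail.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class ModelPart:
    """A part of the network model: node, interface, edge, or property."""
    kind: str
    name: str
    attrs: dict = field(default_factory=dict)


@dataclass
class ConfigModule:
    """Hypothetical module: one template per action, bound to one part."""
    part: ModelPart
    templates: dict  # action name -> string.Template

    def render(self, action: str = "apply") -> str:
        # Fill the chosen template from the part's name and attributes.
        values = {"name": self.part.name, **self.part.attrs}
        return self.templates[action].substitute(values)


# Made-up syntax: the main template plus its three shadow templates.
INTERFACE_TEMPLATES = {
    "apply": Template("interface $name\n description $description\n no shutdown"),
    "deactivate": Template("interface $name\n shutdown"),
    "activate": Template("interface $name\n no shutdown"),
    "undo": Template("no interface $name"),
}

uplink = ModelPart("interface", "eth0", {"description": "uplink to core"})
module = ConfigModule(part=uplink, templates=INTERFACE_TEMPLATES)

print(module.render("apply"))
print(module.render("deactivate"))
```

Keeping each module this small is what makes the shadow templates easy to write: the narrower the effect of a module, the more obvious its reverse.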
Additionally, NetCraft maintains independence from vendors by dealing directly with vendor-specific configuration semantics, avoiding the need for semantic standardization.
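One way to picture this is a set of per-vendor templates that express the same model-level setting in each vendor's own terms. The following Python sketch assumes two fictional vendors, `vendor_a` and `vendor_b`, with made-up syntax, and a hypothetical `render_mtu` helper; none of these names come from the paper.

```python
from string import Template

# Hypothetical templates: the same intent written in each vendor's syntax.
VENDOR_TEMPLATES = {
    "vendor_a": Template("interface $name\n mtu $mtu"),
    "vendor_b": Template("set interfaces $name mtu $mtu"),
}


def render_mtu(vendor: str, name: str, mtu: int) -> str:
    """Render one model-level setting in the given vendor's syntax."""
    try:
        template = VENDOR_TEMPLATES[vendor]
    except KeyError:
        raise ValueError(f"no template registered for vendor {vendor!r}") from None
    return template.substitute(name=name, mtu=mtu)


for vendor in VENDOR_TEMPLATES:
    print(render_mtu(vendor, "eth0", 9000))
```

Supporting another vendor in this sketch means adding another template, with no change to the model and no agreement needed between vendors.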
Putting NetCraft in Action
Researchers built an initial version of NetCraft and deployed it in Alibaba’s global WAN. Alibaba has a global-scale infrastructure that supports various types of online services such as online retail, cloud, and mobile payment. Billions of people use these services, making this infrastructure the perfect platform for NetCraft to prove itself on the global stage.
Over the course of a year, evaluations showed that NetCraft reduced network incidents caused by configurations by 95%, and cut the average time to plan and execute a network update by up to 93%. The average time to onboard a device onto the WAN was cut by 83%. Building on this initial deployment, researchers are working to push NetCraft to the next level by automating transition planning, scheduling multiple updates, and perhaps moving closer to the dream of full automation.