The Achievement of a Computing Lifetime: Alibaba AI•OS 10 Years On
How a set of five basic components became the heart and backbone of Alibaba’s online services
Alibaba’s AI•OS, a big data-based deep learning online service system, stands today as the monumental result of over ten years’ rigorous development from the group’s engineering, algorithm, and efficiency experts. As the backbone for all search and recommendation services throughout the Alibaba e-commerce ecosystem at home and abroad, AI•OS is active on the main stage of data operations at all times, with the majority of the group’s transactions occurring under its guidance. Additionally, as the key framework for Alibaba’s enabling platform technologies, it provides an infrastructure for the group’s operations as a whole, playing a key role in its Alibaba Cloud services, Youku online video platform, Cainiao delivery services, Hema connected supermarket, and DingTalk workplace chat system, among other applications. Most importantly, the AI•OS system’s cloud product matrix serves developers worldwide, and is expected to contribute tens of millions of yuan in revenue this year.
In celebration of this achievement, Alibaba’s Distinguished Engineer of Search Business Unit Shen Jiaxiang recently collected his observations on the past decade’s achievements, presenting his vision for how engineering can drive business innovation through the effective deployment of data and algorithms. In this article, we share his insights on this milestone accomplishment and the upgrade it has brought to Alibaba’s search engine and recommendation systems, largely owing to his team’s readiness to adapt and transform to become responsible for development of Alibaba’s crucially important enabling platform technologies.
Adapting Key Components: Tracing AI•OS’s Evolution
Over the course of ten years in development, AI•OS has undergone considerable changes to serve an increasingly diverse range of applications, ultimately becoming ubiquitous in a majority of Alibaba’s operations.
Most fundamentally, AI•OS focuses on online services for deep learning. While some of its components serve specialized functions, such as Jarvis, which handles deployment on mobile phones, AI•OS’s core functionality rests with just five service components: the TPP recommendation service platform, the RTP deep learning inference engine, the HA3 search recall engine, the DII recommendation recall engine, and the iGraph graph query engine. The primary algorithmic scenarios on AI•OS, including the Taobao mobile app’s search function and smart recommendation systems, can all quickly assemble and deploy these components as experimental traffic streams by customizing a flowchart of computation operators. This allows online services to receive updates and carry out training at the same time without interfering with the model being trained, effectively setting a new bar for iterative efficiency.
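The idea of assembling components into a customizable flowchart of computation operators can be sketched as follows. This is only an illustration under stated assumptions: the operator names, the `run_flow` helper, and the toy catalog are hypothetical and are not AI•OS APIs. It shows a recall step, an inference step, and a ranking step wired together as a dependency graph, with an experimental stream created by swapping a single operator.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

# Hypothetical operator type: takes the shared context, returns new fields.
Operator = Callable[[dict], dict]


def recall(ctx: dict) -> dict:
    # Stand-in for a recall engine (the role HA3 or DII plays in the article).
    catalog = {"shoe": ["item-1", "item-2", "item-3"]}
    return {"candidates": catalog.get(ctx["query"], [])}


def score(ctx: dict) -> dict:
    # Stand-in for model inference (the role RTP plays): a toy score per item.
    return {"scores": {c: int(c.split("-")[1]) for c in ctx["candidates"]}}


def rank_desc(ctx: dict) -> dict:
    return {"results": sorted(ctx["scores"], key=ctx["scores"].get, reverse=True)}


def rank_asc(ctx: dict) -> dict:
    # Experimental variant of the ranking operator.
    return {"results": sorted(ctx["scores"], key=ctx["scores"].get)}


def run_flow(operators: Dict[str, Operator],
             edges: Dict[str, Set[str]],
             request: dict) -> dict:
    """Run operators in dependency order, merging each output into one context."""
    ctx = dict(request)
    for name in TopologicalSorter(edges).static_order():
        ctx.update(operators[name](ctx))
    return ctx


# Each node maps to the set of nodes it depends on.
EDGES = {"score": {"recall"}, "rank": {"score"}}
BASELINE = {"recall": recall, "score": score, "rank": rank_desc}
# An experimental stream reuses the same graph and replaces one operator.
EXPERIMENT = {**BASELINE, "rank": rank_asc}

baseline_results = run_flow(BASELINE, EDGES, {"query": "shoe"})["results"]
experiment_results = run_flow(EXPERIMENT, EDGES, {"query": "shoe"})["results"]
```

Because the graph is plain data, a new experimental stream in this sketch is a new dictionary rather than a new service, which is one way to read the iteration speed the paragraph describes.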
The key service components comprising AI•OS are powerful in large part because they can evolve to serve a diverse range of algorithmic scenarios and technology products, where simple mechanical combinations would not suffice. The engines’ graph-based foundation, and especially their ability to quickly assemble and deploy concurrent streams, owes much to efforts in generic abstraction for big data online services, which require eventual consistency for data updates within a timeframe of just seconds. Specifically, this depends on the Suez online service framework, which unifies work happening in three dimensions: index storage (for full-text search, graph search, and deep learning models), index management (for full, incremental, and live updates), and service management (for eventual consistency, traffic switching, scaling up and down, repartitioning, and related functions).
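To make the index management dimension more concrete, here is a minimal sketch of how full, incremental, and live updates might converge on the same state across replicas. It assumes a last-writer-wins rule keyed on a sequence number; the `IndexReplica` class and its methods are hypothetical and do not describe how Suez actually works.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class IndexReplica:
    # doc_id -> (sequence number, value); a value of None marks a deletion.
    docs: Dict[str, Tuple[int, Optional[str]]] = field(default_factory=dict)

    def apply(self, doc_id: str, value: Optional[str], seq: int) -> bool:
        """Incremental or live update: keep only the write with the highest seq."""
        current = self.docs.get(doc_id)
        if current is not None and current[0] >= seq:
            return False  # stale or duplicate update, ignored
        self.docs[doc_id] = (seq, value)
        return True

    def load_full(self, snapshot: Dict[str, str], seq: int) -> None:
        """Full build: install a snapshot stamped with the seq it covers."""
        for doc_id, value in snapshot.items():
            self.apply(doc_id, value, seq)

    def view(self) -> Dict[str, str]:
        """Documents visible to queries (deletions filtered out)."""
        return {d: v for d, (_, v) in self.docs.items() if v is not None}


snapshot = {"p1": "red shoe", "p2": "blue hat"}
updates = [("p1", "red shoe, on sale", 11), ("p2", None, 12), ("p3", "green bag", 13)]

# Replica A sees the full build first, then updates in order.
a = IndexReplica()
a.load_full(snapshot, seq=10)
for doc_id, value, seq in updates:
    a.apply(doc_id, value, seq)

# Replica B sees the updates in reverse order and the full build last.
b = IndexReplica()
for doc_id, value, seq in reversed(updates):
    b.apply(doc_id, value, seq)
b.load_full(snapshot, seq=10)
```

In this sketch both replicas end with the same visible documents even though they received the writes in different orders, which is the convergence property that eventual consistency asks for.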
Realizing iGraph and other service components in all three dimensions has demanded a minimum of three years’ concentrated effort each, even where most code is shared with other components. Even so, preparing the components is only one of many requirements for online services, given that frequent business iterations are bound to occur at the graphical computation level. Most recently, the team has needed to commit every effort to migrating iGraph to the Suez framework without regard for cost to see AI•OS through to final completion.
Wading Deeper: Hippo, AIOps, and End-to-End Intelligence
Within the AI•OS system, Hippo is responsible for scheduling physical cluster resources. It is in this area that enabling platform containers and isolation technology meet search function engineering. Hippo is also a bridgehead where the model training framework PAI-TF and the real-time computing framework Blink can become systemically acquainted by way of aspect-oriented programming (AOP). Today, recommendation and search training tasks run on Hippo’s co-location resource pool. At peak algorithm demand, these tasks occupied as many as 2,000 hundred-core machines running at full capacity, with a seven-day average of 1,300 machines. These resources were obtained for free, and the value that these jobs created is beyond estimation.
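For a sense of scale, the machine counts above can be converted into core counts. The short calculation below assumes that "hundred-core" means exactly 100 cores per machine, which the article does not state precisely, so the results are rough orders of magnitude only.

```python
# Figures from the paragraph above; the per-machine core count is an assumption.
CORES_PER_MACHINE = 100
PEAK_MACHINES = 2000
AVG_MACHINES_7D = 1300

peak_cores = PEAK_MACHINES * CORES_PER_MACHINE    # cores in use at peak
avg_cores = AVG_MACHINES_7D * CORES_PER_MACHINE   # seven-day average cores in use
avg_to_peak = AVG_MACHINES_7D / PEAK_MACHINES     # average as a fraction of peak

print(f"peak: {peak_cores:,} cores")
print(f"7-day average: {avg_cores:,} cores ({avg_to_peak:.0%} of peak)")
```

Under that assumption, the training workload peaks at roughly 200,000 cores and averages about 130,000 cores, or 65 percent of peak, all drawn from the co-location pool.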
AI•OS itself is also a setting where prediction and optimization algorithms play a full role, and AIOps is where they come together. Once the metrics service KMon had achieved reliable real-time monitoring at second-level granularity, TPP was able to raise the load limit of AJDK. Following the successful elastic scaling of stateless service components, AIOps could go on to promote elastic strategies for most engine service components in the Hippo pool, to the point that even the momentous scale of operations on Alibaba’s 11–11 shopping festival struggled to reach 50 percent of peak load.
AI•OS is able to close the algorithm iteration loop within its own system thanks to two jewels embedded in Mobile Taobao: the search backend and the device. This amounts to a complete fusion of algorithm engineering products with main stage activity between involved parties. Efficient product iteration and a well-developed experimentation mechanism, coupled with the framework’s support system, continue to open new horizons. In recent years, the direction of device intelligence has gradually become clearer, helping Alibaba’s Pailitao image search function reach tens of millions of unique visitors (UVs), for instance, and feeding technology back into Mobile Taobao to open new areas for development in the AI•OS system.
Developing AI•OS through Productization
Among other important contributions, AI•OS’s deep-rooted productization allows the Alibaba group to be the backbone of its own enabling platform technologies. TPP, TisPlus, and OpenSearch, which are highly targeted enabling platform products for recommendation and search functions, make both the big data scenarios and basic search services at the core of many business units possible. In context of globalization, this means the AI•OS system does not require customized development for worldwide deployment. Thus, the enabling platform has a distinct technical advantage.
Expanding the cloud is not only an opportunity but also a core mission and ultimate destination of AI•OS’s productization. OpenSearch and ES (infrastructures based on the AI•OS system) have now been deployed globally and grown into search products driving tens of millions of yuan in revenue. Soon to follow, a smart recommendation product called AIRec is being readied for the market, and next year Alibaba’s public cloud big data product matrix is expected to achieve a significant revenue breakthrough.
In review, the cornerstone of the AI•OS system is Hippo, which defines the hard boundary for resources in the system. These resources are necessary for developing online services. Anything that supports co-location to form a win-win situation in the resource perspective is considered a friend of the system (as, for example, PAI-TF is). At present, Alibaba is also working to expand Hippo’s boundary, which will then be merged with Yarn and even its pool. On top of Hippo is Suez, which is the basic framework for big data online services in the system. Anything which supports Suez amounts to a functioning member of the system, not only greatly reducing operation and maintenance costs but also participating in the flexible scaling of AIOps to further enhance system efficiency. Furthermore, members with graphical capabilities become core members of the deep learning online service system, and are free to expand into all possible business scenarios. In the future, the group hopes improvements in the full-graph engine and offline efficiency can match improvements in the iterative efficiency of algorithms.
From Hippo and Suez (iGraph) to graphical engines (RTP, HA3, and DII), search backend to the Mobile Taobao app, and even AIOps and major technology products such as TPP, TisPlus, and OpenSearch, the core of all these developments is the iterative efficiency of optimization algorithms, which is in turn the essence of the AI•OS system.
Rather than being a standalone system, AI•OS has a strong relationship with a number of complementary frameworks to deliver the functionality and continued development the Alibaba Group requires of it. Viewed individually, these connections can help to illustrate the trajectory of AI•OS and Alibaba’s operations in their present state, as well as pointing toward future developments.
AI•OS and Algorithms
In responding to big data business challenges, AI•OS is able to play a role of at most 30% in any solution, with algorithms handling an additional 30% and products and opportunities accounting for the remainder. However, the 30% done by AI•OS is an essential prerequisite for those other solutions. Unfortunately, this has often gone overlooked, as happened in the early days of Taobao’s search function and more recently with Mobile Taobao product recommendations. Few technical fields present the kinds of circumstances that surround AI•OS and algorithm development, where the iterative efficiency of optimization algorithms determines the outcome of any scenario.
AI•OS and Blink
On its way to becoming a universal real-time computing engine, Blink underwent extensive incubation inside the early AI•OS framework. The relationship between these two technologies hinges on the concept of real-time computing, as engine services in the AI•OS system all require consistent data updates at intervals of several seconds, while Blink is ideally suited to the technical challenges AI•OS scenarios present. For these reasons, Blink developers place a high value on AOP, while AI•OS developers strongly advocate for Blink in co-location, implement it in Hippo, and merge it with Yarn and its pool. The complementary features in AI•OS and Blink are second only to AI•OS itself and key algorithms, in terms of importance to the Alibaba ecosystem.
AI•OS and PAI
At one time, PAI was intended to operate independently, which proved impossible because it was incompatible with the rigid demands of the AI•OS system, especially those of Hippo’s co-location resource pool, despite its potential to play an important role between Blink and AI•OS. Fortunately, the three development teams involved were able to reach a consensus on how the work should be divided. After giving up its own resource pool, PAI-TF successfully supported all model training tasks for search and recommendation algorithms, and also supported AI•OS’s graph execution engine. In the future, PAI-TF will play a larger role at the core of AI•OS development.
Comparing Blink and PAI, and reviewing AI•OS’s development trajectory, a clear pattern emerges: AI•OS first served as a development infrastructure for the top customers within the Alibaba group before undergoing productization to serve medium and long-tail customers, and finally underwent further productization to evolve into a cloud service. Blink was born from AI•OS and served top customers well in efficiency optimization for real-time computing. SQL then developed to guide productization efforts for customers in the medium and long tail, also contributing to unification within the group and, more recently, development for the cloud. In contrast, PAI only proved useful for serving customers in the medium and long tail. Moreover, it was by no means self-sufficient, given that a number of top group customers had to rely on their own training platforms. The main reason for its shortcomings is that PAI was not able to support the iterative needs of top group customers. Today, PAI-TF is evolving to suit the AI•OS system, which will cause a substantial change in the paradigm. When fully implemented, PAI will be able to serve top customers as well as those in the medium and long tail, and from this the group’s unified deep learning training platform should emerge naturally.
Zooming out: AI•OS and Graph Computing
As a part of numerous theories applicable to offline scenarios like iterative computing, graph computing is emerging as a leading field in computing engine science. Whereas the pursuit of faster verification in the field of online services is a given, classic benchmark implementation in big data technology is much rarer. As to why this is happening, one possible reason is a lack of sufficient technical capabilities in the industry. The corresponding academic craze is more understandable, as graph theory is such a classic of computing that established experts are bound to be captivated by it, while the lack of benchmarks in the industry will also tend to stimulate fervor among researchers. Nevertheless, most big data business scenarios are not typical graph computing issues once completely abstracted. For example, abstracting AI•OS yields the rapid customization of a computation flowchart, which is at most a generalized graph computing model.
Still, on top of the AI•OS system, traditional graph computing technology does present considerable room for growth. iGraph and even the whole system are ripe for replacement, but before they can be overturned, online services with eventual consistency for second-level data updates need to be fully understood, from Hippo to Suez. Should the technology be integrated directly into the system and quickly implemented on iGraph or Suez? Should it be made compatible with systems like PAI? Or should it be independent of the AI•OS system and built from scratch? These choices themselves determine the results of such efforts.
Online analytical processing (OLAP) is similar to graph computing, and faces similar dilemmas in the process of going online. Such online services defined by eventual consistency are built independently of AI•OS, which means they need to open up an independent resource pool and provide sufficient unique value. This essentially leaves online transaction processing (OLTP) as the last frontier for AI•OS. Since its requirements for data update consistency are even higher, this is something that cannot be attempted with anything but the utmost focus.
Importantly, graph embedding (popular within and outside Alibaba) is unrelated to graph computing from the perspective of online services. This technology is called vector recall, and is a generalized application of image retrieval. This technology’s implementation within the Alibaba group is most prominent in the Machine Intelligence Lab at Alibaba’s DAMO Academy and has become part of the AI•OS system’s capabilities.
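Vector recall, as named above, amounts to retrieving the items whose embedding vectors lie closest to a query vector. The sketch below illustrates the idea with brute-force cosine similarity over a toy index. The item names, vectors, and the `vector_recall` function are hypothetical, and production systems typically rely on approximate nearest-neighbor indexes rather than the exhaustive scan shown here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_recall(query: Sequence[float],
                  index: Dict[str, Sequence[float]],
                  k: int) -> List[Tuple[str, float]]:
    """Return the k items most similar to the query, best match first."""
    scored = [(item_id, cosine(query, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; in practice these would come from a trained model.
toy_index = {
    "sneaker": [0.9, 0.1, 0.0],
    "boot": [0.8, 0.3, 0.1],
    "teapot": [0.0, 0.2, 0.9],
}
top_items = vector_recall([1.0, 0.2, 0.0], toy_index, k=2)
```

The same routine works whether the vectors come from images, as in image retrieval, or from graph embeddings, which is why the paragraph treats the online serving problem as separate from graph computing itself.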
Having become an indispensable support for operations throughout the Alibaba ecosystem, AI•OS’s outlook for future development is limited only by the scope of the Alibaba Group’s ambitions for big data online services.
(Original article by Shen Jiaxiang 沈加翔)