Alibaba Tech

Oct 7, 2020


Alibaba Cloud LakeHouse: An Industry-Leading Next-Generation Big Data Platform of Alibaba Cloud to Integrate Data Warehouses and Data Lakes

Catch the replay of the Apsara Conference 2020 at this link!

By Alibaba Cloud MaxCompute

On September 18, at the Apsara Conference 2020, Alibaba Cloud officially launched the next-generation big data platform, Alibaba Cloud LakeHouse. It integrates data warehouses and data lakes and allows data to flow freely between data warehouses and data lakes for data computing, building a complete and organic big data ecosystem. This new-generation big data platform combines the flexibility of data lakes and the maturity of data warehouses. It helps enterprises reduce the overall cost of building a big data platform.

Since the beginning of this century, big data technology has evolved along two lines: data warehouses and data lakes. The former usually refers to an integrated service built on big data technology and provided by a cloud vendor. The latter is a big data solution usually composed of a series of cloud services or open-source components.

When an enterprise is in its initial stage, flexibility matters most, so the data lake architecture is the better fit. As an enterprise matures, development becomes the most critical factor, so the data warehouse architecture becomes more suitable. Do enterprises have to choose between data lakes and data warehouses? Is there a solution that combines the flexibility of data lakes with the maturity of data warehouses?

Jia Yangqing, the Vice President of Alibaba Group and Senior Fellow of Compute Platform, said, “Alibaba Cloud LakeHouse integrates the flexibility and rich ecosystem of data lakes with the enterprise-grade capabilities of data warehouses. This allows enterprises to build a new computing platform that integrates data lakes and data warehouses. Alibaba Cloud LakeHouse not only supports large-scale machine learning and deep learning, but also helps enterprises efficiently improve their big data capabilities, achieve agile operations, reduce costs, and improve efficiency.”

Building on its original data warehouse architecture, MaxCompute integrates data warehouses, which use a unified storage and computing architecture, with data lakes, which separate cloud storage from computing. The result is a single architecture that spans both data warehouses and data lakes. Although multiple storage systems coexist at the underlying layer, a unified storage access layer and unified metadata management present one encapsulated interface to the upper-layer engines. You can join a table in a data warehouse with a table in a data lake. In addition, the overall architecture provides a unified mid-end for data security and data management.
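To make the idea of a unified access layer more concrete, here is a minimal Python sketch. It is not MaxCompute code and does not use any Alibaba Cloud API. The names UnifiedCatalog, TableEntry, and hash_join, the table names, and the sample rows are all hypothetical and exist only for illustration. The sketch shows how a catalog that records where each table lives can let an engine join a "warehouse" table with a "lake" table without knowing which storage system backs either one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class TableEntry:
    """Metadata for one table: its name, where it lives, and how to read it."""
    name: str
    storage: str  # illustrative label only, e.g. "warehouse" or "lake"
    reader: Callable[[], Iterable[Row]]


class UnifiedCatalog:
    """Hypothetical stand-in for unified metadata management."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableEntry] = {}

    def register(self, entry: TableEntry) -> None:
        self._tables[entry.name] = entry

    def scan(self, name: str) -> Iterable[Row]:
        # Stand-in for a unified storage access layer:
        # the caller never sees which backend serves the rows.
        return self._tables[name].reader()


def hash_join(catalog: UnifiedCatalog, left: str, right: str, key: str) -> List[Row]:
    """Inner-join two catalog tables on `key`, regardless of their storage."""
    index: Dict[object, List[Row]] = {}
    for row in catalog.scan(right):
        index.setdefault(row[key], []).append(row)
    joined: List[Row] = []
    for row in catalog.scan(left):
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Assumed sample data: one table "in the warehouse", one "in the lake".
catalog = UnifiedCatalog()
catalog.register(TableEntry(
    "orders", "warehouse",
    lambda: [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 15}],
))
catalog.register(TableEntry(
    "click_logs", "lake",
    lambda: [{"user_id": 1, "page": "/home"}, {"user_id": 3, "page": "/cart"}],
))

result = hash_join(catalog, "orders", "click_logs", "user_id")
print(result)
```

In this sketch, the join function depends only on the catalog, so adding another storage system would mean registering a new reader rather than changing the engine. A real platform would also have to handle schemas, permissions, and pushing computation down to each backend, all of which are omitted here.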

To achieve this integration, MaxCompute provides four key technologies: fast integration, unified data and metadata management, a unified development experience, and automatic data warehousing. It also continues to improve its core performance. In the 2020 TPCx-BB 100 TB benchmark (Intel Xeon Scalable Processor), MaxCompute reduced costs by 40%. In the 2020 TPCx-BB 30 TB benchmark (Intel Xeon Scalable Processor), performance improved by more than 50% and cost was reduced by more than 30%.

Weibo was an early adopter of Alibaba Cloud LakeHouse. Previously, Weibo used Hadoop data lakes and Alibaba Cloud data warehouses that were completely separated at the cluster layer, so data could not flow freely between them to support data computing. To solve these problems, Weibo built an AI computing mid-end on Alibaba Cloud that integrates data lakes and data warehouses. This mid-end avoids the huge burden of data migration and allows Weibo's data engineers and algorithm engineers to easily use Alibaba's proven large-scale computing capabilities and algorithms to improve business efficiency. The MaxCompute cloud-based data warehouses (structured data) and data lakes (unstructured data) form a closed loop, which greatly improves AI operation efficiency and produces great business value.

After nearly ten years of technical accumulation, MaxCompute, Alibaba Cloud's proprietary cloud-based data warehouse solution, stably supports the data storage and data computing services of Alibaba Group and is an important part of the big data platform for customers on the cloud. The release of Alibaba Cloud LakeHouse gives enterprises a more flexible, efficient, and cost-effective option, whether they are building a new big data platform or upgrading the architecture of an existing one. This accelerates the digital restructuring of enterprises.
