Data lakes or data warehouses? Alibaba puts forward a new take on big data architecture: integration of data lakes and data warehouses to provide the data lakehouse solution
Authors | Tao Guan, Ruibo Li, Lili Sun, Liangmo Zhang, and Yangqing Jia from the Computing Platform Division of Alibaba Cloud Intelligence Business Group
Bo Huang, Yumei Jin, Qian Yu, and Zizheng Liu from the Machine Learning R&D Division of Sina Weibo
As the concept of the data lake has emerged in recent years, data warehouses and data lakes have been continuously compared and even heavily debated within the industry. Some people believe that the data lake is the next-generation big data platform. Major cloud vendors are proposing their own data lake solutions, and some cloud data warehouse products have added features for compatibility with data lakes. But what is the difference between data warehouses and data lakes? Does the difference lie in technical routes, or in data management methods? Are data lakes and data warehouses incompatible, or can they coexist in harmony and even complement each other? The authors of this article work in the Computing Platform Division of Alibaba and are deeply involved in the construction of the big data and data mid-end fields. This article analyzes the ins and outs of data lakes and data warehouses from a historical perspective to illustrate a new direction for their integration and evolution: the data lakehouse. It also elaborates on the data lakehouse solution developed based on Alibaba Cloud MaxCompute, E-MapReduce (EMR), and Data Lake Analytics (DLA).
1.1 20 years of big data development
The big data field has been around since the start of this century, and is still going strong. The development of big data shows the following patterns from a macro point of view:
1) Data maintains rapid growth. The big data field is sustaining a strong growth trajectory in terms of volume, velocity, variety, value, and veracity (the 5 Vs). Alibaba, as a company that makes heavy use of big data and invests heavily in its development, has seen data volumes grow at an annual rate of 60% to 80% over the past five years. This growth rate is expected to continue into the foreseeable future, and some emerging enterprises see annual data growth rates as high as 200%.
2) Big data is widely recognized as a new factor of production. The value proposition of big data has shifted from exploration to inclusiveness. Big data has become a core part of enterprises and governments, undertaking key tasks. For example, at Alibaba, 30% of employees directly work with big data jobs. The adoption of big data in production environments drives enhancements in enterprise-grade product strengths such as reliability, security, control, and ease of use.
3) Data management capabilities come into focus. Data warehouse (mid-end) capabilities are gaining popularity, and making good use of data has become essential to the competitiveness of an enterprise.
4) Engine technologies have entered a convergence stage. Spark (general-purpose computing), Flink (stream processing), HBase (key-value storage), Presto (interactive analysis), Elasticsearch (search), and Kafka (data bus) gradually came to dominate the open-source ecosystem from 2010 to 2015. Over the past five years, the number of new open-source engines has gradually decreased, while existing engine technologies have developed more depth, bringing benefits such as better performance and production-grade stability.
5) Platform technology has evolved towards two trends: data lakes and data warehouses. Both focus on data storage and management but follow different paths to implementation.
1.2 A look at data lakes and data warehouses based on the development of big data
The concept of the data warehouse appeared much earlier than that of the data lake, and can be traced back to the 1990s, when databases played a dominant role. To understand when these concepts appeared, where they originated, and why data lakes and data warehouses developed the way they did, we must look at the historical context. In general, the development of data processing technology in computer science can be divided into four stages:
Stage 1: the database era. The concept of the database came into existence in the 1960s, followed by the relational database, which was invented in the 1970s and flourished for the next 30 years. Many excellent relational databases, such as Oracle, SQL Server, MySQL, and PostgreSQL, were developed and became an integral part of mainstream computer systems. In the 1990s, the concept of the data warehouse was introduced as a methodology to manage multiple database instances within an enterprise. However, the limited processing capabilities of standalone databases and the persistently high prices of distributed databases with database and table sharding kept data warehouses out of reach for ordinary enterprises and users. People even argued about which was more feasible: data warehouses that are managed in a centralized manner, or data marts that are department- or domain-specific and managed in a decentralized manner.
Stage 2: the exploration of big data. At the turn of the millennium, the Internet was experiencing explosive growth. During this period, page views and user clicks proliferated into the billions or tens of billions. This ushered in a new era of global data explosion. Traditional database solutions were no longer able to provide the required computing power at acceptable costs. Huge unsatisfied data processing requirements gave rise to the big data era. In 2003, 2004, and 2006, Google published three groundbreaking papers: GFS, MapReduce, and BigTable. These papers laid down the basic technical frameworks for the big data era: distributed storage, distributed scheduling, and distributed computing models. Then, brilliant distributed systems, including Google's internal infrastructure, Microsoft Cosmos, open-source Hadoop, and Alibaba Apsara, were released around the same time. At that time, people were too excited by the ever-increasing volumes of data being processed to consider the merits of data warehouses or data lakes.
Stage 3: the development of big data. In the second decade of the 21st century, more and more resources were invested in the big data computing field, and big data technology made remarkable progress, going from usable to easy to use. Various Structured Query Language (SQL) computing engines sprang up to replace costly handwritten MapReduce jobs. These computing engines were purposely designed for different scenarios but all used the easy-to-understand SQL. This significantly reduced the cost of using big data technology. The centralized data warehouses that people dreamed of in the database era became a reality, and the methodologies adopted in that era started to take off. In this stage, technical routes began to split. Integrated systems provided by cloud vendors are called data warehouses in the big data era; examples include AWS Redshift, Google BigQuery, Snowflake, and MaxCompute. Open-source Hadoop systems, with their open file formats, open metadata services, and open HDFS storage, combined with the use of multiple engines (such as Hive, Presto, Spark, and Flink), formed the rudiments of data lakes.
Stage 4: the popularization of big data. Big data is no longer rocket science. It has penetrated all walks of life and is becoming ever more popular. In addition to scale, performance, and ease of use, the market now has more enterprise-grade requirements for big data products, such as cost effectiveness, security, and stability.
· Among open-source Hadoop products, foundational components such as engines, metadata services, and storage systems have gone through steady iteration. The public is more aware of open-source big data technologies than ever before, and open-source architecture has earned substantial market share due to its convenience. However, the loose nature of open-source architecture creates bottlenecks when open-source solutions are used in enterprise-grade scenarios. The coordination among data security, strong identity and access control, and data governance is especially inefficient. For example, Ranger (an access control component) and Atlas (a data governance component) still cannot be applied to all mainstream engines today. The evolution of the engines themselves also poses challenges to the existing architecture: closed-loop designs such as Delta Lake and Hudi have changed the existing architecture of one storage system, one metadata service, and multiple engines.
· It was AWS that popularized the concept of the data lake. AWS provides an open and collaborative solution that encompasses a variety of products, with S3 as the centralized storage service, Glue as the metadata service, and EMR and Athena as the engines. The openness of this solution resembles that of an open-source system. In 2019, Lake Formation was launched to enable authorization among the products in the solution. Although the solution cannot yet compete with cloud data warehouse products that have proven to meet the needs of large enterprises, it is still highly attractive to users of open-source technologies due to its similar architecture. Following AWS, other cloud service providers created their own data lake solutions.
· Data warehouse products advocated by cloud service providers have developed rapidly, with growing core capabilities. These products have seen major boosts in performance and huge cuts in costs. For example, MaxCompute has received a comprehensive upgrade of its core engine, improving its performance year after year and breaking the TPCx-BigBench world record three years in a row. Data management capabilities, such as data mid-end modeling and intelligent data warehousing, are stronger than ever before. The products have also made significant progress in enterprise-grade security capabilities, marked by column-level fine-grained authorization, trusted computing, storage encryption, data masking, and support for multiple authorization models such as ACL-based and rule-based models. Federated computing has also been improved: data warehouses have begun to manage data that is not stored locally, thereby blurring the line between data warehouses and data lakes.
In conclusion, the concept of the data warehouse came into existence in the database era and has evolved with the various data warehouse services launched by cloud service providers in the big data era. Currently, the concept typically refers to integrated services offered by cloud service providers based on big data technologies. Data lakes were born out of open-source technologies in the big data era. After the integration and promotion efforts of AWS, data lakes now typically refer to big data solutions that are built from a set of cloud services or open-source components.
What are data lakes?
Although the concept of data lake has been a heated topic in recent years, its definition has not been unified. The following section describes the definition of data lakes.
The following definition is provided on Wikipedia:
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, and JSON), unstructured data (emails, documents, and PDFs) and binary data (images, audio, and video). A data lake can be established “on-premises” (within an organization’s data centers) or “in the cloud” (using cloud services from vendors such as Amazon, Google and Microsoft). A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or is providing little value.
The following definition provided by AWS is more concise:
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
The versions by other cloud service providers will not be elaborated on here.
The definition of a data lake can be boiled down to the following key points:
1. A unified storage system.
2. Storage of raw data.
3. Varied computing models or paradigms.
4. Independence from deployment location, whether on-premises or in the cloud.
Hadoop Distributed File System (HDFS), which stores raw data in a centralized manner, is a typical data lake architecture. These days, the data lake concept is often discussed in a narrower sense: a cloud-hosted storage system that uses an architecture in which computing and storage are separated. Data lakes built on AWS S3 or Alibaba Cloud OSS are typical examples.
The evolution of the architecture of data lakes can be divided into three stages, as shown in the following figure.
Stage 1: Self-managed open-source Hadoop data lake architecture. This architecture stores raw data in HDFS, uses open-source ecosystem components such as Hadoop and Spark, and couples storage and computing resources on the same machines. On the downside, it incurs high costs and leads to cluster instability, because enterprises must perform O&M on their own.
Stage 2: Cloud-hosted Hadoop data lake architecture (or EMR). The underlying physical servers and open-source software versions are provided and managed by cloud service providers. Data is still stored in HDFS, and Hadoop and Spark are still the main engines. This architecture makes operations at the machine level more flexible and stable by using the IaaS layer on the cloud, which reduces the overall O&M costs. However, enterprises still need to perform O&M at the application layer, such as managing and governing the operating status of HDFS and its services. Because of the coupling of storage and computing resources, this architecture does not provide optimal stability. The cost is also not minimized, because these two types of resources cannot be scaled independently of each other.
Stage 3: Cloud-based data lake architecture. This architecture is deployed entirely on the cloud, with cloud object storage replacing HDFS as the storage infrastructure for data lakes. More engines have been added to the architecture: in addition to Hadoop and Spark, cloud service providers have developed other engines for data lakes. For example, AWS Athena and Huawei DLI are analytical data lake engines, and AWS SageMaker is an AI-oriented data lake engine. This architecture still consists of one storage system and multiple engines. For unified metadata, AWS launched Glue, and Alibaba Cloud EMR will soon release its unified metadata service for data lakes. This architecture has the following advantages over the HDFS architecture:
· It eliminates the need for users to perform O&M on the native HDFS. HDFS O&M is difficult due to the higher stability requirements and higher O&M risks of a storage system compared with a computing engine, as well as the limited scalability caused by the hybrid deployment of computing and storage resources. The storage-computing separation architecture helps users decouple storage from computing and leave the O&M of storage to the cloud service providers. This helps resolve stability and O&M issues.
· The separated storage system can be independently scaled and no longer needs to be coupled with the computing system, which can reduce the overall cost.
· After a user adopts this data lake architecture, storage is fully centralized, eliminating the problem of HDFS data silos.
The following figure shows the architecture of Alibaba Cloud EMR, which is a big data platform based on an open-source ecosystem. EMR supports both HDFS open-source data lakes and OSS data lakes on the cloud.
Enterprises use data lake technologies to build big data platforms that provide features such as data access, data storage, computing and analysis, data management, and access control. The following figure shows a reference architecture defined by Gartner. Because of the flexibility and openness of its architecture, current data lake technology is not fully mature in performance efficiency, security control, or data governance, so it still has a long way to go to meet enterprise-grade production requirements.
Birth of data warehousing and its relationship with data mid-ends
Data warehousing emerged out of the database field with the goal of helping enterprises better query and analyze data.
As big data technologies grew in popularity, a large number of database technologies, such as SQL and query optimization, were carried over to form big data warehouses. Data warehouses are now the most popular solution due to their robust data analysis capabilities. In recent years, data warehouses have been integrated with cloud-native technologies, giving rise to cloud data warehouses. This enables enterprises to easily obtain the resources required to deploy data warehouses. Cloud data warehousing provides high-level capabilities and has attracted more and more attention due to features such as out-of-the-box ease of use, nearly unlimited scalability, and simple O&M.
The definition of a data warehouse on Wikipedia:
In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence.
Data warehouses are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place. The data can be used to create analytical reports for workers throughout an enterprise. The term Data Warehouse was coined by W.H. Inmon in 1990. He is recognized by many as the father of data warehousing. Data warehouses collect data accumulated through online transaction processing (OLTP) and use the data storage architecture of data warehouses to analyze and arrange the data. This enables various analysis methods, such as online analytical processing (OLAP) and data mining, and further supports the creation of decision support systems (DSS) and executive information systems (EIS). As a result, decision makers can quickly and effectively extract valuable information from large amounts of data, facilitating decision-making and fast responses to changes in the external environment and helping build business intelligence (BI).
The following three aspects are essential for a data warehouse:
1. A built-in storage system is used. Data is provided in an abstract manner. For example, you can use tables or views in SQL without exposing the file system.
2. Data must be cleansed and transformed by performing the Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) process.
3. Modeling and data management are crucial to intelligent business decision-making.
Based on the preceding criteria, both traditional data warehouses such as Teradata and emerging cloud data warehousing services such as AWS Redshift, Google BigQuery, and Alibaba Cloud MaxCompute embody these design essentials. None of them exposes its file system externally; instead, they provide service interfaces for importing or exporting data. For example, Teradata provides a command-line interface (CLI) tool for importing data, Redshift provides the COPY command for loading data from Amazon S3 or EMR, BigQuery provides the BigQuery Data Transfer Service, and MaxCompute provides the Tunnel service and the MaxCompute Migration Assist (MMA) data migration tool for uploading and downloading data.
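As a sketch of what such a service interface looks like in practice (the table name, bucket path, and IAM role below are hypothetical), loading data into Redshift from S3 goes through the COPY command rather than direct file-system access:

```sql
-- The warehouse ingests files through its service interface
-- instead of exposing its internal file system to users.
COPY sales_fact
FROM 's3://example-bucket/sales/2020/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
FORMAT AS PARQUET;
```

The other products listed above follow the same pattern with their own tools (Teradata's CLI loader, MaxCompute's Tunnel, BigQuery's Data Transfer Service): data crosses a managed boundary, which is what enables the optimizations and governance capabilities listed next.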
This design can bring several advantages:
1. Helps engines process data and enables deeper optimization across storage and computing.
2. Manages the whole lifecycle of data and builds a comprehensive lineage system.
3. Provides fine-grained data management and governance.
4. Provides comprehensive metadata management capabilities, making it easy to build an enterprise-grade data mid-end.
Therefore, during the initial construction of the Alibaba Cloud Apsara big data platform, MaxCompute was selected as the data warehousing architecture. MaxCompute (previously known as ODPS) is not only a big data platform within the Alibaba economy, but also a secure, reliable, high-performance, and low-cost online big data computing service that provides on-demand scaling of data volumes from gigabytes to exabytes. Figure 6 shows the architecture of MaxCompute. You can go to the official website of Alibaba Cloud MaxCompute to learn more. MaxCompute is an enterprise-grade cloud data warehouse based on the Software-as-a-Service (SaaS) model, and is widely applied within the Alibaba economy and by thousands of Alibaba Cloud customers in fields such as the Internet, new finance, new retail, and digital government.
Based on the architecture of MaxCompute, Alibaba gradually built its data security system and added capabilities in data quality, data governance, and data tagging to form a big data mid-end. Alibaba was the first to propose the concept of the data mid-end, and built it on the data warehousing architecture.
Data lakes vs. data warehouses
In summary, data warehouses and data lakes represent two different approaches to building big data architectures. They differ mainly in storage system access, permission management, and modeling requirements.
If a data lake architecture is used, the underlying file storage is opened up to maximize the flexibility of data storage. Data lakes can store structured, semi-structured, and unstructured data. In addition, open storage makes the upper-layer engines more flexible: different engines can freely access data stored in data lakes based on their specific scenarios, following only fairly loose compatibility conventions. However, such loose conventions carry risks, which are discussed later. At the same time, direct access to the file system makes many higher-level features difficult to implement, for example, fine-grained (finer than file-level) permission management, centralized file management, and upgrades of read/write interfaces. To upgrade a read/write interface, you must upgrade every engine that accesses the files.
However, if a data warehousing architecture is used, the enterprise-grade requirements for data usage efficiency, large-scale data management, and security and compliance can be fulfilled more easily. Data is loaded to a data warehouse through a unified but open service interface. The data schema is typically defined in advance, and users access the files stored in the distributed storage system by using data service interfaces or computing engines. A data warehousing architecture allows more efficient management of data access interfaces, permissions, and data. At the same time, it can provide higher storage and computing performance, a closed-loop security system, and better data governance capabilities. These capabilities are needed to allow enterprises to obtain the value of big data and continue to use big data into the future.
The following figure shows the trade-offs of data lakes and data warehouses in big data technology stacks.
Flexibility and maturity are of different importance to enterprises in different development stages.
1. An enterprise that is at the startup stage needs innovation and exploration before it can gradually settle down from data generation to consumption. Therefore, flexibility is of greater importance to big data systems that are used to support such business. In this case, a data lake architecture is more suitable.
2. When the enterprise gradually matures and settles into a series of data processing procedures, the enterprise faces problems such as the continuous growth of data volume, the increasing cost of processing data, and the increasing number of personnel and departments involved in data processing. The maturity of the big data system used to support such business determines the future development of the business. In this case, a data warehousing architecture is more suitable.
The authors have observed that a considerable number of enterprises (especially enterprises in the emerging Internet industry) have built their big data technology stacks from scratch, moving from exploration and innovation to mature modeling along with the rise of the open-source Hadoop ecosystem.
In this process, a data lake architecture is often too flexible and lacks data supervision, control, and the necessary governance methods. As a result, O&M costs continue to increase and data governance efficiency decreases. The enterprise then falls into a "data swamp": a large amount of data is stored in the data lake, but it is difficult to efficiently refine the truly valuable data. In this case, data must be migrated to a big data platform that prioritizes data warehouses, so that the O&M, cost, and data governance problems that appear after the business grows to a certain scale can be solved. Take Alibaba's own experience as an example: the data mid-end strategy of Alibaba succeeded only after MaxCompute data warehouses completely replaced multiple Hadoop data lakes around 2015.
Next-generation big data platform: LakeHouse
This article presents an in-depth exposition and comparison of data lakes and data warehouses, which represent two different evolution routes for big data systems. Data lakes and data warehouses have their own advantages and limitations. Data lakes are more applicable to startups, whereas data warehouses are more suitable for growth enterprises. Do enterprises have to choose one over the other when it comes to data lakes and data warehouses? Can a solution be available to integrate the flexibility of data lakes and the maturity of data warehouses to achieve a lower total cost of ownership for users?
In recent years, the integration of data warehouses and data lakes has been a trend in the industry. Many products and projects have made attempts with varying degrees of success.
1. Data warehouses support access from data lakes.
· In 2017, Redshift introduced Redshift Spectrum, which allows Redshift data warehouse users to access data in S3 data lakes.
· In 2018, Alibaba Cloud MaxCompute introduced the external table feature to support access to a variety of external storage services including OSS, Tablestore, and ApsaraDB RDS.
However, both Redshift Spectrum and the external table feature of MaxCompute still require users to create external tables in the data warehouse before the open storage paths in data lakes can be incorporated into the warehouse. An open storage service is unable to describe the changes in the data it stores, so creating external tables and adding partitions (that is, creating data schemas for data lakes) cannot be fully automated. Manual operations are required, or the ALTER TABLE ADD PARTITION or MSCK REPAIR TABLE command must be triggered periodically. This is acceptable for infrequent temporary queries, but is somewhat cumbersome for production use.
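To illustrate the manual steps described above (the table name, columns, and OSS path are hypothetical), a typical Hive-style workflow for exposing a data lake path as an external table looks like this:

```sql
-- Map an object storage path into the warehouse as an external table.
CREATE EXTERNAL TABLE access_logs (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 'oss://example-bucket/access_logs/';

-- New partitions written by other engines stay invisible until they
-- are registered, either one by one ...
ALTER TABLE access_logs ADD PARTITION (dt = '20200901');

-- ... or by periodically rescanning the storage path.
MSCK REPAIR TABLE access_logs;
```

It is exactly this registration step, which the storage service cannot trigger on its own, that makes the federated approach cumbersome in production.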
2. Data lakes support the capabilities of data warehouses.
· In 2011, Hortonworks, a leading vendor in the open-source Hadoop ecosystem, started the development of two open-source projects, Apache Atlas and Ranger, intended to provide two core capabilities of data warehouses: data lineage tracing and data permission security. However, the development of the two projects was not smooth. Their incubation was not completed until 2017, and the two projects are still not widely deployed in the community or in industry to this day. The main reason is the inherent flexibility of data lakes. For example, as a component for centralized management of data permission security, Ranger requires that all engines be adapted to it to ensure that no security vulnerabilities exist. However, for data lake engines that emphasize flexibility, especially new engines, implementing features and scenarios takes priority over integrating with Ranger. This leaves Ranger in an awkward position in the data lake world.
· In 2018, Netflix open-sourced Iceberg, an internally enhanced version of its metadata service system, to provide enhanced data warehouse capabilities such as multi-version concurrency control (MVCC). However, the open-source HMS had already become a de facto standard, so the open-source Iceberg maintains compatibility by working only as a plug-in on top of HMS. As a result, its data warehouse management capabilities are greatly undermined.
· From 2018 to 2019, Uber and Databricks successively launched Apache Hudi and Delta Lake, providing incremental file formats to support data warehouse features such as UPDATE, INSERT, and transactions. These new features bring changes in file formats and organizational forms, breaking the original simple convention on shared storage among multiple data lake engines. To maintain compatibility, Apache Hudi offered support for two table types (Copy-On-Write and Merge-On-Read) and three query types (Snapshot Queries, Incremental Queries, and Read Optimized Queries), and provided a support matrix, as shown in Figure 10, which considerably increases the complexity of use.
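As a hedged sketch of how Hudi's table types surface to users (Spark with the Hudi bundle is assumed; the table and field names are illustrative, not from the original text):

```sql
-- Create a Merge-On-Read Hudi table; 'type' selects one of the two
-- table types from the support matrix above.
CREATE TABLE user_events (
  id      BIGINT,
  ts      TIMESTAMP,
  payload STRING
) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
);

-- With Hive sync enabled, an MoR table is typically exposed as two
-- views: a read-optimized view (suffix _ro) and a real-time snapshot
-- view (suffix _rt); the user must pick the right one per query.
SELECT * FROM user_events_ro;
```

The need to choose among table types, query types, and synced views per engine is precisely the complexity the support matrix documents.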
In contrast, Delta Lake chose to guarantee the experience with Spark, its main supported engine, at the expense of compatibility with other major engines. This causes restrictions and inconvenience when users access Delta data stored in data lakes with those engines. For example, for Presto to query a Delta Lake table, a manifest file must first be created in Spark, and then an external table must be created based on the manifest file; the manifest file itself also poses update problems. Hive is subject to even more restrictions when using Delta Lake tables: inconsistencies may occur at the metadata layer, and data may even fail to be written to the tables.
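The Presto workaround mentioned above can be sketched as follows (the paths, table, and column names are hypothetical); note that the manifest must be regenerated whenever the Delta table changes:

```sql
-- Step 1, in Spark SQL: generate a symlink manifest for the Delta table.
GENERATE symlink_format_manifest FOR TABLE delta.`oss://example-bucket/events/`;

-- Step 2, in Hive/Presto: create an external table over the manifest,
-- which lists the Parquet files of the current table version.
CREATE EXTERNAL TABLE events (id BIGINT, payload STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://example-bucket/events/_symlink_format_manifest/';
```

Because step 1 must be rerun after every write, a stale manifest can silently serve an outdated snapshot to Presto, which is the update problem noted above.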
These several attempts to build data warehouses on the data lake architecture were not successful, which indicates that data warehouses and data lakes differ essentially: it is difficult to build a complete data warehouse on top of a data lake system. Because directly merging data lakes and data warehouses into a single system is so difficult, the authors of this article began to explore a new approach to integrating the two. Consequently, a next-generation big data architecture is proposed: LakeHouse. LakeHouse integrates data warehouses and data lakes and allows data to flow freely between them for computing, forming a complete, organic big data ecosystem.
LakeHouse is expected to implement the following key features:
1. Data and metadata in data lakes and data warehouses are seamlessly integrated without user intervention.
2. A unified development experience is available for data lakes and data warehouses. Data stored in different systems can be managed by using a unified development and management platform.
3. The system automatically caches and moves data between data lakes and data warehouses, and can determine which data should be stored in the warehouse and which in the lake.
The following chapter describes how Alibaba Cloud LakeHouse implements these features.
Alibaba Cloud LakeHouse
6.1 Overall architecture
Based on the original data warehouse architecture, Alibaba Cloud MaxCompute integrates open source data lakes with data lakes on cloud storage to implement the architecture of integrated data warehouses and data lakes, as shown in Figure 11. In this architecture, an integrated encapsulation interface is provided for the upper-layer engine through a unified storage access layer and centralized metadata management, despite the coexistence of multiple storage systems at the underlying layer. You can join a table in a data warehouse with a table in a data lake. In addition, the overall architecture provides a unified mid-end to ensure data security and manage data.
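For example, once a Hive database is mapped into MaxCompute as an external project (the project and table names below are hypothetical, for illustration only), a lake table can be joined with a warehouse table directly through the unified storage access layer:

```sql
-- ext_hive_logs is an external project mapped from a Hive database in
-- the data lake; sales_dw is a native MaxCompute warehouse project.
SELECT d.user_id,
       d.total_amount,
       l.page_views
FROM   sales_dw.user_orders        d
JOIN   ext_hive_logs.user_behavior l
ON     d.user_id = l.user_id;
```

The engine resolves each table to its underlying storage (built-in warehouse storage or HDFS/OSS in the lake) behind the unified access layer, so the user writes a single query rather than a federated two-system workflow.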
To implement the three key features described in the previous chapter, MaxCompute provides four key technologies:
1. Fast integration
· MaxCompute provides a network connection technology called PrivateAccess. In compliance with cloud virtual network security standards, this technology enables user-specific jobs in multi-tenant mode to connect to IDC, ECS, or EMR Hadoop cluster networks with low latency and high, dedicated bandwidth.
· Data lakes can be connected to MaxCompute data warehouses with quick, simple provisioning and security configuration operations.
2. Centralized data and metadata management
· MaxCompute implements centralized metadata management. A database metadata mapping technique seamlessly integrates metadata in data lakes and MaxCompute data warehouses. MaxCompute allows users to create external projects: databases in HiveMetaStore are directly mapped to MaxCompute projects. Changes to a Hive database are reflected in the mapped project in real time, and the data in that project can be accessed and computed at any time in MaxCompute. Alibaba Cloud will also release the Data Lake Formation service for the EMR data lake solution, and the MaxCompute LakeHouse solution will support the mapping capability for this unified metadata service. MaxCompute operations on external projects are likewise reflected on the Hive side in real time, enabling seamless interaction between data warehouses and data lakes without the manual metadata interventions required in federated query solutions.
· MaxCompute implements a storage access layer that integrates data lakes and data warehouses. This layer supports not only the optimized built-in storage system but also external storage systems, including both HDFS data lakes and OSS data lakes on the cloud, and can read and write data in various open source file formats.
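The metadata mapping described above behaves like a read-through view: the external project holds no copy of the Hive metadata, so changes on the Hive side are visible immediately. The sketch below illustrates this idea with hypothetical Python classes; it is a conceptual model, not the MaxCompute external project implementation.

```python
# Conceptual sketch of database metadata mapping: an "external project" is a
# live, read-through view over one Hive database, so Hive-side changes show
# up without any synchronization step. All names are illustrative.

class HiveMetaStore:
    def __init__(self):
        self._dbs = {}  # database name -> set of table names

    def create_table(self, db, table):
        self._dbs.setdefault(db, set()).add(table)

    def tables(self, db):
        return set(self._dbs.get(db, set()))

class ExternalProject:
    """Warehouse-side view over one Hive database (no metadata copy)."""
    def __init__(self, metastore, db):
        self._ms, self._db = metastore, db

    def list_tables(self):
        # Resolved against the metastore on every call, so changes in the
        # Hive database are reflected in the mapped project in real time.
        return self._ms.tables(self._db)

ms = HiveMetaStore()
ms.create_table("weblogs", "clicks")
proj = ExternalProject(ms, "weblogs")      # map the db to an external project
before = proj.list_tables()                # only 'clicks' exists so far
ms.create_table("weblogs", "impressions")  # a change on the Hive side...
after = proj.list_tables()                 # ...is visible with no sync job
```

The design choice being illustrated is mapping rather than copying: because the view delegates every lookup, there is no synchronization lag and no manual metadata intervention of the kind federated query solutions require.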
3. Unified development experience
· Hive databases in data lakes are mapped to external projects in MaxCompute. These projects then work the same way as common projects and can use the data development, tracking, and management features of MaxCompute data warehouses. The data development, management, and governance capabilities provided by DataWorks deliver a unified development experience across data lakes and data warehouses and reduce the cost of managing two systems.
· MaxCompute is highly compatible with Hive and Spark. A set of tasks can flexibly and seamlessly run between the data lake and data warehouse systems.
· MaxCompute provides efficient data tunnel interfaces that engines in the Hadoop ecosystem of data lakes can directly access, which enhances the openness of data warehouses.
4. Automatic data caching
· Without integration, users must decide for themselves how to split and store data between data lakes and data warehouses in order to make the most of each system's advantages. MaxCompute instead provides an intelligent caching technology that identifies hot and cold data based on historical tasks. At off-peak hours, available spare bandwidth is used to cache hot data from data lakes into data warehouses in high-efficiency file formats, which accelerates subsequent data processing in the warehouse. This technology resolves bandwidth bottlenecks between data lakes and data warehouses and implements tiered data management and governance, as well as performance acceleration, without human intervention.
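The hot/cold identification idea above can be sketched as a small scheduling heuristic: count table accesses from a historical job log, treat tables above a threshold as hot, and schedule caching only during an off-peak window. The threshold, window, and names below are illustrative assumptions, not the actual MaxCompute heuristics.

```python
# Sketch of cold/hot identification for automatic lake-to-warehouse caching.
# Access counts come from a historical job log; caching is scheduled only
# during off-peak hours to respect bandwidth limits. All constants and names
# are hypothetical illustrations.
from collections import Counter

HOT_THRESHOLD = 3            # accesses in the window to count as "hot"
OFF_PEAK_HOURS = range(1, 6) # e.g. 01:00-05:59, assumed off-peak window

def hot_tables(job_log, threshold=HOT_THRESHOLD):
    """job_log: iterable of table names touched by recent jobs."""
    counts = Counter(job_log)
    return {table for table, n in counts.items() if n >= threshold}

def plan_cache_jobs(job_log, hour):
    """Return the lake tables to cache into the warehouse at this hour."""
    if hour not in OFF_PEAK_HOURS:
        return set()  # peak hours: leave the bandwidth to user jobs
    return hot_tables(job_log)

log = ["lake.clicks", "lake.clicks", "lake.clicks", "lake.users"]
daytime = plan_cache_jobs(log, hour=14)  # peak hour: nothing scheduled
night = plan_cache_jobs(log, hour=2)     # off-peak: hot tables cached
```

A real system would also weigh table size, freshness, and available bandwidth, but the core loop is the same: observe access history, classify, and move data when the system is idle.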
6.2 Build a data mid-end that integrates data lakes and data warehouses
The integration of data lakes and data warehouses in MaxCompute enables DataWorks to encapsulate data lake and data warehouse systems, block the heterogeneous cluster information of data lakes and data warehouses, and then build an integrated big data mid-end. This allows a set of data and a set of tasks to be seamlessly scheduled and managed between data lakes and data warehouses. Enterprises can use the capabilities of the data mid-end that integrates data lakes and data warehouses to optimize data management architectures and fully leverage the respective advantages of data lakes and data warehouses. Data lakes are used to provide centralized raw data storage to maximize the flexibility and openness of data lakes. The lakehouse technology is used to seamlessly schedule frequently accessed production data and tasks to data warehouses for better performance and lower costs, as well as subsequent production data governance and optimization. This allows enterprises to strike a balance between costs and efficiency.
Overall, the integration of data lakes and data warehouses in MaxCompute provides a more flexible, efficient, and cost-effective data platform solution. This solution is suitable for enterprises to build new big data platforms or upgrade architectures of existing big data platforms. This protects existing investments while implementing reuse of existing assets.
6.3 Success story: Sina Weibo uses LakeHouse to build a hybrid cloud mid-end for AI computing
· Background information
The machine learning platform team of Sina Weibo focuses on social media technologies such as recommendation, ranking, text and image classification, and anti-spam protection. Its technical architecture is deployed on an open source Hadoop-based data lake solution: one HDFS cluster and multiple computing engines such as Hive, Spark, and Flink are used to meet the requirements of multiple AI computing scenarios. However, as the top social media application in China, Sina Weibo faces new challenges from its current business volume and complexity, and the open source data lake solution cannot meet its performance and cost requirements. The Apsara big data and AI platform of Alibaba Cloud (MaxCompute + PAI + DataWorks) enables Sina Weibo to resolve performance bottlenecks in feature engineering, model training, and matrix computing in ultra-large-scale data computing scenarios. As a result, the Alibaba Cloud MaxCompute platform (data warehouse) and the open source platform (data lake) coexist.
· Pain points
Sina Weibo hopes to use the two heterogeneous big data platforms to maintain the flexibility of AI-oriented data and computing while also resolving the performance and cost problems of computing and algorithms in ultra-large-scale data computing scenarios. However, the two big data platforms are completely separated at the cluster level: data and computing on one platform are isolated from the other. This significantly increases the costs of data transfer, computing, and development, and restricts business development. In short, the following pain points exist:
1) Dedicated personnel must be assigned to synchronize training data, which is labor-intensive.
2) Synchronizing large amounts of training data is time-consuming and cannot meet real-time training requirements.
3) You must write new SQL statements to process data. Existing Hive SQL statements cannot be used.
To address the preceding pain points, the Alibaba Cloud product team and the Sina Weibo machine learning platform team jointly built the lakehouse technology. The data warehouses of Alibaba Cloud MaxCompute can interact with the data lakes of EMR Hadoop, which enables an AI mid-end that integrates data lakes and data warehouses to be built. The network infrastructure of MaxCompute was thoroughly upgraded so that users can connect to their own VPCs. The Hive database mapping capability, together with the powerful SQL and Machine Learning Platform for AI (PAI) engines, seamlessly integrates MaxCompute data warehouses and EMR Hadoop data lakes. This allows data lakes and data warehouses to be managed and scheduled in a centralized and intelligent manner.
· This solution merges the best parts of data lakes and data warehouses, strikes a balance between flexibility and efficiency, and enables a unified AI mid-end, which improves the business support capability of the machine learning platform team. A set of jobs can be seamlessly and flexibly scheduled between MaxCompute clusters and EMR clusters without the need to migrate data or jobs.
· SQL-based data processing tasks are scheduled to run in MaxCompute clusters, which improves performance. Based on the algorithms provided by Alibaba Cloud PAI, various business scenario-specific algorithm services are encapsulated to meet more business requirements.
· The cloud native elastic resources of MaxCompute and EMR cluster resources complement each other for load leveling. The resources of the two systems are used during each other's off-peak hours, which reduces overall costs and the time jobs spend queued.
Data lakes and data warehouses are two data architecture design approaches for building distributed systems with modern big data technologies. To determine which approach to take, you must consider whether your business favors flexibility or enterprise-class features such as cost, performance, security, and governance. However, the boundary between data lakes and data warehouses is gradually blurring: the governance capabilities of data lakes and the ability of data warehouses to connect to external storage are being continually enhanced. In this context, MaxCompute takes the initiative in proposing the integration of data lakes and data warehouses, showing the industry and users an architecture in which data lakes and data warehouses complement and cooperate with each other. This architecture provides users with the flexibility of data lakes and a host of enterprise-class data warehouse features, further reducing the total cost of ownership (TCO) of using big data. For these reasons, this architecture represents the direction in which the next-generation big data platform will evolve.