This is a first part of the Data Lakehouse and data pipelines implementation in the Delta Lake. Source code GitHub repositories are at the end of this article. For the code explanation, please take a look at the second part.
Road to Lakehouse - Part 2: Ingest and process data from Kafka with CDC and Delta Lake's CDF
What and why Data Lakehouse?
Lakehouse is a buzzword in the data field nowadays. According to AWS, the lake house architecture is about integrating a data lake, a data warehouse, and purpose-built stores to enable unified governance and easy data movement. Regarding to Databricks, the Lakehouse is an open architecture that combines the best elements of data lakes and data warehouses.
And for me, the Lakehouse is simply the data lake with both columnar files and delta files (transaction logs) in row based formats.
Delta Lake, Iceberg and Hudi are 3 popular data lake table formats that support ACID, schema evolution, upsert, time travel, incremental consumption, etc.
In this post, we will focus on the Delta Lake with support from Databricks.
Data lakehouse features
In comparison with the pure data lake, the Lakehouse provides:
- ACID transactions to ensure data integrity and consistency as multiple parties concurrently read or write data.
- Schema enforcement and evolution that we can optionally raise an error when data source schema changes or automatically merge it with the new one.
- Time travel or data versioning allows us to access and revert to earlier versions of data, rollbacks or reproduce experiments.
- Full DML supports UPDATE, DELETE and especially MERGE INTO is really useful in case of using with CDC and Delta Lake Change Data Feed (CDF).
- End-to-end streaming eliminates the need for separating systems dedicated to serve real-time data applications because Delta Lake table is both a batch as well as a streaming source and sink
The Lakehouse also has more features compared with the traditional Data Warehouse:
- Openness: The storage formats are open and standardized such as Apache Parquet which enables Delta tables to be queryable by different tools with or without Spark like Presto, Trino and different languages including Scala, Java, Rust, Python, Ruby, .etc
- Support for diverse data types ranging from unstructured to structured data, so we can not only store structured, semi-structured data and text in the Lakehouse but also binary files like images, video, audio, .etc
- Support for diverse workloads including data science, machine learning, and SQL analytics. It means that we do not need to move the data between data lake and data warehouse for training data, one Lakehouse destination is nearly enough for almost all purposes.
There are some highlights of Delta Lake:
- Delta Sharing secures the way to share data with other organizations regardless of which computing platforms they use.
- Data optimization including compaction (coalescing small files into larger ones) and ordering.
- Vacuum: remove data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold.
- Table caching in the Spark nodes’ local storage.
Data pipeline overview
For a demonstration, we will implement a below data pipeline in the Delta lake.
- Regarding to data sources, there can be multiple sources such as:
- OLTP systems logs can be captured and sent to Apache Kafka (1a)
- Third party tools can also be integrated with the Kafka through various connectors (1b)
- And back-end team can produce event data directly to the Kafka (1c)
2. Once we have the data in the Kafka, we will use Spark Structured Streaming to consume the data and load it into Delta tables in a raw area. Thanks to checkpointing which stores Kafka offsets in this case, we can recover from failures with exactly-once fault-tolerant. We need to enable the Delta Lake CDF feature in these raw tables in order to serve further layers.
3. We will use the Spark Structured Streaming again to ingest changes from the raw tables. Then we would do some transformations like flattening and exploding nested data, .etc and load the cleansed data to a next area which is a refined zone. Remember to add the CDF in the refined tables properties.
4. Now we are ready to build data mart tables for business level by aggregating or joining tables from the refined area. This step is still in near real-time process because one more time we read the changes from previous layer tables by Spark Structured Streaming.
5. All the metadata is stored in the Databricks Data Catalog and all above tables can be queried in Databricks SQL Analytics, where we can create SQL endpoints and use a SQL editor. Beside the default catalog, we can use an open source Amundsen for the metadata discovery.
6. Eventually we can build some data visualizations from Databricks SQL Analytics Dashboards (formerly Redash) or use BI tools like Power BI, Streamlit, .etc.
Conclusion
The Lakehouse provides a comprehensive architecture that benefits from both Data Lake and Data Warehouse. Just with the straight data pipeline, we can serve for analytical and operational purposes. In the next part, we will ingest streaming data from MongoDB with Debezium and back-end events then load it into the first zone of the Lakehouse, after that we will process changes from the raw area and store cleansed and flattened data in a refined zone.
No comments:
Post a Comment