Kafka with CDC and Delta Lake’s CDF
This is the second part of the Data Lakehouse and data pipeline implementation in Delta Lake. Links to the source code GitHub repositories are at the end of this article. For the high-level pipeline architecture, please take a look at the first part:
Road to Lakehouse — Part 1: Delta Lake data pipeline overview
Raw Ingestion
I divide the Kafka data into two categories: event data, which comes from the backend application, and CDC data, which is generated by Debezium. The main PySpark job is outlined below.
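As a rough sketch, the job reads each topic, parses its payload with the matching schema, and writes the result to the raw zone. The topic names and helper functions here are placeholders for illustration, not the exact code from the repository; the following sections fill in each step.

```python
# Sketch of the raw ingestion job -- topic names and helpers are placeholders.
EVENT_TOPICS = ["shop.page_views"]          # produced by the backend application
CDC_TOPICS = ["shop.orders", "shop.users"]  # produced by Debezium

for topic in EVENT_TOPICS + CDC_TOPICS:
    kafka_df = read_from_kafka(topic)                       # "Ingest data from Kafka"
    parsed_df = parse_payload(kafka_df, get_schema(topic))  # "Process the Kafka data"
    write_to_raw(parsed_df, topic)                          # "Load the processed data into a raw area"
```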
Ingest data from Kafka
To ingest the data from Kafka, we just need to specify the broker credentials and the topic name.
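For example, with Spark Structured Streaming's Kafka source it could look like this (the broker address, credentials and topic name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-ingestion").getOrCreate()

# Placeholder broker, credentials and topic -- replace with your own values.
kafka_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="<api_key>" password="<api_secret>";',
    )
    .option("subscribe", "shop.orders")
    .option("startingOffsets", "earliest")
    .load()
)
```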
Process the Kafka data
We need to get the schema of the topic first.
In the case of CDC data, below is an example of a MongoDB payload schema.
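A simplified version of the Debezium MongoDB envelope could be declared like this (the real payload carries more source metadata; note that the document state arrives as JSON strings because MongoDB collections are schemaless):

```python
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Simplified Debezium MongoDB payload schema.
mongo_cdc_schema = StructType([
    StructField("before", StringType(), True),  # document before the change (JSON string)
    StructField("after", StringType(), True),   # document after the change (JSON string)
    StructField("op", StringType(), True),      # c = create, u = update, d = delete, r = snapshot read
    StructField("ts_ms", LongType(), True),     # time the change was processed
    StructField("source", StructType([
        StructField("db", StringType(), True),
        StructField("collection", StringType(), True),
    ]), True),
])
```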
Once we have the schema, it is easy to extract the data as plain text.
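Assuming the Debezium connector is configured without the schema envelope, the Kafka value can be parsed with from_json:

```python
from pyspark.sql.functions import col, from_json

parsed_df = (
    kafka_df
    # Kafka delivers key and value as binary, so cast them to strings first.
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")
    .withColumn("payload", from_json(col("value"), mongo_cdc_schema))
    .select("key", "payload.*", "timestamp")
)
```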
Load the processed data into a raw area
Finally, let's write the stream to a Delta table in the raw zone. Spark stores the Kafka offsets in the checkpoint location for failure recovery.
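A minimal write could look like the following; the storage paths and trigger interval are assumptions:

```python
query = (
    parsed_df.writeStream
    .format("delta")
    .outputMode("append")
    # The checkpoint keeps the consumed Kafka offsets, so a restarted job
    # resumes exactly where it left off.
    .option("checkpointLocation", "s3://lakehouse/checkpoints/raw/orders")
    .trigger(processingTime="1 minute")
    .start("s3://lakehouse/raw/orders")
)
```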
Refined zone
There are several ways to transform data between the layers inside the Delta Lake, such as using dbt or Delta Live Tables. However, we can leverage a built-in property of Delta tables, the Change Data Feed (CDF), to merge changes into the next area in near real-time without any additional cost.
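CDF is not on by default, so it has to be enabled on the raw table first, for example via a table property (the table path below is a placeholder):

```python
# Enable the Change Data Feed on the existing raw table; it can also be set
# at table creation time with TBLPROPERTIES (delta.enableChangeDataFeed = true).
spark.sql("""
    ALTER TABLE delta.`s3://lakehouse/raw/orders`
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
```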
Below is the main PySpark job that extracts changes from the raw table, then processes and loads them into the refined table.
Read change feed from raw tables
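With CDF enabled, the change feed can be consumed as a stream by passing the readChangeFeed option (the path and starting version are assumptions):

```python
cdf_df = (
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .load("s3://lakehouse/raw/orders")
)
# Besides the table columns, CDF adds _change_type, _commit_version
# and _commit_timestamp to every row.
```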
Process the CDF
We can do some transformations, such as flattening a nested JSON or exploding an array, before loading the data into the refined layer.
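Here is a minimal sketch, assuming a hypothetical order document with a nested customer struct and an items array (the schema and paths are illustrative only, not the actual tables from the repository):

```python
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Hypothetical document schema for the "after" JSON string.
order_schema = StructType([
    StructField("_id", StringType(), True),
    StructField("customer", StructType([StructField("id", StringType(), True)]), True),
    StructField("items", ArrayType(StructType([
        StructField("sku", StringType(), True),
        StructField("quantity", StringType(), True),
    ])), True),
])

refined_df = (
    cdf_df
    # Keep only the new state of each row; drop the pre-images of updates.
    .filter(col("_change_type").isin("insert", "update_postimage"))
    .withColumn("doc", from_json(col("after"), order_schema))
    .select(
        col("doc._id").alias("order_id"),
        col("doc.customer.id").alias("customer_id"),  # flatten a nested struct
        explode(col("doc.items")).alias("item"),      # explode an array
        col("_commit_timestamp").alias("changed_at"),
    )
)

query = (
    refined_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://lakehouse/checkpoints/refined/order_items")
    .start("s3://lakehouse/refined/order_items")
)
```

For tables that need deduplicated or updated rows rather than an append, the same stream can be routed through foreachBatch and a Delta MERGE.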
Conclusion
The streaming data pipeline above is a good starting point for building further business-level tables. In a real project you might need more complicated operations, such as upserting or joining multiple tables, but the general idea of using Spark Structured Streaming and CDF is still valid. There are many more things to do with Delta Lake, such as querying data with the SQLAlchemy ORM, visualizing data with Streamlit, and building ML workflows with Databricks Feature Store, AutoML, MLflow, etc. Hopefully we can discuss them in the next article.