Kafka with CDC and Delta Lake’s CDF

This is the second part of the Data Lakehouse and data pipeline implementation with Delta Lake. The source code GitHub repositories are linked at the end of this article. For the high-level pipeline architecture, please take a look at the first part:

Road to Lakehouse — Part 1: Delta Lake data pipeline overview

Raw Ingestion

I divide the Kafka data into two categories: event data, which comes from the backend application, and CDC data, which is generated by Debezium. The main PySpark job consists of the following steps.

Ingest data from Kafka

To ingest the data from Kafka, we just need to specify the credentials and the topic name.
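As a minimal sketch (the broker addresses, security options, and topic name below are placeholders, not values from the project), the source stream could be declared like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-ingestion").getOrCreate()

# Placeholder connection details; replace with the real cluster and topic.
kafka_bootstrap_servers = "broker-1:9092,broker-2:9092"
topic = "shop.inventory.products"

kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers)
    .option("subscribe", topic)
    .option("startingOffsets", "earliest")
    # SASL/SSL settings (e.g. kafka.security.protocol, kafka.sasl.jaas.config)
    # would be added here, depending on how the cluster is secured.
    .load()
)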

Process the Kafka data

We need to get the schema of the topic first.
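For event topics, one option is to declare the expected value schema by hand as a StructType; the fields below are purely illustrative:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Illustrative event schema; real topics have their own fields.
event_schema = StructType([
    StructField("event_id", StringType(), True),
    StructField("event_type", StringType(), True),
    StructField("user_id", StringType(), True),
    StructField("created_at", LongType(), True),
    StructField("properties", StringType(), True),  # nested payload kept as raw JSON text
])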

In the case of CDC, below is an example of a MongoDB payload schema.
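As a rough, simplified sketch (field names can differ between Debezium versions and connector settings), a Debezium MongoDB change event can be described like this; note that the MongoDB connector delivers the document itself as a JSON string in the after field:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Simplified shape of a Debezium MongoDB change event.
cdc_schema = StructType([
    StructField("payload", StructType([
        StructField("after", StringType(), True),   # full document as a JSON string
        StructField("patch", StringType(), True),   # partial update, when applicable
        StructField("op", StringType(), True),      # c = create, u = update, d = delete, r = snapshot read
        StructField("ts_ms", LongType(), True),     # event timestamp in milliseconds
        StructField("source", StructType([
            StructField("db", StringType(), True),
            StructField("collection", StringType(), True),
        ]), True),
    ]), True),
])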

Once we have the schema, it is easy to parse the data from the plain-text message values.
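Putting the pieces together, a hedged parsing step casts the binary Kafka value to a string and applies from_json with the schema above (an event topic would use event_schema instead):

from pyspark.sql.functions import col, from_json

# Kafka delivers the value as binary; cast it to a string, then parse it with the schema.
parsed_df = (
    kafka_df
    .selectExpr("CAST(value AS STRING) AS json_value", "timestamp AS kafka_timestamp")
    .withColumn("data", from_json(col("json_value"), cdc_schema))
    .select("data.payload.*", "kafka_timestamp")
)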

Load the processed data into the raw zone

Finally, let's write the stream to a Delta table in the raw zone. Spark stores the Kafka offsets in the checkpoint location for failure recovery.
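A sketch of the sink, assuming the raw table is registered as raw.products and the checkpoint path is ours to choose:

# The table name and checkpoint path are placeholders; the checkpoint location
# is where Spark keeps the processed Kafka offsets for failure recovery.
raw_query = (
    parsed_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/lakehouse/_checkpoints/raw_products")
    .toTable("raw.products")
)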

Refined zone

There are several ways to transform data between layers inside the Delta Lake, such as dbt or Delta Live Tables, but we can also leverage a built-in feature of Delta tables, the Change Data Feed (CDF), to merge changes into the next area in near real time without any additional cost.
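CDF has to be switched on per table (or enabled by default for new tables through a Spark configuration); assuming the raw table from the previous section, it can be enabled like this:

# Enable the Change Data Feed on the raw table (table name is illustrative).
spark.sql("""
    ALTER TABLE raw.products
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")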

The main PySpark job extracts the changes from the raw table, then processes them and loads them into the refined table, following the steps below.

Read change feed from raw tables
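Assuming the same raw.products table, the change feed can be read as a stream; each row carries the _change_type, _commit_version, and _commit_timestamp metadata columns:

# Stream the change feed from the raw Delta table.
cdf_df = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("raw.products")
)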

Process the CDF

We can do some transformations, such as flattening a nested JSON document or exploding an array, before loading the data into the refined layer.
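A hedged example of both operations, assuming the MongoDB after field holds a product document with a tags array (the document schema below is an assumption, not taken from the project):

from pyspark.sql.functions import col, explode_outer, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, ArrayType

# Illustrative document schema for the source collection.
product_schema = StructType([
    StructField("_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("tags", ArrayType(StringType()), True),
])

refined_df = (
    cdf_df
    .filter(col("_change_type").isin("insert", "update_postimage"))
    .withColumn("doc", from_json(col("after"), product_schema))  # flatten the nested JSON document
    .withColumn("tag", explode_outer(col("doc.tags")))           # explode the array column
    .select(
        col("doc._id").alias("product_id"),
        col("doc.name").alias("name"),
        col("doc.price").alias("price"),
        col("tag"),
        col("_commit_timestamp").alias("changed_at"),
    )
)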

Conclusion

The streaming data pipeline above is a good starting point for building further business-level tables. In a real project you might need more complicated operations, such as upserting or joining multiple tables, but the general idea of using Spark Structured Streaming and the CDF still applies. There is much more that can be done with Delta Lake, such as querying data with the SQLAlchemy ORM, visualizing data with Streamlit, and building ML workflows with Databricks Feature Store, AutoML, MLflow, etc. Hopefully we can discuss them in the next article.
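As an illustration of the upsert case mentioned above, one common pattern is to apply each micro-batch of changes with a Delta MERGE inside foreachBatch. In the sketch below, changes_df stands for any streaming DataFrame that carries a product_id key plus the CDF metadata columns; the names and paths are placeholders, not code from the project.

from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

def upsert_to_refined(batch_df, batch_id):
    # Drop pre-images and keep only the latest change per key within the micro-batch.
    w = Window.partitionBy("product_id").orderBy(F.col("_commit_version").desc())
    latest = (batch_df
              .filter("_change_type != 'update_preimage'")
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .drop("rn"))

    # Map only the business columns; the CDF metadata columns stay out of the target table.
    business_cols = {c: f"s.{c}" for c in latest.columns if not c.startswith("_")}

    target = DeltaTable.forName(batch_df.sparkSession, "refined.products")
    (target.alias("t")
        .merge(latest.alias("s"), "t.product_id = s.product_id")
        .whenMatchedDelete(condition="s._change_type = 'delete'")
        .whenMatchedUpdate(condition="s._change_type != 'delete'", set=business_cols)
        .whenNotMatchedInsert(condition="s._change_type != 'delete'", values=business_cols)
        .execute())

(changes_df.writeStream
    .foreachBatch(upsert_to_refined)
    .option("checkpointLocation", "/lakehouse/_checkpoints/refined_products")
    .start())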

Source code
