Developing Data Engineering Solutions with Databricks


Data Solutions Architecture — image created by the author

The goal of a data engineering (DE) solution is to provide the right stakeholders with the data they need, in the format they need, when they need it.

I emphasise “solution” over “pipeline” because data processing code is just one part of a data engineering solution. In my opinion, coding transformation and processing logic is not the same thing as developing the overarching solution that ultimately delivers the value.

There is a lot of content on the internet on how to develop pipelines with Spark and Databricks.

In this article, I take a broader view on general data engineering challenges in developing solutions and how to address them in Databricks.

Why We Need Environments and Tests

Data is not inherently valuable. It’s only valuable when it’s used by people. People only use data they trust and which provides the information they need. In consequence, the most important aspects of every data product are reliability, stability and relevance.

To be stable and reliable, solutions need to pass quality assessments. Therefore, we need at least two environments: one where we develop, experiment, and test, and one that contains the most stable version of the solution, which is then used by people or applications. The second type of environment is called “production.” Production can mean various things to different people. For me, a solution is in production as soon as someone else relies on its output.

Development Environments

To develop data processing code, apart from storage and compute, we need data and information about the data. In production environments, we have to process the real data generated by the source systems. However, developing the logic based on live data is oftentimes not possible because:

  • It might be confidential
  • We don’t want to put extra load on the source system by continuously retrieving data from it
  • It’s too expensive/complicated to develop on the actual data
  • It’s not possible/allowed to create connections to productive source systems from development environments

We are then left with two other options:

  • Extracting a sample of the data
  • Creating mock/fake representative data based on descriptions

Using either approach, we may not cover all edge cases, but it will allow us to continue with development.

To make the code as reliable as possible, we want to break down the logic into small pieces so that we can limit the complexity and the number of issues that can occur. We package these small pieces of logic into functions. Functions are supposed to do one single thing and follow certain best practices.

To test individual functions, we use unit tests. Unit tests have several main components: the setup phase where the environment and variables are prepared, the execution phase where the function is called with specific inputs, and the verification phase where the outputs are compared against expected results. Additionally, unit tests often include a teardown phase to clean up any changes made during the test.
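
To make this concrete, below is a minimal sketch of that pattern using pytest and PySpark. The function under test (add_ingestion_date) is a made-up example, and on Databricks you would typically reuse the workspace's existing spark session instead of creating and stopping one.

import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_ingestion_date(df):
    # Hypothetical function under test: adds an ingestion date column
    return df.withColumn("ingestion_date", F.current_date())


@pytest.fixture(scope="module")
def spark():
    # Setup: create (or reuse) a SparkSession; on Databricks one already exists
    session = SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()
    yield session
    # Teardown: release the session once the tests in this module have run
    session.stop()


def test_add_ingestion_date(spark):
    # Setup: prepare a small input DataFrame
    input_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    # Execution: call the function under test
    result_df = add_ingestion_date(input_df)
    # Verification: the new column exists and no rows were lost
    assert "ingestion_date" in result_df.columns
    assert result_df.count() == 2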

In an ideal scenario, we would have a perfect description of the data. Then we could develop tests that ensure the functions will always perform as expected. However, the reality is that, except for very simple cases, data will always eventually present some anomaly. To cover the most expected cases, functions are developed iteratively on sample and mock data and then validated with the best available test data.

With that being said, testing with static data becomes complicated when we want to test how our logic behaves with a continuous stream of new data. For example, we might want our function to only process the latest data. For that, we need to have a mechanism to run the function, see if it does what it is supposed to, add new data to the source and then run and test it again.

Similarly, we want to observe how the target data evolves over time with each new insert, update, or merge. We test how each operation behaves alone, but we also need to see if, after several iterations, the output still matches our expected results.

Regardless, even if we test every function perfectly in isolation, the likelihood that the entire solution will work with real-life data is relatively low, and it is just as unlikely that other requirements, such as performance, will be met.

Therefore, before deploying to production, we need a third environment that closely resembles the production environment. This is called a test environment.

Test Environments

Test environments are used to test for end-to-end consistency and performance. They help identify issues that might not be evident in unit tests, ensuring that the solution can handle real-world conditions.

To avoid deploying faulty code into production, the test environment should contain real data. It should depict end-to-end scenarios, including all processing steps and connections to source and target systems. Additionally, the test environment should have settings similar to the production environment, such as clusters with the same performance.

We have multiple types of tests: integration tests, system tests, performance tests, and other variations. Integration tests focus on the interaction between components, system tests evaluate the entire system’s behavior, and performance tests assess the system’s responsiveness and stability under load. In addition, acceptance tests might be used to validate that the output fulfills the business requirements.

Production Environments

The purpose of a production environment is to provide users and applications with stable, reliable, and up-to-date data. Once people start relying on the system to make decisions, interruptions or inconsistencies in the production environment can have severe implications.

To avoid this, we also want to closely monitor the production environment to ensure that we deliver the expected performance and availability. This involves tracking key metrics such as data quality, setting up alerts, and regularly checking the processes to make sure everything is working correctly.

To reach production, the code should pass through all tests so that we can achieve the goals of reliability, stability, and relevance we set out in the beginning.

Moreover, the initial versions of the productive system might be created manually, but eventually, no code or data should be introduced manually into the production environment to avoid errors. Implementing Continuous Integration and Continuous Deployment (CI/CD) pipelines ensures that every change is automatically tested and deployed.

Production environments also need to be depicted as Infrastructure as Code (IaC) so that, in case of failure, they can be immediately recreated. By using IaC, we can also reuse the infrastructure setup in other projects or create comprehensive standalone tests without affecting production. Tools like Terraform and Azure Resource Manager can be used to define and manage the infrastructure.

How to Approach Environments and Testing in Databricks

Now that we have covered the theory, let’s look at the options we have in Databricks. Depending on the circumstances, we might need more or less complicated setups.

Databricks workspaces are the interfaces we use to connect code, storage, compute, and data. Even though it’s theoretically possible to create the “environments” within the same workspace by restricting access, enforcing policies for certain user groups, etc., in practice, environments are usually separated into their own workspaces.

This means that associated resources such as storage, networking, and governance also can or should be separated, and we need to configure them according to the requirements of the particular environment. For example, production data should be stored in some kind of redundant and performant storage location, while this might not be necessary for development environments.

However, one important point to mention is that with Unity Catalog, much of the metadata and user management has been centralised. This is in contrast to the legacy Hive metastore, where every workspace has its own isolated components. Whether and how we want to use Unity Catalog depends a lot on the organisational framework, but the newest features, such as central user management and the availability of catalogs, can impact how we design certain parts of the environment.

Comparison between Legacy Hive Metastore and Unity Catalog. Image created by the author based on the Unity Catalog Overview by Databricks

Another central question is how we want to manage the different workspaces and their scope. In their article, “Functional Workspace Organisation on Databricks”, the authors provide a very good overview of the different approaches and considerations:

  • One set of Dev-Test-Prod workspaces for the entire organisation
  • Separate Dev-Test-Prod workspaces for each line of business
  • Separate Dev-Test-Prod workspaces for each data product

Each variation has its own pros and cons. I will not reiterate the general recommendations for workspace setups but rather focus on the implications and factors relevant for Data Engineering.

Overarching Databricks Features for Development and Testing

Before going into the details for each individual environment, there are certain overarching features that apply to all environments, which I will briefly describe here:

  • Testing:
    Testing notebooks is different from testing classical code, but there are ways to do it, such as using the built-in Python unittest package or pytest, widgets for mode selection, and scheduling tests to run automatically
  • Static Code Analysis:
    Even though it is not as straightforward as with classical code, Databricks supports tools for monitoring and analysing notebook command logs, which helps maintain code quality and compliance
  • Repos:
    Databricks Repos provides integrated version control for notebooks and other files. It supports Git-based workflows, enabling collaboration, version history, and code review processes directly within the Databricks workspace
  • CLI, API, and SDK:
    Databricks offers a broad range of options to automate and manage tasks through its CLI, API, and SDK
  • Monaco Editor:
    In 2023, Databricks adopted the Monaco Editor, which is also used in VS Code. The editor introduced autocomplete, parameter hints, docstrings on hover, syntax highlighting, code collapsing, Python code formatting with black, and many more capabilities. Recently, Databricks also introduced the option of debugging inside of notebooks. All of these features make the development experience inside of Databricks much more professional and are real improvements compared to the previous versions of the editor
  • Orchestration:
    Databricks has its own built-in orchestration tool called Workflows. Workflows play an important role in all stages of the development cycle and in the production setup, enabling us to schedule, automate, and manage pipelines and tasks within Databricks

Developing Locally vs. On Databricks Clusters

Spark is the execution engine of Databricks. We can use the Python, SQL, R, and Scala APIs of Spark to run code on Spark clusters. But Databricks is more than just an execution environment for Spark (even though it can be if that is what is needed). It offers many additional and proprietary features such as Unity Catalog, SQL Warehouses, Delta Live Tables, Photon, etc. For many companies, these features are the reason why they choose Databricks over other solutions.

However, most of the processing logic usually uses functionality that is also available in the open-source Spark or Delta versions. This means that we can theoretically create a local environment with the right Spark and Delta versions which mimic the Databricks Runtime. We can then develop and unit test our logic there and then deploy the code to the test environment. There can be good reasons to do this. The most cited ones are reducing cost and more professional development in IDEs in contrast to notebooks.

However, there are several things to consider. Apart from simple issues, such as the missing Databricks utility functions (dbutils), which we can implement in other ways, there are some more important factors. Developing without any sample data is difficult unless the requirements are perfect. If we have sample data, we might not be allowed to download it onto a local machine. Alternatively, the requirements need to be so precise that we can break down the logic into such small and abstract pieces that the data itself becomes irrelevant.

I am also personally not a fan of this approach because even if there is a single mismatch between the environments, the effort to figure out why will probably exceed the cluster costs. Moreover, with the latest features Databricks provides — debugging in notebooks, the variables explorer, Repos, the newest editor, easier unit testing, etc. — development inside of notebooks is much more professional compared to a couple of years ago.

We ultimately also want to develop and experiment with other features such as workflows, clusters, dashboards, etc., and play around a bit. So, we will need to have at least one development workspace. Another consideration is that the cheapest 14 GB RAM cluster currently costs about $0.41 per hour. If we need more computational power for development than what a typical local machine offers, we will anyway have to develop on a Databricks cluster unless we have an on-prem setup.

Overall, developing directly on Databricks clusters is generally easier and more straightforward. Nonetheless, if cost is a significant factor and the circumstances are right, it might be worth investigating a local development workflow. However, this will become more difficult over time as more proprietary features that we also want to use in development are introduced.

For those who want to work in an IDE, Databricks Connect V2 (which is significantly better than V1 from a couple of years ago) might be a viable option to consider.

In conclusion, we have three main options for the actual development:

  1. In Notebooks
  2. In Local Environment — Replicating Databricks Runtime
  3. In IDE on Databricks Clusters — Databricks Connect V2

In the end, the development environment doesn’t matter per se. The code should be abstract and parametrisable, without depending on the context. As long as we can move the code to the test environment without manual modifications, the best approach is the one that meets your personal, team’s, and organisation’s goals.

Test Environments in Databricks

To perform integration, system, and performance tests, we need the test environment to be as similar as possible to the production environment. Setting up a robust test environment involves several considerations:

1. Data Consistency
We need to ensure that the test environment contains a representative subset of the production data (if feasible, even the real data). This allows for realistic testing scenarios, including edge cases. Using Delta Lake, the standard table format in Databricks, we can create “versioned datasets”, making it easier to replicate production data states in the test environment.
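
For example, a test can pin its input to a specific table state with Delta time travel (the table name and version are illustrative):

# Pin the test input to a known table version
pinned_df = spark.sql("SELECT * FROM sales VERSION AS OF 42")

# Or pin it to a point in time
pinned_df = spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2024-01-01 00:00:00'")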

2. Cluster Configuration
We should match the cluster configurations between the test and production environments. This includes cluster size, types of instances used, and any specific configurations like auto-scaling policies. Almost every asset we have in Databricks can be depicted in code. Even if we don’t automate the creation of the artefacts, we can still create identical copies using the CLI, SDK or API.

3. Performance Testing
Databricks offers several tools to measure a solution's responsiveness and stability under load. We can use the Spark UI to see the query execution plans, jobs, stages, and tasks. Databricks also provides compute metrics that allow us to monitor CPU and memory usage as well as disk and network I/O. We can create scenarios to simulate high-load situations and then measure how the system performs. In addition, we can also consider other features such as Photon (Databricks's proprietary, vectorised execution engine written in C++).

4. Integration and System Tests
In contrast to unit tests, there isn’t a lot of information online on how to create integration or system tests in Databricks. This is likely due to the complexity and variability of the topic. Integration tests should cover all steps of the pipeline, from ingestion to serving. For simple pipelines, this can be done fairly easily, but if we need to set up tens of tables that then need to be joined together and recreated every time, such tests can be very resource-intensive.

Therefore, I recommend utilising Delta Lake functionalities such as “time travel” and “deep/shallow clones”. We can maintain a stable data setup within the environment, and for every test we can create copies of the persistent data or reset the tables to the original state after every test.
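
A minimal sketch of that idea (table names are illustrative): create a cheap working copy with a shallow clone before the test, and reset a mutated table to a known state afterwards.

# Shallow clone: references the source files instead of copying the data
spark.sql("CREATE OR REPLACE TABLE orders_testcopy SHALLOW CLONE orders")

# ... run the integration test against orders_testcopy ...

# Reset the table to its original state after the test
spark.sql("RESTORE TABLE orders_testcopy TO VERSION AS OF 0")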

Production Environments in Databricks

There are four main components to every data solution: data, storage, code and compute. In addition to that, governance and security (In-depth guide of security best practices from Databricks) also play a crucial role.

In productive environments, all of these factors require the highest level of professionalism to achieve a stable and reliable solution.

1. Storage
Production data should be stored in redundant and high-performance storage locations. Databricks itself discourages storing data on the Databricks Filesystem (DBFS), so we should use external solutions such as Azure Data Lake Storage or AWS S3. This approach makes our tables unmanaged (external): if a table is mistakenly deleted in Databricks, only the metadata in the workspace is removed, while the underlying data in the storage location is retained and can be used to recreate the table inside the workspace.

2. Code
Production code needs to be stored in version control systems. Databricks offers direct integrations with all major providers, such as GitLab, GitHub, and Azure DevOps. Moreover, production branches need to be protected, and before deploying to the production environment, the code should be validated both automatically and manually.

3. Data
Data in production is often confidential and requires protection from unauthorised access. This involves implementing access controls, applying data masking, and using encryption features for data both at rest and in transit.

Another factor to consider is that production environments often depend on various source systems for data ingestion. We need to ensure that these dependencies are managed through reliable and secure connections. We can also consider setting up monitoring for these source systems as well to anticipate and mitigate potential issues that could affect data ingestion and processing.

4. Compute
Databricks offers a wide range of cluster types, sizes, and configurations. For running production workloads, we should use dedicated job clusters to ensure isolation and consistent performance. Spot instances are not a good choice because they can be reclaimed at any time, leading to potential disruption of critical tasks. Instead, by using dedicated instances, we can ensure stable and reliable performance.

Auto-scaling is also a feature worth considering, but in my experience, we need to carefully evaluate its use. While it can save costs by adjusting resources based on demand, we should assess the variability in the load to avoid unnecessary latency and instability due to up and down scaling.

Photon is Databricks’s vectorised query engine that supports both SQL workloads and DataFrame API calls. Photon makes vectorised operations significantly faster but is also twice as expensive and has several limitations, such as no support for UDFs and Structured Streaming. Therefore, before enabling it, we should carefully benchmark the code to see if the performance improvements are worth it and if we are mainly using the supported operators, expressions, and data types. If this is not the case, then the default execution engine is the better choice.

We should also regularly monitor cluster performance and adjust configurations based on workload requirements to maintain efficiency in production environments. Additionally, we should use either Databricks’s built-in notification mechanism or another third-party tool to alert the responsible parties if issues come up.

5. Governance and Security
From organisational governance (identity and role management, access control, permissions, etc.) to data governance (data discovery, access, lineage, sharing, auditing, metadata management, etc.) and network security, there is a lot to take into account for productive environments.

As these are quite extensive topics, I would like to refer you to my other two articles, which cover governance and networking in detail.

In essence, Databricks currently offers two distinct features sets for governance: The classical Hive Metastore and Unity Catalog. Regardless of the tool, we should ensure limited access to the code and the environment. Additionally, all actions and assets within the production workspaces should eventually be managed by automation tools to prevent manual errors.

Network security also plays an important role, and we can improve the security of our workspaces by deploying them in secure virtual networks, restricting inbound and outbound traffic, and using private endpoints for accessing storage and other services. Regular audits and monitoring of logs are also important to detect and respond to potential security incidents rapidly.

How to “Promote” Code from One Environment to the Next

The end goal is to develop the code, automatically test it, and move it to the next environment until it reaches production. In Databricks, we have four options to do this:

1. Manually
The easiest way is to simply copy the files manually from one environment to the next. Before investing in automation, this can be a good first step to understand what needs to be moved and configured. Manual tasks are prone to errors and are not scalable for larger projects, but understanding what needs to happen can ensure that we set up automation that actually fulfils our requirements.

2. Copying Code from One Environment to the Next Using a CI/CD Tool
We can integrate Databricks with CI/CD tools like Azure DevOps, Jenkins, or GitHub Actions. In these tools, we can create pipelines that run unit, integration, and performance tests, and then copy the code to the next environment if all tests pass. Historically, these pipelines automated the manual movement of files. Now, instead of relying on placing the right files in the right locations, we have a more "reliable" approach: Git Folders.

3. Syncing Files Across Environments with Git Folders (Repos)
Using Git Folders is generally a good idea for collaboration and version control, but we can also use them to sync environments. We can set up branches for different environments and use pull requests to promote code. This is now also the main approach described in the Databricks documentation.

4. Deploying Code Using Asset Bundles
Asset Bundles are packages that contain all the necessary components for a Databricks project, including notebooks, libraries, configurations, and any other dependencies. They are Databricks’s approach to Infrastructure as Code (IaC). The advantage of Asset Bundles over the first three approaches is that we can deploy all kinds of artefacts, such as jobs and clusters in one go, which was previously more difficult to do. However, they are also much more complex to set up and create some overhead if the only thing we want is a pipeline for the code itself.

Common Data Engineering Challenges and How To Solve Them in Databricks

In addition to the fundamental aspects of data engineering solutions, there are some common topics which will present themselves in one form or the other during the development process. In this article, I will present three main ones and how we can address them in Databricks:

  • Change Data Capture
  • Partitioning
  • Collaborative Data Engineering Development

Change Data Capture

Before reaching the end consumer, data usually moves through several layers, each with different degrees of quality and refinement. Databricks recommends using the Medallion Architecture (Bronze-Silver-Gold).

Identifying and selecting the right data from the previous layer is a fundamental problem in data engineering, implemented in various ways in different systems. Most of the time, we don't want to reprocess the entire dataset but only the parts that have changed since the last run. This process is called Change Data Capture (CDC).

In Databricks we have 4 main ways to perform CDC:

1. Custom Watermark Values
The first option we have involves using a custom field to identify records that have not been previously processed. For example, if we have an incremental ID, we can select the maximum ID processed so far in the layer in question. Then, we only select the records from the previous layer that have an ID higher than that.

However, we need to be careful with late-arriving data. If we use the maximal value that we have processed so far, the system will not look for records with a smaller ID. This means that even if those records have not been processed yet because they arrived after the last time we ran the pipeline, they will be missed.

Because of this, it is recommended to use a value that is generated once the data reaches the processing system. For example, we can create a generated timestamp column with the moment when the data is ingested into Databricks. This approach removes the dependency on the ingestion system.

Example:

# Determine the maximum ingestion_timestamp processed so far
max_timestamp = spark.sql(
    "SELECT MAX(ingestion_timestamp) FROM target_table"
).collect()[0][0]

# Select records from source_table with a higher ingestion_timestamp than the max processed
new_records_df = spark.sql(f"""
    SELECT * FROM source_table
    WHERE ingestion_timestamp > '{max_timestamp}'
""")
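
For the watermark column itself, a minimal sketch (paths and table names are illustrative) of stamping each ingested batch with a processing-side timestamp could look like this:

from pyspark.sql import functions as F

# Hypothetical landing zone with newly arrived files
raw_df = spark.read.format("json").load("/mnt/landing/orders/")

# Stamp the batch with the moment it is ingested into Databricks
(
    raw_df
    .withColumn("ingestion_timestamp", F.current_timestamp())
    .write.format("delta")
    .mode("append")
    .saveAsTable("source_table")
)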

2. Using MERGE INTO
The “MERGE INTO” statement in Delta Lake allows us to perform upserts (update and insert) in a single command. The command matches records between a source table and a target table, updating existing records and inserting new ones.

Internally, the merge statement performs an inner join between the target and source tables to identify matches and an outer join to apply the changes. This can be resource-intensive, especially with large datasets. In theory, we could load the entire source layer into memory and then merge it with the target layer to only insert the newest records. In reality, this will not work except for very small datasets because most tables will not fit into memory and this will lead to disk spill, drastically decreasing the performance of the operations.

However, we can use “MERGE INTO” as our CDC mechanism in cases where we can reduce the size of the source and target datasets to a degree that they fit into memory. For this, we can use partition pruning and predicate pushdown to reduce the amount of data read.

For example, if we know we are only processing the latest date and we are partitioning on the date column, then we can efficiently select only the date in question. Predicate pushdown works similarly by including the filters in the read request but not necessarily on partition columns. However, predicate pushdown will only work on data sources that support it, such as Parquet, JDBC, and Delta Lake, and not on text, JSON, or XML.

Nonetheless, if we can use both in our merge statements to only select a subset of the source and target data, merges can be a viable option for CDC.

Example:

date_in_question = "20240101"
status_filter = "active"
additional_filter = "processed"

spark.sql(f"""
    MERGE INTO target
    USING (
        -- Partition pruning: filter on the partitioned column 'date'
        SELECT * FROM source
        WHERE date = '{date_in_question}'
        -- Predicate pushdown: filter on the non-partitioned column 'status'
        AND status = '{status_filter}'
    ) AS source
    ON target.id = source.id
        AND target.date = '{date_in_question}'
        AND target.status = '{additional_filter}'
    WHEN MATCHED THEN
        UPDATE SET *
    WHEN NOT MATCHED THEN
        INSERT *
""")

3. Delta Change Data Feed
Delta Change Data Feed (CDF) allows us to capture changes to Delta tables (inserts, updates, and deletes). We can enable CDF for individual tables:

ALTER TABLE aDeltaTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

Or for all new tables:

set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;

Once we have enabled CDF, we can read the change feed by specifying the start/end versions or timestamps:

# version
spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 0) \
.option("endingVersion", 5) \
.table("aDeltaTable")
# timestamps as formatted timestamp
spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingTimestamp", '2023-01-01 01:00:00') \
.option("endingTimestamp", '2023-12-31 02:00:00') \
.table("aDeltaTable")
# providing only the startingVersion/timestamp
spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 1) \
.table("aDeltaTable")

CDF Output in Databricks

The resulting DataFrame will contain the actual values and the following columns:

  • _change_type: insert, update_preimage, update_postimage, delete
  • _commit_version: The table version when the commit took place
  • _commit_timestamp: The timestamp when the commit took place

If we know for sure that we only had one new batch of data since the last run, we can simply select the rows that have the latest commit value and the _change_type = update_postimage. If multiple processing iterations took place, we need to store the latest version we have processed in some form to select all relevant commits.

Similarly, we could use the time travel functionality of delta tables to select a specific version of the tables. However, CDF gives us a more comprehensive overview where we can compare the different versions of individual records in one place. Nonetheless, we should not rely on the implicitly stored history for critical workloads. If we need comprehensive and long-term records, we should explicitly save the change data feed.
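
One hedged way to do this (the cdc_progress table is a made-up bookkeeping table) is to persist the last processed commit version and start reading from the next one on each run:

# Look up the last commit version this pipeline has already processed
row = spark.sql("SELECT MAX(last_processed_version) AS v FROM cdc_progress").collect()[0]
last_version = row["v"] if row["v"] is not None else -1

# Read only the changes committed after that version
# (if no new commits exist, this read should be guarded in real code)
changes_df = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_version + 1)
    .table("aDeltaTable")
)

# ... process changes_df ...

# Record the latest table version so the next run starts after it
current_version = spark.sql("DESCRIBE HISTORY aDeltaTable LIMIT 1").collect()[0]["version"]
spark.sql(f"INSERT INTO cdc_progress VALUES ({current_version})")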

4. Spark Structured Streaming
Spark Structured Streaming offers built-in state management capabilities. It automatically determines the newest data through checkpointing. This way, we don’t need to manually handle CDC. In Databricks, we also have AutoLoader (built on top of Structured Streaming) for file ingestion.

We can run AutoLoader in either File Notification Mode, which subscribes to the storage account’s notification queue to identify new files, or Directory Listing Mode, which lists files to check if they have been processed. Either way, there is no need for manual CDC.
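
A minimal AutoLoader sketch in Directory Listing Mode (paths are illustrative); setting cloudFiles.useNotifications to "true" switches to File Notification Mode:

stream_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # AutoLoader tracks inferred schemas and processed files here
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema/")
    .load("/mnt/landing/orders/")
)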

For subsequent layers, we can also use Structured Streaming. Historically, streaming was designed for real-time or near-real-time processing, requiring clusters to run continuously. However, we now have the option of “Streaming in Batches”.

We can benefit from all the functionality of Structured Streaming without having clusters run continuously by scheduling jobs to trigger the pipeline at certain intervals and using the availableNow trigger to only process currently available data. This way, Structured Streaming will not wait for new data, and the cluster will shut down as soon as the current data is processed. To apply transformations, we can use foreachBatch on each microbatch.
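
A minimal sketch of this pattern (table names and the checkpoint path are illustrative):

def process_batch(microbatch_df, batch_id):
    # Per-microbatch transformation logic goes here (e.g. a merge into the target)
    microbatch_df.write.format("delta").mode("append").saveAsTable("silver_orders")

(
    spark.readStream
    .table("bronze_orders")
    .writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/mnt/checkpoints/silver_orders/")
    .trigger(availableNow=True)  # process what is available now, then stop
    .start()
)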

Moreover, with Unity Catalog, we now have file arrival triggers for jobs. This allows us to set up an end-to-end streaming pipeline that runs in batches.

I personally find it more difficult to debug streaming pipelines compared to batch ones. However, for simpler logic where we can depict the transformations from one layer to the next in a single function, this approach can be very useful.

Partitioning

Historically, partitioning was essential for organising large datasets and improving query performance in data lakes for both reads and writes. However, Databricks now advises against manually partitioning tables smaller than 1 TB.

The reason is that even the best partitioning schemes, which might have been perfect for the initial data product, can become problematic as the dataset and query behaviour evolve. Designing a good partitioning scheme and adapting it over time required significant manual effort.

Because of this, Databricks has invested a lot in “logical” data organisation techniques, such as ingestion time clustering, Z-order indexing, and liquid clustering. These methods dynamically optimise data layout, improving query performance and simplifying data management without the need for static partitioning strategies.

A very good overview of the differences between Delta Lake and Hive-style partitioning and the thinking behind Databricks’s approach can be found here: Partitioning in Databricks

Collaborative Development of Data Engineering Solutions

Developing Data Engineering solutions as a team is inherently difficult. It’s neither Data Science / Machine Learning development nor “classical” software development. I will not focus on the topic too much, but I find Niels Cautaerts’ take on the matter particularly insightful (Data Engineering is Not Software Engineering).

If we want to develop a DE solution on Databricks as a team, there are three main areas we need to think about:

  1. Sharing Data
    If users need to share data, leveraging the shallow and deep clone functionality of Delta Lake can be an efficient solution to avoid conflicts. Shallow clones reference the original data without duplicating it, making them useful for working with the data structure without incurring extra storage costs. Deep clones, on the other hand, duplicate the entire dataset, allowing modifications independent of the original source. Additionally, if developers work in different branches, one approach I have tested involves creating a utility function to automatically extract the active branch name and create/reference clones of the data throughout the pipeline (see the sketch after this list).
  2. Sharing Compute
    Sharing compute resources can be a cost-effective solution that also allows us to enforce policies in a single place, but it comes with certain limitations. For example, if developers require different library versions on the cluster, need more compute power for testing specific tasks, or have varying roles with different permissions, conflicts can come up. Therefore, I would not recommend a single approach without careful consideration of these factors.
  3. Sharing Code
    In Databricks workspaces, code assets are stored at several levels: top-level workspace folders, shared workspace folders, user folders, and Repos (or Git) folders. Regardless of where development happens, each workspace should have policies in place to restrict access and manage read/write permissions to specific locations. Git folders support most Git workflows, including merges, diff views, resolving conflicts, and others, making them a valuable tool for collaborative development.
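
As mentioned in point 1, here is a minimal sketch of the branch-aware clone idea. The widget-based branch parameter and table names are illustrative; deriving the branch name automatically from the Git folder is also possible but depends on the setup.

# Branch name passed in as a notebook/job parameter (illustrative)
dbutils.widgets.text("branch", "main")
branch = dbutils.widgets.get("branch")

def branch_table(table_name: str) -> str:
    """Return a branch-specific table, creating a shallow clone of the original if needed."""
    if branch == "main":
        return table_name
    clone_name = f"{table_name}_{branch}"
    spark.sql(f"CREATE TABLE IF NOT EXISTS {clone_name} SHALLOW CLONE {table_name}")
    return clone_name

# Used throughout the pipeline instead of hard-coded table names
df = spark.read.table(branch_table("silver_orders"))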

A more significant challenge arises when developing pipelines that span multiple systems. For instance, involving the source application team in development, or using another tool such as Synapse or Fabric for the gold layer, exponentially increases the difficulty. This complexity affects versioning, aligning data states, orchestration, debugging, etc.

Cross-system solutions require a different approach compared to single-system solutions. The good news is that Databricks supports and integrates well with other tools via its SDK, API, or CLI. This flexibility allows the team to decide the extent to which they want to use Databricks for the pipeline. I have seen cases where Databricks is used more as an execution engine rather than a development environment, which can also be a very valid approach.

Conclusion

No article, book, or video on the topics I have just described can fully cover the complexity of real-life scenarios. There are too many variables in play, and every situation will need a tailored approach. Nonetheless, I truly believe that learning about common considerations and understanding the whole picture can help us all build better products.

I hope you enjoyed the article and, if you found it useful, consider sharing it with others who might appreciate it!

All the best,
