Friday, December 30, 2022

Setting Up Apache Airflow with Docker-Compose in 5 Minutes

 

Create a development environment and start building DAGs

Although I am pretty late to the party (Airflow became an Apache Top-Level Project in 2019), I still had trouble finding an easy-to-understand, up-to-date, and lightweight way to install Airflow.

Today, we’re about to change all that.

In the following sections, we will create a lightweight, standalone, and easily deployed Apache Airflow development environment in just a few minutes.

Docker-Compose will be our close companion, allowing us to create a smooth development workflow with quick iteration cycles. We simply spin up a few Docker containers and can start creating our own workflows.

Note: The following setup will not be suitable for any production purposes and is intended to be used in a development environment only.

Why Airflow?

Apache Airflow is a batch-oriented framework that allows us to easily build scheduled data pipelines in Python. Think of “workflow as code” capable of executing any operation we can implement in Python.

Airflow is not a data processing tool itself. It is orchestration software. We can imagine Airflow as a spider in a web, sitting in the middle, pulling all the strings, and coordinating the workload of our data pipelines.

A data pipeline typically consists of several tasks or actions that need to be executed in a specific order. Apache Airflow models such a pipeline as a DAG (directed acyclic graph): a graph whose nodes are the tasks and whose directed edges are the dependencies between them, without any loops or cycles.
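To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library (graphlib, available from Python 3.9), not Airflow itself. The task names are hypothetical, and each task maps to the set of tasks it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}

# A valid DAG can always be put into an order that respects every dependency.
execution_order = list(TopologicalSorter(pipeline).static_order())
print(execution_order)  # ['extract', 'clean', 'aggregate', 'report']


def is_acyclic(graph):
    """Return True if the dependency graph contains no cycle."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return False
    return True


# Making "extract" depend on "report" closes a loop, so it is no longer a DAG.
cyclic = dict(pipeline, extract={"report"})
print(is_acyclic(pipeline), is_acyclic(cyclic))  # True False
```

The "acyclic" part is what guarantees that such an execution order exists at all: once a cycle is introduced, no task in the loop could ever start first.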

This approach allows us to run independent tasks in parallel, saving time and money. Moreover, we can split a data pipeline into several smaller tasks. If a task fails, we can rerun only the failed task and its downstream tasks, instead of executing the complete workflow all over again.
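Both benefits can be illustrated with a small standard-library sketch. It is not how Airflow implements scheduling internally; the pipeline and the helper names parallel_batches and tasks_to_rerun are hypothetical and only show the underlying graph logic.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "fetch_sales": set(),
    "fetch_weather": set(),
    "join": {"fetch_sales", "fetch_weather"},
    "train": {"join"},
    "report": {"join"},
}


def parallel_batches(graph):
    """Group tasks into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)
    return batches


def tasks_to_rerun(graph, failed):
    """Return the failed task plus every task downstream of it."""
    rerun = {failed}
    changed = True
    while changed:
        changed = False
        for task, upstream in graph.items():
            if task not in rerun and upstream & rerun:
                rerun.add(task)
                changed = True
    return rerun


print(parallel_batches(pipeline))
print(sorted(tasks_to_rerun(pipeline, "join")))
```

In this example, the two fetch tasks land in the same batch and could run side by side. If join fails, only join, train, and report need to run again, while the results of both fetch tasks can be reused.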

Airflow is composed of three main components:

Step-By-Step Installation

Now that we have briefly introduced Apache Airflow, it’s time to get started.

Step 0: Prerequisites

Since we will use docker-compose to get Airflow up and running, we have to install Docker first. Simply head over to the official Docker site and download the appropriate installation file for your OS.

Step 1: Create a new folder

We start nice and slow by simply creating a new folder for Airflow.

Just navigate via your preferred terminal to a directory of your choice, create a new folder, and change into it, for example by running mkdir airflow && cd airflow.

Step 2: Create a docker-compose file

Next, we need to get our hands on a docker-compose file that specifies the required services or docker containers.

Via the terminal, we can run the following command inside the newly created Airflow folder

or simply create a new file named docker-compose.yml and copy the below content.

The above docker-compose file specifies the services we need to get Airflow up and running: most importantly the scheduler, the webserver, the metadata database (PostgreSQL), and the airflow-init job that initializes the database.

At the top of the file, we define some local variables that are shared by every Docker container or service.

Step 3: Environment variables

We successfully created a docker-compose file with the mandatory services inside. However, to complete the installation process and configure Airflow properly, we need to provide some environment variables.

Still inside your Airflow folder, create a .env file with the following content:

The above variables set the database credentials, the airflow user, and some further configurations.

Most importantly, they set the kind of executor Airflow will use. In our case, we make use of the LocalExecutor.
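Airflow reads configuration overrides from environment variables named AIRFLOW__{SECTION}__{KEY}, with double underscores as separators. The sketch below shows how such names are built and rendered as .env lines. The helper names are hypothetical, and the two settings are illustrative values only, not the full content of the .env file used in this setup.

```python
def airflow_env_name(section, key):
    """Build the environment variable name for an Airflow config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def render_env(settings):
    """Render {(section, key): value} as the lines of a .env file."""
    return "\n".join(
        f"{airflow_env_name(section, key)}={value}"
        for (section, key), value in settings.items()
    )


# Illustrative values only (assumptions, not the original file's content).
example_settings = {
    ("core", "executor"): "LocalExecutor",
    ("core", "load_examples"): "False",
}

print(render_env(example_settings))
# AIRFLOW__CORE__EXECUTOR=LocalExecutor
# AIRFLOW__CORE__LOAD_EXAMPLES=False
```

Reading a name backwards works the same way: AIRFLOW__CORE__EXECUTOR overrides the executor option in the [core] section of the Airflow configuration.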

Note: More information on the different kinds of executors can be found here.

Step 4: Run docker-compose

And this is already it!

Just head over to the terminal and spin up all the necessary containers by running docker compose up -d.

After a short period of time, we can check the results and the Airflow Web UI by visiting http://localhost:8080. Once we sign in with our credentials (username airflow, password airflow), we gain access to the user interface.
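Instead of refreshing the browser, we can also poll the webserver until it is ready. The following is a minimal sketch, assuming an Airflow 2.x webserver that exposes its /health endpoint on port 8080. The helper names, the number of attempts, and the delay are arbitrary choices for illustration.

```python
import json
import time
import urllib.error
import urllib.request

# Components that must be healthy in this setup; optional ones are ignored.
REQUIRED_COMPONENTS = ("metadatabase", "scheduler")


def is_healthy(payload):
    """Return True if all required components report the status 'healthy'."""
    return all(
        payload.get(name, {}).get("status") == "healthy"
        for name in REQUIRED_COMPONENTS
    )


def wait_for_airflow(url="http://localhost:8080/health", attempts=30, delay=5.0):
    """Poll the health endpoint until it reports healthy or attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if is_healthy(json.load(response)):
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError, ValueError):
            pass  # webserver not reachable yet, or response is not valid JSON
        time.sleep(delay)
    return False
```

Calling wait_for_airflow() after starting the containers returns True as soon as both the metadata database and the scheduler report a healthy status, and False if that does not happen within the given number of attempts.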

A Quick Test

With a working Airflow environment, we can now create a simple DAG for testing purposes.

First of all, make sure to run pip install apache-airflow to install the required Python modules.

Now, inside your Airflow folder, navigate to dags and create a new file called sample_dag.py.

We define a new DAG and some pretty simple tasks.

The EmptyOperator serves no real purpose other than to create a mockup task inside the Web UI. By utilizing the BashOperator, we create a somewhat creative output of “HelloWorld!”. This allows us to visually confirm that our Airflow setup is running properly.

Save the file and head over to the Web UI. We can now start the DAG by manually triggering it.

Note: It may take a while before your DAG appears in the UI. We can speed things up by running the following command in our terminal: docker exec -it --user airflow airflow-scheduler bash -c "airflow dags list"

Running the DAG shouldn’t take any longer than a couple of seconds.

Once finished, we can navigate to XComs and inspect the output.

And this is it!

We successfully installed Airflow with docker-compose and gave it a quick test ride.

Note: We can stop the running containers by simply executing docker compose down.

Conclusion

Airflow is a batch-oriented framework that allows us to create complex data pipelines in Python.

In this article, we created a simple and easy-to-use environment to quickly iterate and develop new workflows in Apache Airflow. By leveraging docker-compose we can get straight to work and code new workflows.

However, such an environment should only be used for development purposes and is not suitable for any production environment that requires a more sophisticated and distributed setup of Apache Airflow.

You can find the full code here on my GitHub.
