Create a development environment and start building DAGs
Although I am pretty late to the party (Airflow became an Apache Top-Level Project in 2019), I still had trouble finding an easy-to-understand, up-to-date, and lightweight way to install Airflow.
Today, we’re about to change all that.
In the following sections, we will create a lightweight, standalone, and easily deployed Apache Airflow development environment in just a few minutes.
Docker Compose will be our close companion, allowing us to create a smooth development workflow with quick iteration cycles. We simply spin up a few Docker containers and can start creating our own workflows.
Note: The following setup will not be suitable for any production purposes and is intended to be used in a development environment only.
Why Airflow?
Apache Airflow is a batch-oriented framework that allows us to easily build scheduled data pipelines in Python. Think of “workflow as code” capable of executing any operation we can implement in Python.
Airflow is not a data processing tool itself. It is orchestration software. We can imagine Airflow as a spider in a web: sitting in the middle, pulling all the strings, and coordinating the workload of our data pipelines.
A data pipeline typically consists of several tasks or actions that need to be executed in a specific order. Apache Airflow models such a pipeline as a DAG (directed acyclic graph), that is, a graph whose nodes are tasks and whose directed edges are the dependencies between them, without any loops or cycles.
This approach allows us to run independent tasks in parallel, saving time and money. Moreover, we can split a data pipeline into several smaller tasks. If a task fails, we can rerun only the failed task and its downstream tasks instead of executing the complete workflow all over again.
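To make the two benefits above concrete, here is a minimal sketch in plain Python (no Airflow required). The pipeline and its task names are hypothetical, and the two helper functions are illustrations of the idea, not Airflow's actual scheduling code. The first groups tasks into batches that could run in parallel; the second finds which tasks need to rerun after a failure.

```python
from collections import deque

# Hypothetical pipeline: each task maps to the tasks that depend on it.
DOWNSTREAM = {
    "extract_orders": ["join"],
    "extract_customers": ["join"],
    "join": ["report"],
    "report": [],
}


def parallel_batches(downstream):
    """Group tasks into batches; tasks in one batch have no unfinished upstream tasks."""
    # Count how many upstream tasks each task is still waiting for.
    pending = {task: 0 for task in downstream}
    for children in downstream.values():
        for child in children:
            pending[child] += 1

    batch = sorted(task for task, count in pending.items() if count == 0)
    batches = []
    while batch:
        batches.append(batch)
        next_batch = []
        for task in batch:
            for child in downstream[task]:
                pending[child] -= 1
                if pending[child] == 0:
                    next_batch.append(child)
        batch = sorted(next_batch)
    return batches


def tasks_to_rerun(downstream, failed):
    """Return the failed task plus every task reachable downstream of it."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        for child in downstream[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

In this toy pipeline, the two extract tasks land in the same batch because neither depends on the other, and a failure in `join` would require rerunning only `join` and `report`.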
Airflow is composed of three main components:
- Airflow Scheduler: the “heart” of Airflow, which parses the DAGs, checks their schedule intervals, and hands the tasks over to the workers.
- Airflow Worker: picks up the tasks and actually performs the work.
- Airflow Webserver: provides the main user interface to visualize and monitor the DAGs and their results.
Step-By-Step Installation
Now that we have briefly introduced Apache Airflow, it’s time to get started.
Step 0: Prerequisites
Since we will use docker-compose to get Airflow up and running, we have to install Docker first. Simply head over to the official Docker site and download the appropriate installation file for your OS.
Step 1: Create a new folder
We start nice and slow by simply creating a new folder for Airflow.
Just navigate via your preferred terminal to a directory, create a new folder, and change into it by running:
mkdir airflow
cd airflow
Step 2: Create a docker-compose file
Next, we need to get our hands on a docker-compose file that specifies the required services or docker containers.
Via the terminal, we can run the following command inside the newly created Airflow folder
curl https://raw.githubusercontent.com/marvinlanhenke/Airflow/main/01GettingStarted/docker-compose.yml -o docker-compose.yml
or simply create a new file named docker-compose.yml and copy the contents of the file at that URL into it.
The above docker-compose file specifies the services we need to get Airflow up and running. The most important ones are the scheduler, the webserver, the metadatabase (PostgreSQL), and the airflow-init job that initializes the database.
At the top of the file, we define some shared variables that are used by every Docker container or service.
Step 3: Environment variables
We successfully created a docker-compose file with the mandatory services inside. However, to complete the installation process and configure Airflow properly, we need to provide some environment variables.
Still inside your Airflow folder, create a .env file with the following content:
The above variables set the database credentials, the airflow user, and some further configurations.
Most importantly, they set the kind of executor Airflow will use. In our case, we make use of the LocalExecutor.
Note: More information on the different kinds of executors can be found here.
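As a rough illustration of how such settings reach Airflow: Airflow reads configuration overrides from environment variables named AIRFLOW__{SECTION}__{KEY}. The sketch below builds that name and reads it from a .env-style text. The sample content and both helper functions are hypothetical and are not the file used in this article; the parser is also simplified and ignores quoting rules that Docker Compose supports.

```python
def airflow_env_name(section: str, key: str) -> str:
    """Build the environment variable name that overrides `key` in `[section]` of airflow.cfg."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Hypothetical .env content, for illustration only.
SAMPLE_ENV = """
# executor used to run tasks
AIRFLOW__CORE__EXECUTOR=LocalExecutor
"""

settings = parse_env_file(SAMPLE_ENV)
executor = settings[airflow_env_name("core", "executor")]
```

Here `airflow_env_name("core", "executor")` yields the variable name that selects the executor, which is why a single line in the .env file is enough to switch it.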
Step 4: Run docker-compose
And that is already it!
Just head over to the terminal and spin up all the necessary containers by running
docker compose up -d
After a short while, we can check the results and open the Airflow Web UI by visiting http://localhost:8080. Once we sign in with our credentials (username airflow, password airflow), we gain access to the user interface.
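Instead of refreshing the browser until the UI loads, we could poll the webserver from a small script. The sketch below assumes the Airflow 2.x webserver, which exposes a /health endpoint returning JSON with a status per component; the path and payload may differ in other versions, and the function names, timeout, and interval are illustrative choices.

```python
import json
import time
import urllib.request


def is_healthy(payload: dict) -> bool:
    """True when both the metadata database and the scheduler report 'healthy'."""
    return all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


def wait_for_airflow(url="http://localhost:8080/health", timeout=120, interval=5):
    """Poll the health endpoint until it reports healthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if is_healthy(json.load(response)):
                    return True
        except (OSError, ValueError):
            pass  # containers still starting, or the response is not JSON yet
        time.sleep(interval)
    return False
```

Calling `wait_for_airflow()` after `docker compose up -d` would return True once the scheduler and database are ready, or False if they are not ready within the timeout.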
A Quick Test
With a working Airflow environment, we can now create a simple DAG for testing purposes.
First of all, make sure to run pip install apache-airflow to install the required Python modules.
Now, inside your Airflow folder, navigate to dags and create a new file called sample_dag.py.
We define a new DAG and some pretty simple tasks.
The EmptyOperator serves no real purpose other than to create a mockup task inside the Web UI. By utilizing the BashOperator, we create a somewhat creative output of “HelloWorld!”. This allows us to visually confirm that our Airflow setup is running properly.
Save the file and head over to the Web UI. We can now start the DAG by manually triggering it.
Note: It may take a while before your DAG appears in the UI. We can speed things up by running the following command in our terminal
docker exec -it --user airflow airflow-scheduler bash -c "airflow dags list"
Running the DAG shouldn’t take any longer than a couple of seconds.
Once finished, we can navigate to XComs and inspect the output.
And this is it!
We successfully installed Airflow with docker-compose and gave it a quick test ride.
Note: We can stop the running containers by simply executing docker compose down.
Conclusion
Airflow is a batch-oriented framework that allows us to create complex data pipelines in Python.
In this article, we created a simple and easy-to-use environment to quickly iterate and develop new workflows in Apache Airflow. By leveraging docker-compose we can get straight to work and code new workflows.
However, such an environment should only be used for development purposes and is not suitable for any production environment that requires a more sophisticated and distributed setup of Apache Airflow.
You can find the full code here on my GitHub.