Apache Airflow is an open-source orchestrator that lets you create, manage, and schedule data pipelines - all in Python. It's a must-know technology for data engineers, and for data scientists too, since most data science jobs, at least at smaller companies, involve some degree of data engineering. If you have a decent number of pipelines to manage, look no further than Airflow.
Today you'll learn how to install Apache Airflow on your PC and how to configure the environment. It doesn't matter what OS you're using, as long as you have Python installed and know how to create virtual environments.
For a reference point, I'm writing this guide on macOS 12 with Miniforge installed.
Create a Python Virtual Environment for Apache Airflow
First things first, we'll create a new Python virtual environment. I'll call it airflow_env and base it on Python 3.9, but you're welcome to change both:
conda create --name airflow_env python=3.9 -y
Activate the environment using the following command:
conda activate airflow_env
Once the environment is active, your Terminal prompt should start with (airflow_env).
The virtual environment creation process may be different if you're not using Miniforge, but I assume you can manage that.
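If you're not using conda, Python's built-in venv module is one alternative. The sketch below is not the setup used in this guide, just an illustration: create_env is a hypothetical helper name and the target folder is up to you. Keep in mind that the new environment inherits the Python version of the interpreter that runs the script, so run it with Python 3.9 if you want to match the rest of the article.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment in env_dir and return its path."""
    env_path = Path(env_dir).expanduser()
    # with_pip=True bootstraps pip into the new environment
    venv.create(env_path, with_pip=with_pip)
    return env_path


# Example usage (the folder name is only an illustration):
# path = create_env("~/airflow_env")
# Then activate it on macOS/Linux with: source ~/airflow_env/bin/activate
```

The call is left commented out so that nothing is created on disk until you pick a location yourself.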
Let's see how to install Airflow next.
Install Apache Airflow
At the time of writing, the latest Airflow version is 2.2.3, and that's the version we'll install. The installation command depends on both the Airflow and Python versions, as we have to specify a path to the constraints file.
I've created an environment based on Python 3.9, so the constraints file path looks like this:
https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-no-providers-3.9.txt
If your Airflow or Python versions are different, make sure to change the 2.2.3 and/or 3.9 in the URL accordingly. Open it in your browser and make sure you're not getting any Not found errors.
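The URL pattern is simple enough to generate. Here's a minimal sketch that builds the constraints URL and the matching pip command from the two version numbers. The function names constraints_url and pip_install_command are my own, and the sketch assumes the URL layout shown above holds for the versions you pass in, which you should still verify in the browser.

```python
def constraints_url(airflow_version, python_version):
    """Build the constraints file URL for an Airflow/Python version pair."""
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/"
        f"constraints-no-providers-{python_version}.txt"
    )


def pip_install_command(airflow_version, python_version):
    """Return the pip install command that pins Airflow to the constraints."""
    url = constraints_url(airflow_version, python_version)
    return f'pip install "apache-airflow=={airflow_version}" --constraint "{url}"'


# Versions used in this guide
print(pip_install_command("2.2.3", "3.9"))
```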
Assuming everything works, copy the following command to install Apache Airflow:
pip install "apache-airflow==2.2.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-no-providers-3.9.txt"
It will take some time to finish, as there are a ton of dependencies.
Finished! The installation until this point looks exactly the same as when installing any other Python package. That's about to change. In the following section, you'll see how to set up the Airflow database and user.
Set Up Airflow Database and User
Once you have Airflow installed, initialize the database with the following Terminal command:
airflow db init
It will create the airflow folder in your home directory, so navigate to it:
cd ~/airflow
ls
The listing includes airflow.db and airflow.cfg. The airflow.db file is the Metastore Airflow uses, and you'll see how to access it at the end of the article. You'll also see how to edit airflow.cfg, and why you should do it.
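If you'd like to inspect the folder from Python instead of the shell, a small sketch like the one below will do. It assumes the default location of ~/airflow unless the AIRFLOW_HOME environment variable points somewhere else; the helper names airflow_home and list_airflow_home are hypothetical.

```python
import os
from pathlib import Path


def airflow_home():
    """Return the Airflow home folder: $AIRFLOW_HOME if set, else ~/airflow."""
    return Path(os.environ.get("AIRFLOW_HOME", "~/airflow")).expanduser()


def list_airflow_home():
    """Return the sorted entry names found in the Airflow home folder."""
    home = airflow_home()
    if not home.is_dir():
        return []  # the folder doesn't exist yet, e.g. before `airflow db init`
    return sorted(entry.name for entry in home.iterdir())


print(airflow_home())
print(list_airflow_home())
```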
But first, let's create an Airflow user:
airflow users create \
--username admin \
--password admin \
--firstname <FirstName> \
--lastname <LastName> \
--role Admin \
--email <YourEmail>
User creation will take a couple of seconds. Once done, you should see a message confirming that the user with the Admin role was successfully created.
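If you end up creating users from a script, it can be handy to assemble the same command as an argument list. The sketch below only builds the arguments shown above; build_create_user_command is a hypothetical helper, not part of Airflow. Leaving the password out keeps it out of your script, and as far as I know the CLI then asks for it interactively, but treat that as an assumption and check it against your Airflow version.

```python
import shlex


def build_create_user_command(username, firstname, lastname, email,
                              role="Admin", password=None):
    """Return the `airflow users create` call as a list of arguments."""
    args = [
        "airflow", "users", "create",
        "--username", username,
        "--firstname", firstname,
        "--lastname", lastname,
        "--role", role,
        "--email", email,
    ]
    if password is not None:
        # Only pass the password explicitly when the caller provides one
        args += ["--password", password]
    return args


# Placeholder values for illustration only
cmd = build_create_user_command("admin", "Jane", "Doe", "jane@example.com")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```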
And that's about all for the basic configuration. Let's see how you can run Airflow next.
Start Airflow Webserver and Scheduler
Apache Airflow consists of two core parts - Webserver and Scheduler. You'll have to run both to inspect and run your DAGs.
First, start the Webserver in daemon mode (as a background process):
airflow webserver -D
Once it's running, use a similar command to run the Scheduler:
airflow scheduler -D
The Airflow Webserver runs on port 8080 by default, so open the following URL in your browser:
http://localhost:8080
And voilà - you'll see the Sign in window. Use the credentials specified when creating the Airflow user (admin/admin) and hit the Sign in button.
We're in. By default, Airflow lists 30 example DAGs you can use as a reference point when writing your own. I don't like seeing them, so I'll show you how to get rid of them in the following section.
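If the Sign in page doesn't load, it helps to check whether anything is listening on port 8080 at all. Here's a minimal sketch using only the standard library; is_port_open is a hypothetical helper, and it only tells you that some process accepts TCP connections on the port, not that the process is Airflow.

```python
import socket


def is_port_open(host="localhost", port=8080, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


print("Webserver reachable:", is_port_open())
```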
Bonus: How to Remove Airflow Example DAGs
You could delete the DAGs one by one, but there's a better approach. Open the ~/airflow/airflow.cfg file and change the load_examples value to False.
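If you'd rather script the edit than open the file by hand, the sketch below flips the value with a line-based replacement, which leaves the comments and the rest of the file untouched. The set_load_examples helper is a hypothetical name, and the sketch assumes the setting appears on its own line in the form load_examples = True. Airflow can also read the same setting from the AIRFLOW__CORE__LOAD_EXAMPLES environment variable, in case you don't want to touch the file at all.

```python
import re


def set_load_examples(cfg_text, enabled):
    """Return cfg_text with the first load_examples line set to enabled."""
    pattern = re.compile(r"^([ \t]*load_examples[ \t]*=[ \t]*).*$", re.MULTILINE)
    # Keep the original "load_examples = " prefix and swap only the value
    new_text, count = pattern.subn(
        lambda match: match.group(1) + str(enabled), cfg_text, count=1
    )
    if count == 0:
        raise ValueError("load_examples setting not found")
    return new_text


# Example usage against the real file (commented out so nothing is modified):
# from pathlib import Path
# cfg_path = Path("~/airflow/airflow.cfg").expanduser()
# cfg_path.write_text(set_load_examples(cfg_path.read_text(), False))
```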
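Save the file and reopen the terminal. You'll have to reset the Airflow database. Keep in mind that a reset wipes all metadata, including the user created earlier, so you'll need to run the airflow users create command again afterward: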
airflow db reset
Once done, start the Airflow webserver and scheduler once again:
airflow webserver -D
airflow scheduler -D
Hint: If Terminal tells you Airflow is already running, get the process ID of the task running on port 8080 (lsof -i tcp:8080) and then use kill <pid> to terminate the process. Now you should be able to run both Webserver and Scheduler.
The homepage should now be free of the example DAGs. Neat and clean - just what we want for the upcoming articles. Before wrapping up this one, I also want to show you what's inside Airflow's database.
Bonus: What's Inside the Airflow SQLite Database?
Our Airflow instance is using SQLite for the database. It's not recommended for production environments, but will serve us fine for local development. Establish a connection to ~/airflow/airflow.db through some DBMS (I'm using TablePlus) and run the following SQL query:
SELECT * FROM sqlite_master
WHERE type = 'table';
The result lists all the tables Airflow created behind the scenes.
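If you don't have a DBMS client at hand, the same query runs fine from Python's built-in sqlite3 module. Here's a minimal sketch, assuming the database sits at the default ~/airflow/airflow.db path; list_tables is a hypothetical helper name.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite database, sorted by name."""
    query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    # closing() makes sure the connection is closed when we're done
    with closing(sqlite3.connect(str(db_path))) as conn:
        return [row[0] for row in conn.execute(query)]


db_path = Path("~/airflow/airflow.db").expanduser()
if db_path.exists():  # connecting to a missing path would create an empty file
    for table_name in list_tables(db_path):
        print(table_name)
```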