Friday, March 18, 2022

Apache Airflow for Data Science - How to Install Airflow Locally

Apache Airflow is an open-source orchestrator that lets you create, manage, and schedule data pipelines - all in Python. It's a must-know technology for data engineers, and for data scientists too. Most data science jobs, at least at smaller companies, involve some degree of data engineering. If you have a decent number of pipelines to manage, look no further than Airflow.

Today you'll learn how to install Apache Airflow on your PC and how to configure the environment. It doesn't matter what OS you're using, as long as you have Python installed and know how to create virtual environments.

For a reference point, I'm writing this guide on macOS 12 with Miniforge installed.


Create a Python Virtual Environment for Apache Airflow

First things first, we'll create a new Python virtual environment. I'll call it airflow_env and base it on Python 3.9, but you're welcome to change both:

conda create --name airflow_env python=3.9 -y

Activate the environment using the following command:

conda activate airflow_env

Here's the output you should see in Terminal:

Image 1 - Creating and activating a Conda virtual environment for Airflow (image by author)

The virtual environment creation process may be different if you're not using Miniforge, but I assume you can manage that.
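
If you're not on Conda at all, here's a minimal sketch of the same step using only Python's built-in venv module. The function name create_airflow_env and the target folder are my own picks for illustration, and the environment will use whichever Python interpreter runs the script, not necessarily 3.9:

```python
import venv
from pathlib import Path


def create_airflow_env(env_dir: str, with_pip: bool = True) -> Path:
    """Create a virtual environment at env_dir with the standard library."""
    path = Path(env_dir).expanduser()
    # EnvBuilder does the same job as running `python -m venv <env_dir>`.
    builder = venv.EnvBuilder(with_pip=with_pip)
    builder.create(path)
    return path
```

Calling create_airflow_env("~/airflow_env") should give you a folder you can activate with source ~/airflow_env/bin/activate on macOS or Linux.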

Let's see how to install Airflow next.


Install Apache Airflow

At the time of writing, the latest Airflow version is 2.2.3, and that's the version we'll install. The installation command depends on both the Airflow and Python versions, as we have to specify a path to the constraints file.

I've created an environment based on Python 3.9, so the constraints file path looks like this:

https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-no-providers-3.9.txt

If your Airflow or Python versions are different, make sure to change the 2.2.3 and/or 3.9 in the URL accordingly. Open it in your browser and make sure you're not getting any Not found errors.
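
To avoid editing the URL by hand, you could build it from the two version numbers. This is just a sketch that follows the URL pattern shown above. The helper names constraints_url and pip_install_command are hypothetical, and the code doesn't check that the resulting URL actually exists:

```python
import sys
from typing import Optional

# Same pattern as the constraints URL above, with the versions left as placeholders.
CONSTRAINTS_TEMPLATE = (
    "https://raw.githubusercontent.com/apache/airflow/"
    "constraints-{airflow}/constraints-no-providers-{python}.txt"
)


def constraints_url(airflow_version: str, python_version: Optional[str] = None) -> str:
    """Return the constraints file URL for the given Airflow and Python versions."""
    if python_version is None:
        # Fall back to the major.minor version of the running interpreter.
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return CONSTRAINTS_TEMPLATE.format(airflow=airflow_version, python=python_version)


def pip_install_command(airflow_version: str, python_version: Optional[str] = None) -> str:
    """Return the matching pip install command as a string."""
    url = constraints_url(airflow_version, python_version)
    return f'pip install "apache-airflow=={airflow_version}" --constraint "{url}"'


print(pip_install_command("2.2.3", "3.9"))
```

You'd still want to open the generated URL in the browser before installing.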

Assuming everything works, copy the following command to install Apache Airflow:

pip install "apache-airflow==2.2.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-no-providers-3.9.txt"

It will take some time to finish, as there are a ton of dependencies:

Image 2 - Installing Apache Airflow with all dependencies (image by author)

Finished! Up to this point, the installation looks exactly like installing any other Python package. That's about to change. In the following section, you'll see how to set up the Airflow database and user.


Set Up Airflow Database and User

Once you have Airflow installed, initialize the database with the following Terminal command:

airflow db init
Image 3 - Initializing the Airflow database (image by author)

It will create the airflow folder in your home directory, so navigate to it:

cd ~/airflow
ls

Here are the files:

Image 4 - Airflow root directory (image by author)

The airflow.db file is the Metastore Airflow uses, and you'll see how to access it at the end of the article. You'll also see how to edit airflow.cfg, and why you should do it.

But first, let's create an Airflow user:

airflow users create \
    --username admin \
    --password admin \
    --firstname <FirstName> \
    --lastname <LastName> \
    --role Admin \
    --email <YourEmail>
Image 5 - Creating the Airflow user (image by author)

User creation will take a couple of seconds. Once done, you should see that the user with the Admin role was successfully created:

Image 6 - Creating the Airflow user (2) (image by author)

And that's about all for the basic configuration. Let's see how you can run Airflow next.


Start Airflow Webserver and Scheduler

Apache Airflow consists of two core parts - Webserver and Scheduler. You'll have to run both to inspect and run your DAGs.

First, start the Webserver in daemon mode (as a background process):

airflow webserver -D
Image 7 - Starting Airflow webserver (image by author)

Once it's running, use a similar command to run the Scheduler:

airflow scheduler -D
Image 8 - Starting Airflow scheduler (image by author)

Airflow runs on port 8080 by default, so open the following URL in your browser:

http://localhost:8080

And voilà - you'll see the Sign in window:

Image 9 - Airflow login page (image by author)

Use the credentials you specified when creating the Airflow user (admin/admin) and hit the Sign in button:

Image 10 - Airflow home page (image by author)

We're in. By default, Airflow lists 30 example DAGs you can use as a reference point when writing your own. I don't like seeing them, so I'll show you how to get rid of them in the following section.
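
If the page doesn't load, it helps to check whether both the Webserver and the Scheduler are actually up. Here's a rough sketch that assumes the Airflow 2 webserver exposes a /health endpoint returning JSON with one entry per component (metadatabase and scheduler), each carrying a status field. The function names are my own, and the payload shape is an assumption you should verify against your Airflow version:

```python
import json
from urllib.request import urlopen


def component_status(payload: dict) -> dict:
    """Map each component in a /health payload to its reported status."""
    return {name: info.get("status", "unknown") for name, info in payload.items()}


def fetch_health(base_url: str = "http://localhost:8080", timeout: float = 5.0) -> dict:
    """Query the webserver's /health endpoint and summarize the result."""
    # Raises URLError if nothing is listening on base_url.
    with urlopen(f"{base_url}/health", timeout=timeout) as response:
        return component_status(json.load(response))
```

Calling fetch_health() with both processes running should report a status for each component. A connection error most likely means the Webserver isn't up.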


Bonus: How to Remove Airflow Example DAGs

You could delete the DAGs one by one, but there's a better approach. Open the ~/airflow/airflow.cfg file and change the load_examples value to False:

Image 11 - Editing airflow.cfg file (image by author)
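
If you'd rather script the change than edit the file by hand, here's a minimal sketch. It assumes the file contains an uncommented line of the form load_examples = True, and the function name set_load_examples is hypothetical. It works on the file's text, so comments and the other settings stay untouched:

```python
import re

# Matches an uncommented "load_examples = <value>" line, keeping the key part in group 1.
LOAD_EXAMPLES_LINE = re.compile(r"^([ \t]*load_examples[ \t]*=[ \t]*).*$", flags=re.MULTILINE)


def set_load_examples(cfg_text: str, enabled: bool = False) -> str:
    """Return cfg_text with the load_examples value replaced."""
    new_text, count = LOAD_EXAMPLES_LINE.subn(
        lambda match: match.group(1) + str(enabled), cfg_text
    )
    if count == 0:
        raise ValueError("No load_examples setting found in the config text")
    return new_text
```

To apply it, read ~/airflow/airflow.cfg with pathlib's Path.read_text(), pass the text through set_load_examples(), and write it back with Path.write_text(). Keep a backup copy of the file first.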

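Save the file and reopen the terminal. You'll have to reset the Airflow database. Keep in mind that a reset wipes the existing metadata, so you'll need to recreate the admin user afterward: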

airflow db reset

Once done, start the Airflow webserver and scheduler once again:

airflow webserver -D
airflow scheduler -D

Hint: If Terminal tells you Airflow is already running, find the ID of the process listening on port 8080 (lsof -i tcp:8080) and terminate it with kill <pid>. Now you should be able to run both Webserver and Scheduler.
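
For a quick cross-platform check of whether something is still listening on port 8080, here's a small sketch using only the standard library. The function name port_in_use is my own. It only tells you that the port is taken, not which process holds it:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

If port_in_use(8080) returns True after you've killed the process, something else is probably bound to that port.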

Here's what the homepage looks like on my end:

Image 12 - Airflow home page with no example DAGs (image by author)

Neat and clean - just what we want for the upcoming articles. Before wrapping up this one, I also want to show you what's inside Airflow's database.


Bonus: What's Inside the Airflow SQLite Database?

Our Airflow instance is using SQLite for the database. It's not recommended for production environments, but will serve us fine for local development. Establish a connection to ~/airflow/airflow.db with a database client (I'm using TablePlus) and run the following SQL query:

SELECT * FROM sqlite_master
WHERE type = 'table';

These are all the tables Airflow created behind the scenes:

Image 13 - Contents of the Airflow SQLite database (image by author)
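
If you don't have a database client installed, the same query can be run from Python's built-in sqlite3 module. This is a minimal sketch, and the function name list_tables is hypothetical. It refuses to run when the file is missing, since sqlite3 would otherwise silently create an empty database:

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in a SQLite database, sorted by name."""
    path = Path(db_path).expanduser()
    if not path.exists():
        raise FileNotFoundError(f"No database file at {path}")
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

Calling list_tables("~/airflow/airflow.db") should return the same table names you see in the image above.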

Summary and Next Steps

Today you've successfully installed and configured Apache Airflow on your local machine. You've also learned how to turn off the example DAGs and how to access Airflow's database. To be fair, you could download a preconfigured VM or Docker image, but I see no downside in knowing how to set up Airflow from scratch.

In the following article, I'll show you how to write, execute, and schedule a basic DAG with Airflow and Python, so stay tuned.
