Friday, December 30, 2022

How to Install Airflow Locally

Apache Airflow is an open-source orchestrator that lets you create, manage, and schedule data pipelines, all in Python. It's a must-know technology for data engineers, and for data scientists too. Most data science jobs, at least at smaller companies, involve some degree of data engineering. If you have a fair number of pipelines to manage, look no further than Airflow.

Today you'll learn how to install Apache Airflow on your PC and how to configure the environment. It doesn't matter what OS you're using, as long as you have Python installed and know how to create virtual environments.

For a reference point, I'm writing this guide on macOS 12 with Miniforge installed.



Create a Python Virtual Environment for Apache Airflow

First things first, we'll create a new Python virtual environment. I'll call it airflow_env and base it on Python 3.9, but you're welcome to change both:

conda create --name airflow_env python=3.9 -y

Activate the environment using the following command:

conda activate airflow_env

Here's the output you should see in Terminal:

Image 1 - Creating and activating a Conda virtual environment for Airflow (image by author)

The virtual environment creation process may be different if you're not using Miniforge, but I assume you can manage that.
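If you'd like to confirm from Python that the right environment is active, here's a minimal sketch. It assumes Conda, which exports the `CONDA_DEFAULT_ENV` variable when an environment is activated. The helper name `check_environment` is made up for this example:

```python
import os
import sys


def check_environment(expected_env, expected_python, environ=None, version=None):
    """Return a list of problems found; an empty list means everything matches."""
    environ = os.environ if environ is None else environ
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []

    # Conda sets CONDA_DEFAULT_ENV to the name of the active environment.
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != expected_env:
        problems.append(f"active env is {active!r}, expected {expected_env!r}")

    # Compare only major.minor, since that's what the constraints file depends on.
    if version != tuple(expected_python):
        found = ".".join(map(str, version))
        wanted = ".".join(map(str, expected_python))
        problems.append(f"Python is {found}, expected {wanted}")

    return problems


if __name__ == "__main__":
    for problem in check_environment("airflow_env", (3, 9)):
        print("Warning:", problem)
```

Run it inside the activated environment. If nothing is printed, both the environment name and the Python version match what this guide uses.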

Let's see how to install Airflow next.


Install Apache Airflow

This guide installs Airflow 2.2.3. Newer releases may be available by the time you read this, so adjust the version if needed. The installation command depends on both the Airflow and Python versions, because we have to specify the path to a constraints file.

I've created an environment based on Python 3.9, so the constraints file path looks like this:

https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-no-providers-3.9.txt

If your Airflow or Python versions are different, make sure to change the 2.2.3 and/or 3.9 in the URL accordingly. Open it in your browser and make sure you're not getting any Not found errors.

Assuming everything works, copy the following command to install Apache Airflow:

pip install "apache-airflow==2.2.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-no-providers-3.9.txt"
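Since the URL follows a fixed pattern, you can also build it from the two version numbers instead of editing it by hand. The sketch below only reproduces the pattern shown above. The function names `constraints_url` and `pip_command` are my own, not part of Airflow:

```python
import sys

# Base of the constraints URL used in this guide.
BASE = "https://raw.githubusercontent.com/apache/airflow"


def constraints_url(airflow_version, python_version=None):
    """Build the no-providers constraints URL for the given versions."""
    if python_version is None:
        # Default to the major.minor version of the running interpreter.
        python_version = "{}.{}".format(*sys.version_info[:2])
    return (
        f"{BASE}/constraints-{airflow_version}/"
        f"constraints-no-providers-{python_version}.txt"
    )


def pip_command(airflow_version, python_version=None):
    """Return the pip install command as a list of arguments."""
    url = constraints_url(airflow_version, python_version)
    return ["pip", "install", f"apache-airflow=={airflow_version}", "--constraint", url]


if __name__ == "__main__":
    print(" ".join(pip_command("2.2.3", "3.9")))
```

It's still worth opening the generated URL in your browser, because the helper can't tell whether a constraints file exists for a given version combination.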

It will take some time to finish, as there are a ton of dependencies:

Image 2 - Installing Apache Airflow with all dependencies (image by author)

Finished! Up to this point, the installation looks the same as for any other Python package. That's about to change. In the following section, you'll see how to set up the Airflow database and user.


Set Up Airflow Database and User

Once you have Airflow installed, initialize the database with the following Terminal command:

airflow db init
Image 3 - Initializing the Airflow database (image by author)

It will create the airflow folder in your home directory, so navigate to it:

cd ~/airflow
ls

Here are the files:

Image 4 - Airflow root directory (image by author)

The airflow.db file is the metastore Airflow uses (a SQLite database by default), and you'll see how to access it at the end of the article. You'll also see how to edit airflow.cfg, and why you should do it.
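If you're curious what's inside the database right away, here's a rough sketch that lists its tables with Python's built-in sqlite3 module. It assumes the default SQLite file at ~/airflow/airflow.db, and the `list_tables` helper is something I made up for this example:

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite database file, sorted alphabetically."""
    # Open the file read-only so that inspecting it never modifies the database.
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    for table in list_tables("~/airflow/airflow.db"):
        print(table)
```

The read-only mode is a deliberate choice here: you can poke around safely even while Airflow is running.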

But first, let's create an Airflow user:

airflow users create \
    --username admin \
    --password admin \
    --firstname <FirstName> \
    --lastname <LastName> \
    --role Admin \
    --email <YourEmail>
Image 5 - Creating the Airflow user (image by author)

User creation will take a couple of seconds. Once done you should see that the user with the Admin role was successfully created:

Image 6 - Creating the Airflow user (2) (image by author)
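If you'd rather script this step, the same command can be assembled in Python and handed to subprocess. This is only a sketch that wraps the CLI flags shown above. The function names are my own, and the values used in the example call are placeholders:

```python
import subprocess


def build_create_user_command(username, password, firstname, lastname, email, role="Admin"):
    """Return the `airflow users create` command as a list of arguments."""
    return [
        "airflow", "users", "create",
        "--username", username,
        "--password", password,
        "--firstname", firstname,
        "--lastname", lastname,
        "--role", role,
        "--email", email,
    ]


def create_user(**details):
    """Run the command; requires the airflow CLI on PATH (active environment)."""
    # check=True raises CalledProcessError if the CLI exits with an error code.
    subprocess.run(build_create_user_command(**details), check=True)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    command = build_create_user_command(
        "admin", "admin", "Jane", "Doe", "jane@example.com"
    )
    print(" ".join(command))
```

Passing the arguments as a list means no shell is involved, so there's no line-continuation backslash to get wrong.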

And that's about all for the basic configuration. Let's see how you can run Airflow next.


Start Airflow Webserver and Scheduler

Apache Airflow consists of two core parts - Webserver and Scheduler. You'll have to run both to inspect and run your DAGs.

First, start the Webserver in daemon mode (as a background process):

airflow webserver -D
Image 7 - Starting Airflow webserver (image by author)

Once it's running, use a similar command to run the Scheduler:

airflow scheduler -D
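With both processes running in the background, you might want a quick way to check that they're alive. The sketch below queries the Webserver's /health endpoint. It assumes the Webserver listens on the default http://localhost:8080, which this guide hasn't changed, and the helper names are my own:

```python
import json
from urllib.request import urlopen


def summarize_health(payload):
    """Map each component in a /health response to its reported status."""
    return {name: info.get("status", "unknown") for name, info in payload.items()}


def fetch_health(base_url="http://localhost:8080", timeout=5):
    """Fetch and summarize the health report from a running Webserver."""
    # Assumption: default host and port; change base_url if yours differ.
    with urlopen(f"{base_url}/health", timeout=timeout) as response:
        return summarize_health(json.load(response))


if __name__ == "__main__":
    for component, status in fetch_health().items():
        print(f"{component}: {status}")
```

If the Webserver isn't up yet, `urlopen` raises an error, so give it a few seconds after starting before you run the check.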
