Apache Airflow is an open-source orchestrator that lets you create, manage, and schedule data pipelines - all in Python. It's a must-know technology for data engineers, and for data scientists too, since most data science jobs, at least at smaller companies, involve some degree of data engineering. If you have a decent number of pipelines to manage, look no further than Airflow.
Today you'll learn how to install Apache Airflow on your PC and how to configure the environment. It doesn't matter what OS you're using, as long as you have Python installed and know how to create virtual environments.
For a reference point, I'm writing this guide on macOS 12 with Miniforge installed.
Create a Python Virtual Environment for Apache Airflow
First things first, we'll create a new Python virtual environment. I'll call it airflow_env and base it on Python 3.9, but you're welcome to change both:
conda create --name airflow_env python=3.9 -y
Activate the environment using the following command:
conda activate airflow_env
Once the environment is active, its name should appear in your Terminal prompt, typically as (airflow_env).
The virtual environment creation process may be different if you're not using Miniforge, but I assume you can manage that.
Let's see how to install Airflow next.
Install Apache Airflow
At the time of writing, the latest Airflow version is 2.2.3, and that's the version we'll install. The installation command depends on both the Airflow and Python versions, as we have to specify a path to the constraints file.
I've created an environment based on Python 3.9, so the constraints file path looks like this:
https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-no-providers-3.9.txt
If your Airflow or Python versions are different, make sure to change the 2.2.3 and/or 3.9 in the URL accordingly. Open it in your browser and make sure you're not getting a Not Found error.
Assuming everything works, copy the following command to install Apache Airflow:
pip install "apache-airflow==2.2.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-no-providers-3.9.txt"
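The URL pattern above can be sketched in Python. This is only an illustration under stated assumptions, not something Airflow ships: the function names constraints_url and pip_install_command are hypothetical, and when no Python version is passed, the sketch assumes the interpreter running it is the one Airflow will use.

```python
import sys


def constraints_url(airflow_version, python_version=None):
    """Build the constraints file URL for the given Airflow and Python versions."""
    if python_version is None:
        # Assumption: the running interpreter is the one Airflow will be installed into.
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-no-providers-{python_version}.txt"
    )


def pip_install_command(airflow_version, python_version=None):
    """Return the pip command used in this guide as a single string."""
    url = constraints_url(airflow_version, python_version)
    return f'pip install "apache-airflow=={airflow_version}" --constraint "{url}"'


if __name__ == "__main__":
    # Prints the same command as above for Airflow 2.2.3 on Python 3.9.
    print(pip_install_command("2.2.3", "3.9"))
```

The sketch only builds the string. It's still worth opening the resulting URL in a browser, since not every Airflow and Python combination has a constraints file.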
It will take some time to finish, as there are a ton of dependencies.
Finished! Up to this point, the installation has looked exactly the same as for any other Python package. That's about to change. In the following section, you'll see how to set up the Airflow database and user.
Set Up Airflow Database and User
Once you have Airflow installed, initialize the database with the following Terminal command:
airflow db init
It will create the airflow folder in your home directory, so navigate to it:
cd ~/airflow
ls
The listing should include the airflow.db and airflow.cfg files, among others.
The airflow.db file is the Metastore Airflow uses, and you'll see how to access it at the end of the article. You'll also see how to edit airflow.cfg, and why you should do it.
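For a quick look at the Metastore in the meantime, here's a minimal sketch that lists its tables. It assumes the default setup, where airflow.db is a SQLite file located at ~/airflow/airflow.db. The helper name list_tables is hypothetical, and the file is opened read-only so nothing gets modified.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the sorted table names found in a SQLite database file."""
    # Open in read-only mode so inspecting the Metastore cannot change it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Assumption: the default location created by `airflow db init`.
    db_file = Path.home() / "airflow" / "airflow.db"
    if db_file.exists():
        for table in list_tables(db_file):
            print(table)
    else:
        print(f"{db_file} not found; run 'airflow db init' first")
```

If you've pointed Airflow at a different database, such as Postgres, this sketch doesn't apply.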
But first, let's create an Airflow user:
airflow users create \
--username admin \
--password admin \
--firstname <FirstName> \
--lastname <LastName> \
--role Admin \
--email <YourEmail>
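If you'd rather script this step, the same command can be assembled as an argument list in Python. Treat it as a sketch: the helper name users_create_args and the sample values (Jane, Doe, jane@example.com) are hypothetical placeholders, while the flags are the ones from the command above.

```python
import shlex


def users_create_args(username, password, firstname, lastname, email, role="Admin"):
    """Build the argument list for `airflow users create`."""
    return [
        "airflow", "users", "create",
        "--username", username,
        "--password", password,
        "--firstname", firstname,
        "--lastname", lastname,
        "--role", role,
        "--email", email,
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only; replace them with your own.
    args = users_create_args("admin", "admin", "Jane", "Doe", "jane@example.com")
    # Print the command instead of running it, so nothing changes by accident.
    print(shlex.join(args))
```

To actually create the user, you could pass the list to subprocess.run(args, check=True). Passing a list rather than a single string sidesteps shell quoting problems with names or passwords that contain spaces.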
User creation will take a couple of seconds. Once done, you should see a message confirming that the user with the Admin role was successfully created.
And that's about all for the basic configuration. Let's see how you can run Airflow next.
Start Airflow Webserver and Scheduler
Apache Airflow consists of two core parts - Webserver and Scheduler. You'll have to run both to inspect and run your DAGs.
First, start the Webserver in daemon mode (as a background process):
airflow webserver -D
Once it's running, use a similar command to run the Scheduler:
airflow scheduler -D
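Since both processes now run in the background, it helps to confirm they're actually up. The sketch below queries the Webserver's /health endpoint and summarizes the status of each component. It assumes the Webserver listens on the default port 8080 on localhost, and the function names summarize_health and fetch_health are hypothetical.

```python
import json
import urllib.request


def summarize_health(payload):
    """Map each component in a /health response to its reported status."""
    return {name: info.get("status", "unknown") for name, info in payload.items()}


def fetch_health(base_url="http://localhost:8080"):
    """Fetch and decode the JSON document served by the Webserver's /health endpoint."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        print(summarize_health(fetch_health()))
    except (OSError, ValueError) as exc:
        # OSError covers connection failures; ValueError covers a non-JSON reply.
        print(f"Webserver not reachable or returned unexpected data: {exc}")
```

If the Scheduler is reported as unhealthy, it may simply need a few more seconds to send its first heartbeat.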