What is Airflow?
Apache Airflow is an open-source, server-side workflow management platform for data engineering pipelines. Started at Airbnb, Airflow is written in Python, and its workflows are defined as Python scripts, following the configuration-as-code principle. Airflow uses Directed Acyclic Graphs (DAGs) to manage workflow orchestration: each task is a node, and the edges state which tasks must finish before another can start.
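To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library (graphlib, Python 3.9+), not the Airflow API. The task names are hypothetical; the point is only that a DAG's dependencies determine a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(deps):
    """Return one valid execution order for the tasks in a DAG."""
    return list(TopologicalSorter(deps).static_order())


print(run_order(dependencies))
```

If the dependencies contained a cycle, TopologicalSorter would raise graphlib.CycleError, which is why the graph has to be acyclic.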
Steps to Install Apache Airflow
This tutorial is divided into 3 parts: installing VirtualBox, setting up the virtual machine, and installing Airflow.
Part 1: Download and Install VirtualBox
If you already have VirtualBox installed, skip to Part 2
Go to the VirtualBox downloads page and download the package for your operating system: https://www.virtualbox.org/wiki/Downloads
Double-click the downloaded package and follow the instructions to install
Once installed, open VirtualBox; you should see the VirtualBox Manager window
Part 2: Setting up the Virtual Machine
Download the VM file, AirflowVM.ova, from here
The file is about 2.3 GB and may take some time to download
Double-click the downloaded file to import it into VirtualBox
If you get an error related to ‘medium’ after importing, repeat the import with the “Import hard drives as VDI” option unchecked
Select the imported VM, click “Start” and wait until it has finished booting
If you get an error when starting the VM on macOS, follow these steps.
Open System Preferences > Security & Privacy (General tab)
Make sure the “App Store and identified developers” option is selected in the “Allow apps downloaded from” section. Do not forget to restart the laptop. The error should then go away.
Part 3: Install Apache Airflow
Download and install the VS Code editor. From the Extensions tab, install the Remote - SSH extension and use it to connect to the VM.
Open the Terminal in VS Code and create a Python virtual environment. This keeps Airflow's packages isolated from the system Python packages and avoids version conflicts between them.
airflow@airflowvm: python3 -m venv sandbox
This creates a Python virtual environment called sandbox
airflow@airflowvm: source sandbox/bin/activate
This activates the sandbox; from now on the prompt starts with (sandbox)
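If you want to confirm that the sandbox is really active before installing anything, a small check like the following can help. It is a sketch that relies on the standard-library convention that sys.prefix differs from sys.base_prefix inside an environment created with venv; the function name in_virtualenv is only illustrative.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-created environment."""
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

Run inside the activated sandbox, the prefix should point to the sandbox directory rather than the system Python.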
(sandbox) airflow@airflowvm: pip install wheel
Installs the wheel package
Now you can install Apache Airflow (we are installing version 2.1.0)
(sandbox) airflow@airflowvm:~$ pip install apache-airflow==2.1.0 --constraint https://gist.githubusercontent.com/marclamberti/742efaef5b2d94f44666b0aec020be7c/raw/21c88601337250b6fd93f1adceb55282fb07b7ed/constraint.txt
Note that the --constraint option is important. It is followed by the location of the constraint file, which pins the versions of Airflow's Python dependencies. Without it, pip installs the newest release of each dependency, and a newly released version may not work with this Airflow version.
If everything goes well, you are returned to the (sandbox) airflow@airflowvm: command prompt
This step may take about 10 minutes.
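To illustrate what the constraint file does, the sketch below parses lines of the form name==version into a dictionary of pinned versions. The package names and versions in the sample are hypothetical and are not taken from the real constraint file; parse_constraints is an illustrative helper, not part of pip or Airflow.

```python
def parse_constraints(text):
    """Map package name -> pinned version for 'name==version' lines."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


# Hypothetical excerpt; the real file lists many more packages.
sample = """
# illustrative pins only
example-package==1.2.3
another-lib==0.9.0
"""

print(parse_constraints(sample))
```

When pip is given such a file with --constraint, any listed package that gets installed is held to the pinned version, which is what keeps the installation reproducible.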
(sandbox) airflow@airflowvm: airflow db init
This initialises Airflow's metadata database and creates some files that Airflow needs to run. You should see an “Initialization done” message if everything goes well.
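A quick way to see what the initialisation created is to look inside the Airflow home directory. The sketch below assumes the default setup: the home directory is ~/airflow unless the AIRFLOW_HOME environment variable is set, and the default SQLite backend stores its data in airflow.db next to airflow.cfg. If your configuration differs, adjust the file names accordingly; the helper names are illustrative.

```python
import os
from pathlib import Path


def airflow_home():
    """Directory Airflow uses for its files: $AIRFLOW_HOME, else ~/airflow."""
    return Path(os.environ.get("AIRFLOW_HOME", os.path.expanduser("~/airflow")))


def check_init_files(home, names=("airflow.cfg", "airflow.db")):
    """Report which of the expected files exist under `home`."""
    return {name: (Path(home) / name).is_file() for name in names}


for name, present in check_init_files(airflow_home()).items():
    print(f"{name}: {'found' if present else 'missing'}")
```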
Now start the Airflow web interface with:
(sandbox) airflow@airflowvm: airflow webserver
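Once the webserver is running, you can check from a second terminal that it accepts connections. This sketch assumes the webserver listens on its default port 8080 and that the check runs inside the VM; from the host machine you would need whatever host and port your VM's network setup exposes. The function name is_listening is illustrative.

```python
import socket


def is_listening(host="localhost", port=8080, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("webserver reachable:", is_listening())
```

This only confirms that something is listening on the port; opening the address in a browser is still the simplest way to confirm that it is the Airflow UI.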
At this stage, you have successfully installed Airflow and are ready to create DAG workflows.