“There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.” — Eric Schmidt, Former CEO and Executive Chairman of Google
Whether or not this quote from 2010 is accurate, the world of data-driven businesses is a highly competitive one. Every group leader or product manager out there needs a framework in place for the data science pipeline in order to improve efficiency, overall revenue, and day-to-day operations; without one, they are set up for failure.
Enter workflow. Generally speaking, a workflow is a managed, repeatable sequence of operations that provides a service or processes information. Using it correctly can minimize room for error and increase overall efficiency.
But what happens if one or more of the operations is poorly configured? The company will waste time and lose money on each workflow cycle!
Albert Einstein is often quoted as saying that “insanity is doing the same thing over and over again and expecting different results”. There is therefore a need for a tool that helps create workflows easily and properly and, above all, enables users to monitor and maintain them.
Apache Airflow is a platform to “programmatically author, schedule and monitor workflows” that has become the de facto choice for managing data workflows. It helps in creating workflows, visualizing pipelines, and monitoring their progress.
What is Apache Airflow?
Airflow is an open-source tool for orchestrating complex computational workflows and data processing pipelines that was originally introduced by Airbnb.
This tool enables authoring workflows as Directed Acyclic Graphs (DAGs) of tasks. It consists of a scheduler that executes those tasks on an array of workers, allowing it to scale out as needed, and a user interface that makes it easy to visualize created pipelines, monitor their progress, and troubleshoot any issues should they arise.
Airflow configures pipelines as code written in Python, which makes them more maintainable and testable, allows dynamic pipeline creation, and lets you extend the library to fit your own needs.
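To give a sense of what “pipelines as code” means in practice, here is a minimal sketch of dynamic pipeline creation, assuming Airflow 2.x; the DAG id, table names, and commands are made up for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical list of tables; it could just as easily be read from a config file.
TABLES = ["users", "orders", "payments"]

with DAG(
    dag_id="dynamic_export",          # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Because the pipeline is ordinary Python, tasks can be generated in a loop.
    for table in TABLES:
        BashOperator(
            task_id=f"export_{table}",
            bash_command=f"echo 'exporting {table}'",
        )
```

Adding a new table to the list adds a new task to the pipeline on the next parse, with no other changes required.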
Airflow is not a data streaming solution. Tasks do not move data from one to the next (though they can communicate with each other!) and workflows are expected to be mostly static or slowly changing and to look similar from one run to the next.
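That communication happens through Airflow’s XCom mechanism, which is meant for small pieces of metadata rather than bulk data. A brief sketch, assuming Airflow 2.x; the DAG id, task ids, and values are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def count_rows(ti):
    # Push a small piece of metadata for downstream tasks (the value is made up).
    ti.xcom_push(key="row_count", value=42)


def report(ti):
    # Pull the value pushed by the upstream task.
    rows = ti.xcom_pull(task_ids="count_rows", key="row_count")
    print(f"Upstream task reported {rows} rows")


with DAG(
    dag_id="xcom_example",            # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    count = PythonOperator(task_id="count_rows", python_callable=count_rows)
    summarize = PythonOperator(task_id="report", python_callable=report)

    count >> summarize
```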
Example use cases for Airflow include ETL pipelines, machine learning workflows, periodic report generation, and database backups.
Core Concepts
Short for directed acyclic graph, a DAG is an organized collection of all the tasks one wants to run, reflecting their relationships and dependencies. It is defined in a Python script, which represents its structure as code and is placed in Airflow’s DAG_FOLDER.
Airflow will execute the code in each script to dynamically build the DAG objects. There is no limit to the number of DAGs, and each can describe an arbitrary number of tasks. In general, each one should correspond to a single logical workflow. For example:
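A minimal sketch of what such a DAG might look like, assuming Airflow 2.x; the DAG id, task ids, and bash commands below are placeholders for a small ETL-style workflow:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A single logical workflow: extract -> transform -> load.
with DAG(
    dag_id="example_etl",             # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
    transform = BashOperator(task_id="transform", bash_command="echo 'transforming data'")
    load = BashOperator(task_id="load", bash_command="echo 'loading data'")

    # The >> operator declares dependencies: extract runs first, load runs last.
    extract >> transform >> load
```

Dropping this script into the DAG_FOLDER is enough for Airflow to pick it up, schedule it daily, and show it in the user interface.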