Airflow

Apache Airflow is an open-source Python framework for programmatically authoring, scheduling and monitoring workflows. It is primarily used for extract, transform and load (ETL) data pipelines.

Overview

  • Airflow uses the concept of Directed Acyclic Graphs (DAGs) to define a workflow. In Airflow, DAGs are defined in Python (see the sketch after this list).
  • Unlike a traditional ETL tool, Airflow does not do the intensive work of loading and transforming data itself. Instead, it is a workflow orchestration tool with scheduling that uses operators to execute arbitrary Python code, typically calling other services which do the heavy lifting of loading and transformation.
  • Google offers a managed Airflow service called Cloud Composer that integrates with other services in Google Cloud Platform. Astronomer also offers a managed Airflow service.
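
As a rough illustration, here is a minimal DAG sketch using the Airflow 2.x API. The DAG ID, task bodies and schedule are placeholders; a real pipeline would call out to external services:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("call an external service to extract data")


def load():
    print("call an external service to load data")


# Tasks and dependencies are plain Python: the >> operator declares
# that "extract" must complete before "load" runs.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```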

Advantages

  • Comes with a fully featured UI out of the box:
    • Monitoring - pipeline and task statuses are displayed in real-time on a visual representation of a DAG. This representation is also a useful development tool to validate task dependencies.
    • Centralised logging - logs for any task can easily be viewed.
    • Pipeline management - pipelines and tasks can be manually started, stopped or restarted for any run (including historic runs).
    • User management - users can be created with custom permissions.
  • A wide range of operators is available in the default installation, allowing you to interact with services such as AWS, Azure, Google Cloud and most databases.
  • It is an open platform, so you can write your own operators to do things that are not available out of the box (see the sketch after this list).
  • As DAGs are written as Python code, it is easy to integrate with source control and CI/CD pipelines.
  • Unit testing of custom operators is possible as it is standard Python code.
  • Can be horizontally scaled to run on multiple nodes via the Celery and Kubernetes executors.
  • Can compose pipelines or trigger a pipeline in response to an event via the 'sensors' feature.
  • Can 'backfill' by re-running all pipeline runs from a specified point in time.
  • Pipelines can be retried from the task that failed.
  • Can optionally feed an output from one task into the next via the XCom feature, as illustrated in the sketch after this list.
  • It is open source so there are no licensing costs.
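
To illustrate a few of these points together, here is a hedged sketch of a custom operator; the class, table name and row count are hypothetical. In Airflow 2.x, the return value of `execute` is pushed to XCom by default, and because an operator is ordinary Python it can be unit tested directly:

```python
from airflow.models.baseoperator import BaseOperator


class RowCountOperator(BaseOperator):
    """Hypothetical custom operator that counts the rows in a source table."""

    def __init__(self, table: str, **kwargs):
        super().__init__(**kwargs)
        self.table = table

    def execute(self, context):
        row_count = 42  # a real operator would query self.table here
        # By default the return value is pushed to XCom, so a downstream
        # task can read it with ti.xcom_pull(task_ids="count_rows").
        return row_count


# Unit testing is standard Python - no scheduler or database required:
def test_row_count_operator():
    operator = RowCountOperator(task_id="count_rows", table="events")
    assert operator.execute(context={}) == 42
```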

Disadvantages

  • Unless you use one of the managed Airflow services, the tool must be self-hosted. Managing the three core components (web server, scheduler, executor) adds operational overhead, and local development can be cumbersome for the same reason.
  • The UI can be unresponsive and clunky at times.
  • The way Airflow handles time can be confusing at first because it runs a DAG at the end of its scheduled period rather than the beginning (see the sketch after this list). Towards Data Science has more detail.
  • Horizontal scaling (via Celery or Kubernetes executors) can be complicated to implement and involves managing concurrency and parallelism configuration.
  • It's common to have performance issues when scaling to a large number of pipelines/tasks.
  • Although it has a nice GUI for running and monitoring DAGs, the DAGs themselves must be created in Python code.
  • Pipelines can only run once per scheduled period, although they can be retried any number of times. This makes Airflow unsuitable when the required run frequency varies within a period, and pipelines often have to be designed around this restriction.
  • Fully automated pipeline testing can be hard to set up. This applies to most data pipeline tools and is not specific to Airflow.
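
To make the scheduling point concrete, here is a small sketch of the interval semantics (the dates are illustrative):

```python
from datetime import datetime, timedelta

# A daily DAG with start_date 2024-01-01 gets a first run whose logical
# date is 2024-01-01, but that run is only triggered once the interval
# has closed, i.e. shortly after midnight on 2024-01-02.
interval_start = datetime(2024, 1, 1)
interval_end = interval_start + timedelta(days=1)
print(f"Run for {interval_start:%Y-%m-%d} triggers at {interval_end}")
```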

Alternatives

Alternatives to consider include AWS Glue, AWS Step Functions, Luigi, Dagster, Prefect, Kedro and Metaflow.

Comparison with Dagster

Advantages

  • Dagster has a cleaner and richer React UI. DAG visualisations provide more detailed information and logs can be queried and filtered.
  • Passing data between tasks is a first-class concept in Dagster. Tasks expect typed inputs and return typed outputs which makes testing tasks significantly easier.
  • Writing Dagster pipelines feels like functional programming: tasks are simply decorated functions, and dependencies are defined via function composition (see the sketch after this list).
  • Local development is easier in Dagster. Pipelines can be run on the fly with config provided as YAML in the UI.
  • Dagster has no constraints around '1 pipeline run per period'.
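
A minimal sketch of this functional style using Dagster's `@op`/`@job` API (the names are illustrative, and older Dagster versions used `@solid`/`@pipeline` instead):

```python
from dagster import job, op


@op
def extract() -> list:
    return [1, 2, 3]


@op
def load(numbers: list) -> None:
    # Typed inputs and outputs make each op easy to test in isolation.
    print(sum(numbers))


@job
def etl():
    # Dependencies are expressed as ordinary function composition.
    load(extract())
```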

Disadvantages

  • In Dagster, pipelines can only be retried from the beginning.
  • Dagster does not come with (Airflow-like) sensors out of the box.
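
For reference, this is the kind of sensor Airflow provides out of the box: a minimal sketch using `FileSensor`, where the file path and polling interval are illustrative:

```python
from airflow.sensors.filesystem import FileSensor

# Polls for the file every 60 seconds and only lets downstream
# tasks run once it appears.
wait_for_file = FileSensor(
    task_id="wait_for_file",
    filepath="/data/incoming/daily.csv",
    poke_interval=60,
)
```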