Apache Airflow is an open-source Python framework for programmatically authoring, scheduling and monitoring workflows. It is primarily used for extract, transform, load (ETL) data pipelines.
- Airflow uses the concept of Directed Acyclic Graphs (DAGs) to define a workflow. In Airflow, DAGs are defined in Python.
- Unlike a traditional ETL tool, Airflow does not do the heavy lifting of loading and transforming data itself. It is a workflow orchestration and scheduling tool: operators execute arbitrary Python code, typically calling out to other services that perform the actual loading and transformation.
- Google offers a managed Airflow installation, Cloud Composer, which integrates with other Google Cloud Platform services. Astronomer also provides a managed Airflow service.
- Comes with a fully featured UI out of the box:
- Monitoring - pipeline and task statuses are displayed in real-time on a visual representation of a DAG. This representation is also a useful development tool to validate task dependencies.
- Centralised logging - logs for any task can easily be viewed.
- Pipeline management - pipelines and tasks can be manually started, stopped or restarted for any run (including historic runs).
- User management - users can be created with custom permissions.
- Many operators are available in the default installation, allowing you to interact with services such as AWS, Azure, Google Cloud and most databases.
- It is an open platform so you can write your own operators to do things that are not available out of the box.
- As DAGs are written as Python code, it is easy to integrate with source control and CI/CD pipelines.
- Unit testing of custom operators is possible as it is standard Python code.
- Can be horizontally scaled to run on multiple nodes via the Celery and Kubernetes executors.
- Can compose pipelines or trigger a pipeline in response to an event via the 'sensors' feature.
- Can 'backfill' by re-running all pipeline runs from a specified point in time.
- Pipelines can be retried from the task that failed.
- Can optionally feed an output from one task into the next via the XCom feature.
- It is open source so there are no licensing costs.
- Unless using one of the managed Airflow services, the tool must be self-hosted. Managing the three core components (web server, scheduler, executor) adds operational overhead. Local development can be cumbersome for the same reason.
- The UI can be unresponsive and clunky at times.
- Airflow's handling of time can be confusing when first using the product, as it runs a DAG at the end of its scheduled period rather than the beginning. Towards Data Science has more detail.
- Horizontal scaling (via Celery or Kubernetes executors) can be complicated to implement and involves managing concurrency and parallelism configuration.
- It's common to have performance issues when scaling to a large number of pipelines/tasks.
- Although it has a nice GUI for running and monitoring DAGs, they must be created in Python code.
- A pipeline can only run once per scheduled period (though it can be retried any number of times). This makes Airflow awkward when the required run frequency varies within a period, and pipelines often have to be designed around this restriction.
- Fully automated pipeline testing can be hard to set up. This applies to most data pipeline tools and is not specific to Airflow.
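The DAG model described above (tasks, `>>` dependencies, a resolved execution order, and XCom-style output passing) can be sketched in plain Python. This is an illustration of the concept, not the real Airflow API; an actual DAG file would import `DAG` and operators from the `airflow` package instead.

```python
# Sketch of Airflow-style DAG semantics in plain Python.
# NOT the Airflow API: illustrates tasks, '>>' dependencies,
# topological execution order, and XCom-like output passing.
from graphlib import TopologicalSorter


class Task:
    def __init__(self, task_id, python_callable):
        self.task_id = task_id
        self.python_callable = python_callable
        self.upstream = []  # tasks this one depends on

    def __rshift__(self, other):
        # t1 >> t2 means t2 runs after t1, as in Airflow.
        other.upstream.append(self)
        return other


def run_dag(tasks):
    # Resolve dependencies into a valid execution order.
    graph = {t: set(t.upstream) for t in tasks}
    xcom = {}  # simplified XCom store: task_id -> return value
    for task in TopologicalSorter(graph).static_order():
        xcom[task.task_id] = task.python_callable(xcom)
    return xcom


extract = Task("extract", lambda xcom: [1, 2, 3])
transform = Task("transform", lambda xcom: [n * 10 for n in xcom["extract"]])
load = Task("load", lambda xcom: sum(xcom["transform"]))

extract >> transform >> load  # declare dependencies
result = run_dag([extract, transform, load])
print(result["load"])  # 60
```

Because dependencies are ordinary Python expressions, a DAG file like this diffs cleanly in source control, which is what makes the CI/CD integration mentioned above straightforward.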
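The time-handling quirk is easier to see with a concrete calculation: a scheduled run covers the interval that has just finished, so it only fires at the end of that interval. A sketch of the arithmetic in plain Python (not Airflow's internals):

```python
# Sketch of Airflow's scheduling semantics: a daily DAG run labelled
# with logical date 2024-01-01 covers the interval 2024-01-01 -> 2024-01-02,
# and is only triggered once that interval has ENDED.
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)
logical_date = datetime(2024, 1, 1)                  # start of the data interval
actual_run_time = logical_date + schedule_interval   # end of the interval

print(actual_run_time)  # 2024-01-02 00:00:00
```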
Comparison with Dagster
- Dagster has a cleaner and richer React UI. DAG visualisations provide more detailed information and logs can be queried and filtered.
- Passing data between tasks is a first-class concept in Dagster. Tasks expect typed inputs and return typed outputs which makes testing tasks significantly easier.
- Writing Dagster pipelines feels like functional programming. Tasks are simply decorated functions and dependencies are defined via function composition.
- Local development is easier in Dagster. Pipelines can be run on the fly with config provided as YAML in the UI.
- Dagster has no constraints around '1 pipeline run per period'.
- In Dagster, pipelines can only be retried from the beginning.
- Dagster does not come with (Airflow-like) sensors out of the box.
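The "decorated functions plus function composition" style can be sketched in plain Python. This mimics the shape of Dagster's model (ops as typed, decorated functions; a job composing them); it is not the real Dagster API, which would use decorators from the `dagster` package.

```python
# Sketch of Dagster's programming model in plain Python: ops are typed,
# decorated functions, and a job wires them together by ordinary function
# composition. NOT the real Dagster API.
def op(fn):
    # In Dagster the decorator also registers metadata and checks types;
    # here it simply marks the function as an op.
    fn.is_op = True
    return fn


@op
def extract() -> list[int]:
    return [1, 2, 3]


@op
def transform(numbers: list[int]) -> list[int]:
    return [n * 10 for n in numbers]


@op
def load(numbers: list[int]) -> int:
    return sum(numbers)


def etl_job() -> int:
    # Dependencies are expressed by composing the functions directly.
    return load(transform(extract()))


print(etl_job())  # 60
```

Because each op is just a function with typed inputs and outputs, it can be unit tested in isolation (e.g. calling `transform([1])` directly), which is the testing advantage noted above.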