Where Apache Airflow fails and the 7 best alternatives currently on the market, with a special emphasis on their strengths, weaknesses, and best-fit users.
Who doesn’t love Apache Airflow? The Python-based open-source tool allows us to schedule and automate workflows with DAGs (Directed Acyclic Graphs). Data teams use Airflow for a myriad of use cases: from building ETL data pipelines to launching machine learning apps.
The open-source tool makes workflow management easy: it is extensible, easy to monitor from the intuitive user interface in real time, and it allows you to build dependencies between jobs.
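To make the DAG idea concrete, here is a minimal, Airflow-independent sketch using Python's standard-library `graphlib`. The task names are hypothetical; the point is only that a DAG encodes "this job may run only after its dependencies finish":

```python
from graphlib import TopologicalSorter

# Hypothetical four-step pipeline (task names are made up for illustration):
# transform depends on extract, load on transform, notify on load.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# static_order() yields the tasks in a dependency-respecting run order --
# the same ordering guarantee an orchestrator like Airflow enforces.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']
```

Because this particular graph is a simple chain, there is exactly one valid execution order; with branching DAGs, independent tasks could run in parallel.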
Unfortunately, many companies that started with Airflow have found that it doesn’t keep up with their data needs. Airflow underdelivers in complex data ecosystems, and its design makes it hard for many crucial stakeholders to adopt.
In this article we’ll look at where Apache Airflow fails and the 7 best alternatives currently on the market, with a special emphasis on their strengths, weaknesses, and best-fit users:
But first - where does Apache Airflow fall short?
Despite all its praises, Airflow fails on several levels:
Little-to-no deployment support. Beyond community help on Stack Overflow, you’re on your own: Airflow runs on-premises as a self-managed solution. Some cloud providers expose Airflow’s web user interface (UI) or command-line interface to paying customers (e.g. Google Cloud Platform via Cloud Composer, AWS via Amazon Managed Workflows for Apache Airflow (MWAA), and Microsoft Azure with Docker/Kubernetes deployments), but the managed service is often pricier than cloud-native alternatives (e.g. AWS Step Functions). Prepare yourself and your data engineering team to spend some time debugging the tool before making it work.
It requires programming and DevOps skills. Apache Airflow’s workflow-as-code philosophy excludes domain experts who need to self-serve their data needs but don’t know Python or another programming language.
No versioning of data pipelines. There is no way to redeploy a deleted Task or DAG. What’s worse, Airflow doesn’t preserve metadata for deleted jobs, so debugging and data management are especially challenging.
Windows users cannot use it locally. On Windows, your only option is to run Airflow with docker-compose. ‘Nuff said.
Confusing and counterintuitive scheduler. As a design choice, Airflow limits users in how they trigger tasks. Everything runs off a date-based scheduler, and no two runs of the same DAG can share an execution_date, so you have to create two identical tasks to mimic running the same job twice at once. What’s worse, the execution_date parameter doesn’t mean what it says: it is not the time the DAG run starts, but the start of the data interval the run covers, and the run itself only fires once that interval has ended. Huh? Time to do some mental arithmetic.
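The execution_date arithmetic trips up nearly everyone, so here is a rough sketch of the semantics in plain Python (the dates and interval are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical daily DAG: the scheduler labels each run with an
# execution_date, which marks the START of the data interval it covers.
schedule_interval = timedelta(days=1)
execution_date = datetime(2023, 1, 1)

# The run labeled 2023-01-01 does NOT start on Jan 1. It covers the
# interval [execution_date, execution_date + interval) and is triggered
# only once that interval has closed:
actual_trigger_time = execution_date + schedule_interval
print(actual_trigger_time)  # 2023-01-02 00:00:00
```

In other words, the run stamped with January 1st actually fires on January 2nd, one full interval later.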
Luckily, alternative tools make your life easier.
Let’s take a look at the best Apache Airflow alternatives on the market today.
Keboola is the best alternative to Airflow (and other workflow engines). While we might be biased, customer reviews (4.7 out of 5 stars on G2) support our claims. This data platform as a service offers you process automation for every step of the data lifecycle.
Keboola helps you manage, optimize, and automate all your data operations in one platform. From orchestrating workflows to building ETL data pipelines, or even launching machine learning products, Keboola offers plenty of out-of-the-box features to streamline your business processes.
This orchestration platform offers no-code, full-code (Python, SQL, dbt orchestrator, …), and API orchestrations, empowering everyone from domain experts with zero programming knowledge to technically proficient data engineers to do their best work.
You can build data pipelines and workflows with 250+ pre-built components that perform tasks in a couple of clicks. These pre-built integrations allow you to do almost anything: extract raw data from your SaaS data sources in minutes, send notifications to a Slack channel, ingest big-data-scale datasets into your Snowflake data warehouse, or even orchestrate AWS Lambda functions from Keboola.
Fully extensible. If there is no pre-built connector, you can use the Generic Extractor to collect data from any API-like source or the Generic Writer to load data to any destination.
Seamlessly scalable even with big data workflows (yes, data scientists <3 Keboola).
Extensive workflow automation features, such as out-of-the-box granular monitoring and data lineage, self-healing pipelines and retries, dynamic backend autoscaling, and others.
Keboola’s ecosystem of features optimizes and automates the work across many crucial roles: data science, DevOps, data engineering, versioning, data management, data security, and other data-driven business processes.
“Instead of separately selecting, acquiring, configuring and integrating an endless list of technologies to build your data stack, Keboola gets you there in one platform.”
Robert C., Head of Product at Gymbeam
Keboola starts workflows via webhooks or with a cron-like orchestrator. It’s not designed for real-time workflows.
Luigi is a framework that helps you stitch many tasks together, such as a Hive query with a Hadoop job in Java, followed by a Spark job in Scala that ends up dumping a table in a database.
Jobs are written in Python and Luigi’s architecture is highly intuitive.
Luigi makes it much simpler than Airflow to restart failed pipelines.
Task dependencies are hard to design. For example, executing two tasks in parallel and a third only after both have finished.
No distributed execution: Luigi will overload worker nodes with big data jobs. This makes it more appropriate for small-to-mid-sized data jobs.
Some features are only available to users on Unix systems - sorry Windows users.
Job processing is done via batch compute, so it is not useful for real-time workflows.
Job scheduling is achieved via cron jobs; there are no dedicated triggers such as event-triggered workflows.
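In practice, that means pairing Luigi’s command-line runner with the system crontab. A rough sketch, with hypothetical module and task names:

```shell
# Hypothetical crontab entry: run the DailyReport task from my_pipeline.py
# every morning at 06:00. Luigi itself never fires the job; cron does.
# (% must be escaped in crontab; \%F expands to today's YYYY-MM-DD.)
0 6 * * * luigi --module my_pipeline DailyReport --date $(date +\%F)
```

Anything beyond time-based triggering (a file landing, an upstream event) has to be bolted on yourself.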
Best for: Backend developers automating complex data flows with a Python-based solution.
Prefect is a data flow automation platform that allows you to provision and run your workflows with Python code, from Prefect’s user interface or via its API.
Prefect offers both a managed solution and an open-source solution for self-hosted deployment.
Easy to parametrize workflows and make them dynamic.
Can be parallelized and scaled using Kubernetes.
Supports event-driven workflows.
Very limited free tier. Their pricing for an actual workable workflow product starts at $450/month.
Really difficult deployment of the self-hosted solution.
Best for: Enterprise users who are looking for a pricier but managed workflow orchestrator.
Dagster is a dedicated data orchestration tool. By focusing on data, Dagster provides a solution very similar to Apache Airflow but with a more asset-based approach to orchestration. With Dagster you can specify data pipelines in terms of data asset dependencies - files, tables, and machine learning models.
Separates I/O and resources from the DAG logic, so it is easier to test locally than Airflow.
Easy to test locally before pushing to production: its design facilitates CI, code reviews, staging environments, and debugging.
Convoluted pricing model - the cloud solution is billed per minute of compute time, but rates vary across plans. There is an open-source version on GitHub that you can run as a self-hosted solution, but be ready for the steep learning curve.
Best for: The data-focused practitioner who has experience with data engineering.
Jenkins is an open-source automation server that uses plugins to automate CI/CD. The DevOps tool is written in Java.
Extremely powerful and versatile solution. The design of workflow-as-code allows you to customize Jenkins pipelines to every whim.
1,800+ community-contributed Jenkins plugins help you build, deploy, and automate almost any project faster, without having to code the solution yourself.
Jenkins focuses on the CI/CD pipeline. This presupposes your workflows are mostly software-based and that you are very experienced with programming.
Because of the software focus, many data operations such as outlier detection, data cleaning, efficient database replication, etc. are not available out-of-the-box.
Best for: The software developer (frontend/backend) or data engineer looking to automate low-level processes.
Astronomer (also called Astro) is an integrated managed platform that runs Apache Airflow for you. Astro takes care of all the DevOps parts, so you can focus on building workflows and data pipelines.
A fully managed version of Apache Airflow.
Offers building blocks for pipelines in Python and SQL.
Mitigates Airflow’s issues with local testing and debugging by offering its own CI/CD tool.
Unclear pricing. You’ll have to talk to sales to get a quote.
Astronomer’s managed solution still carries some of Airflow’s issues, such as task-triggering limitations and a lack of version control.
Best for: Data teams who want to keep using Apache Airflow, but don’t want to care about the management and DevOps aspects of it.
7. Apache NiFi
Apache NiFi is a web-based data flow builder. You construct data pipelines via a graphical user interface that builds workflows as DAGs (directed acyclic graphs).
Highly configurable workflows - from throughput to latency control, you can really customize Apache NiFi.
It can be used for real-time data streaming.
Apache NiFi’s data provenance module allows you to track data throughout its lineage.
Despite offering some security features (HTTPS for data in transit, authorization, …), Apache NiFi does not offer extensive security features out of the box.
Configuration and troubleshooting are highly technical: the target user needs deep knowledge of Java and network protocols to debug system failures.
Best for: SysAdmins and DevOps professionals building data flows.
How to choose the best workflow orchestration for your organization?
When you’re selecting the best workflow orchestration tool for your company, follow these criteria:
Target user. Some tools are no-code, while others require you to know how to code in SQL, Python, Scala, Java, etc. Pick the right tool depending on who will use the workflow orchestration tool (e.g. data engineer vs. business expert). For example, Keboola’s Visual Flow Builder allows users to create an ETL pipeline via a drag-and-drop no-code UI, while Keboola’s low-code blocks give engineers the speed and flexibility to build ETL pipelines via code-based logic.
Pricing. Open-source solutions are usually cheaper upfront but accumulate costs over time through maintenance, debugging, and custom coding. Vendor platforms are the other way around, since the provider usually covers the cost of maintenance. Calculate the total cost of ownership to understand which pricing level best fits your usage needs.
Ease of use. Intuitiveness will help you speed up the deployment of new workflows. Pick tools that shorten the number of keystrokes and button presses.
Extra features. Look for features that improve workflow orchestration but are not necessarily part of the core package. For example, Keboola offers a rich ecosystem of features such as the Data Catalog, which allows you to document and manage your data workflows right where they are generated.
Support and documentation. When things go wrong, is there a strong support system, such as vendor-guaranteed SLAs for support? Or if the tool is open-source, is there a strong community of users who can answer your questions? Is there extensive documentation you can rely on?
Choose Keboola for a fast and easy workflow orchestration that scales
Keboola has helped hundreds of companies automate their workflows in its easy-to-use platform while taking care of all the back-office work.