Workflow Orchestration Tools Comparison

Raaid, software, opinion

So you need to kick off some data pipelines based on a trigger or on a set schedule. You want to see if anything goes wrong, you want easy-to-parse logs, and you'd love a clean UI. It'd be great to configure your runs to execute locally or in a specific environment, you certainly want to be able to test them, you don't want to learn a new DSL, and you want to make sure the tool can handle lots of running jobs. Seems like a pretty common problem, so there should be a go-to solution, right?

It wasn't super clear to me, and I had to go through some pain to figure out what worked for me (and, more importantly, what didn't). This is an updated version of a document I wrote while at The Broad Institute, where I worked on orchestrating ETL pipelines that handle terabytes of file data and metadata. I ran into a lot of friction with some of these tools, and I hope my experience helps others avoid the same frustrations.

Note that I only considered tools I could use relatively easily as a Python developer, and did not consider tools or frameworks in other common languages like Java, JS, etc. I avoided fully managed tools like Google Cloud Composer because I prefer to avoid vendor lock-in. I didn't consider tools that cover only the pipeline part (like Apache Beam, or Google Dataflow, which is built on it). I'd also like to note that my suggestions hold up for smaller projects too: I am using my recommended tool on Pop the Bubble News, which handles far less data than what I dealt with while working on the Human Cell Atlas project at the Broad.

tl;dr: You almost certainly don't want to use Argo Workflows unless your use case is very simple and linear. Dagster is excellent, especially when paired with Pydantic. If you hate stronger typing, go with Prefect instead.

Tools Considered

Evaluation Criteria

Airflow

Summary

Not considered, though it is prevalent in the space, originated at Airbnb (it is now an Apache project), and has lots of users. DAG deployment is tricky, the Helm chart doesn't have official support, and it takes a lot of hacks and workarounds to get Airflow to work the way we would want. We prefer not to work against how something is designed if there is a viable alternative. Also, Prefect's hit piece is quite persuasive and makes some very good points about why one shouldn't use Airflow.

A summary of things that are not supported in a first-class way (lifted from Prefect's piece):

Argo Workflows

Summary

It works, but deployment, configuration, development, and testing are currently awful. Code should not be written in YAML, even if the tool is backed by Intuit, has corporate users, and is an incubating project of the CNCF.

Evaluation

Dagster

Summary

Ideal solution barring newness, upfront work, and lack of notable users/backing so far, though there is a company behind it now.

Evaluation

Luigi

Summary

Not considered. A former coworker used it a lot prior to joining my team and told me to avoid it, and I greatly respect their opinion. The docs are subpar, the use case is very Spotify-specific, and it isn't actively maintained. I can't find suitable deployment methods (no Helm chart!), and it seems to assume only scheduled workflows (not allowing for on-demand ones). Additionally, the syntax seems unnecessarily complex compared to Prefect/Dagster.

Prefect

Summary

Seems nearly ideal barring three things: newness; a less straightforward deployment overall, with jobs running through Dask on Kubernetes instead of directly on Kubernetes; and the fact that they have a product to sell, which they ultimately want you to use instead of running everything open source.

Evaluation

Serverless approach

Summary

Costs less and involves less infrastructure, but at the cost of visibility and retryability, and it locks you in to a cloud provider.

Evaluation

This would be a completely different approach. It would:

Why this serverless approach is better than the DAG approach:

Why the DAG approach is better than the serverless approach:

Conclusions

Airflow (and therefore Cloud Composer or anything built on it) and Luigi require working against the framework at times or simply don't meet the evaluation criteria. Workflows of this complexity should not be written in YAML (it gets so, SO, so messy), and that, plus the lack of clean deployment, local development, and testing, rules out Argo Workflows.

I think the serverless approach is interesting, but the DAG approach makes it easier to do a great many things, particularly on the dev experience side. I think the biggest advantage of the serverless approach is lower cost and less infrastructure, but doing so at the cost of losing visibility, retryability, debugging ability, and testability is not worth it. I don’t recommend going with the serverless approach given current technology.
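To make the visibility and retryability points a bit more concrete, here is a rough, stdlib-only sketch of the bare minimum a DAG runner does for you: run tasks in dependency order, retry failures, and keep a per-task record of what happened. This is not the API of Dagster, Prefect, or any other tool discussed here; `run_dag`, `TaskResult`, and the task names are all made up for illustration, and a real orchestrator adds scheduling, logs, a UI, and much more.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


@dataclass
class TaskResult:
    state: str     # "succeeded", "failed", or "skipped"
    attempts: int  # how many times the task body was called


def run_dag(
    tasks: Dict[str, Callable[[], None]],
    deps: Dict[str, Set[str]],
    max_attempts: int = 3,
) -> Dict[str, TaskResult]:
    """Run tasks in dependency order, retrying failures and recording outcomes.

    Assumes every name mentioned in `deps` is also a key of `tasks`.
    """
    # Map every task to its upstream tasks so tasks without dependencies still run.
    graph = {name: deps.get(name, set()) for name in tasks}
    results: Dict[str, TaskResult] = {}
    for name in TopologicalSorter(graph).static_order():
        # Skip a task when anything upstream did not succeed.
        if any(results[up].state != "succeeded" for up in graph[name]):
            results[name] = TaskResult("skipped", 0)
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                results[name] = TaskResult("succeeded", attempt)
                break
            except Exception:
                results[name] = TaskResult("failed", attempt)
    return results


# Hypothetical three-step pipeline whose last step fails once, then succeeds.
load_calls = {"n": 0}


def flaky_load() -> None:
    load_calls["n"] += 1
    if load_calls["n"] < 2:
        raise RuntimeError("transient failure")


results = run_dag(
    tasks={"extract": lambda: None, "transform": lambda: None, "load": flaky_load},
    deps={"transform": {"extract"}, "load": {"transform"}},
)
for name, result in results.items():
    print(name, result.state, result.attempts)
```

With a pile of independent serverless functions, the retry loop, the skip-on-upstream-failure rule, and the results table are all things you would have to rebuild and wire together yourself.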

I recommend Dagster. It is very well documented, has a fully supported official Helm chart, the community is excellent, the concepts are clear, the UI is spectacular, and it embraces stronger typing in Python (which I feel strongly about). Prefect works great too, but I think Dagster is better on the typing front, its UI edges out Prefect's, and the biggest blemish on Prefect is the difficulty of deployment (which makes sense because they want you to use Prefect Cloud, not deploy your own).
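For anyone wondering what "stronger typing" buys you in a pipeline, here is a small sketch of the idea using only the standard library. It is a stand-in, not how Dagster or Pydantic actually do it (Pydantic's validation is far more thorough), and `FileRecord`, `parse_records`, and `total_size` are hypothetical names. The point is that data gets validated at the boundary between steps, so a bad row fails loudly and early instead of deep inside a later step.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical payload handed from one pipeline step to the next."""

    path: str
    size_bytes: int

    def __post_init__(self) -> None:
        # Fail at the step boundary instead of somewhere deep in a later step.
        if not isinstance(self.path, str) or not self.path:
            raise ValueError(f"path must be a non-empty string, got {self.path!r}")
        if not isinstance(self.size_bytes, int) or self.size_bytes < 0:
            raise ValueError(
                f"size_bytes must be a non-negative int, got {self.size_bytes!r}"
            )


def parse_records(rows: List[dict]) -> List[FileRecord]:
    """First step: turn untyped rows into validated records."""
    return [FileRecord(path=row["path"], size_bytes=row["size_bytes"]) for row in rows]


def total_size(records: List[FileRecord]) -> int:
    """Second step: can rely on every record already being well-formed."""
    return sum(record.size_bytes for record in records)


records = parse_records(
    [
        {"path": "a.fastq", "size_bytes": 1024},
        {"path": "b.fastq", "size_bytes": 2048},
    ]
)
print(total_size(records))
```

A row like `{"path": "c.fastq", "size_bytes": "big"}` raises a `ValueError` in `parse_records`, before `total_size` ever sees it.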

This is just one person's opinion, so if you disagree, great! My experience and research with these tools led me to Dagster and far, far away from Argo Workflows.

© Raaid Arshad.