Believe it or not, Airflow 2.3 might be the most important release of Airflow to date. After all, Airflow 2.0 debuted with a refactored scheduler that eliminated a prior single point of failure and vastly improved performance. And then there's Airflow 2.2, which introduced support for deferrable operators, allowing you to schedule and manage tasks that run asynchronously - giving you new options for accommodating long-running tasks, as well as for scaling your Airflow environments. So Airflow 2.0 and 2.2 were big releases - but 2.3 is arguably even bigger. It brings:

- Dynamic task mapping, which lets Airflow trigger tasks based on unpredictable input conditions
- A new, improved grid view that replaces the tree view
- A new LocalKubernetesExecutor you can use to balance tradeoffs between the LocalExecutor and the KubernetesExecutor
- A new REST API endpoint that lets you bulk-pause/resume DAGs
- A new listener plugin API that tracks TaskInstance state changes
- The ability to store Airflow connections in JSON instead of URI format
- A new command built into the Airflow CLI that you can invoke to reserialize DAGs

Airflow 2.3 implements the dynamic task mapping API first described in AIP-42 - a genuine game changer for the hundreds of thousands of teams that depend on Airflow today. For one thing, it's an extremely useful new capability that becomes available to users with just a minor version upgrade, meaning it maintains full backward compatibility with Airflow 2.x. For another, dynamic task mapping is an inceptive feature - a new capacity the Apache Airflow community will build on as it keeps expanding what it is possible to do with Airflow.

Task mapping is a meat-and-potatoes kind of thing. Its archetypal pattern is a for-loop: for any collection - of files, tuples, models, etc. - run the same task once per item in the collection. Not only does this type of for-loop describe a common ETL pattern, it encompasses a very wide variety of use cases, about which more below. Dynamic tasks are essential when (for example) you can't predict how many files (or tuples, models, etc.) you're going to need to process: you just know you're going to need to process some number of them. Imagine being able to map tasks to variables, DAG configs, or even database tables. Instead of requiring you to manually invoke an operator to perform these and similar tasks, Airflow is now able to dynamically change the topology of your workflow at runtime. The upshot is that dynamic task mapping pays immediate dividends right now - and will pay even bigger dividends in the future.

Still, while dynamic task mapping may be the breakout hit of Airflow 2.3, the new release is not a one-hit wonder. Airflow 2.3 also queues up a new grid view, which replaces Airflow's default tree view; a local Kubernetes (K8s) executor, which gives users a new, useful option for running different kinds of tasks; a new option for serializing Airflow connections; and a slew of other features. Below, we dig into Airflow 2.3 - starting with dynamic task mapping.

The case for dynamic task mapping is obvious enough. Basically, you have a scenario in which you need to process n items in a collection - say, files in a directory, a container in object storage, or a bucket in AWS S3. You can't predict the value of n, but you want Airflow to automatically schedule and run n tasks against these files. There were basically two ways to handle this scenario with Airflow prior to 2.3.
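To make the pattern concrete, here is a minimal plain-Python sketch of the "one task per item" loop that dynamic task mapping automates. The file names are hypothetical stand-ins for an unpredictable collection:

```python
# Sketch of the "run the same task once per item" pattern that
# dynamic task mapping automates. The returned list stands in for
# an unpredictable collection, e.g. objects in an S3 bucket.
def list_files():
    return ["2022-05-01.csv", "2022-05-02.csv", "2022-05-03.csv"]

def process(path):
    # Placeholder for the real per-file work (parse, transform, load).
    return f"processed {path}"

# n is unknown until list_files() runs; one "task" per item.
results = [process(f) for f in list_files()]
```

In Airflow 2.3 itself, you express this by calling `.expand()` on a TaskFlow task - for example, `process.expand(path=list_files())` - and the scheduler creates one mapped task instance per element of the upstream result at runtime.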