Airflow is purpose-built for high-scale workloads and high availability on a distributed platform. Since Airflow 2.0, even more tools and features are available to help Airflow scale to high-throughput, data-intensive workloads. In this webinar, Alex Kennedy will discuss scaling out Airflow with the Celery and Kubernetes executors, including the parameters that need to be tuned when adding nodes and how to decide when to scale Airflow horizontally or vertically. Consistent, aggregated logging is key when scaling Airflow, so we will also briefly cover best practices for logging on a distributed Airflow platform, as well as the pitfalls many users encounter when designing and building one.
Key Takeaways:
- With the right infrastructure and architecture, Airflow is capable of massive scale! Getting there will require patience and experimentation, but the latest versions of Airflow make this process as painless as possible.
- Airflow’s CeleryExecutor and KubernetesExecutor are designed for scalable workloads.
- Key parameters in your Airflow configuration need to be carefully tuned so that Airflow scales smoothly with minimal latency between tasks.
- Scaling with Celery is as easy as adding a node to your cluster and providing that node with the correct configuration and Airflow files.
- Aggregated and consistent logging is crucial for debugging a scaled Airflow platform.
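
To illustrate why configuration has to be tuned alongside node count, here is a minimal sketch, not taken from the webinar. It assumes a single scheduler and models only two real Airflow settings, `[core] parallelism` and `[celery] worker_concurrency`; the `CeleryCluster` class, the helper functions, and the numbers are hypothetical and purely illustrative.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CeleryCluster:
    """Hypothetical model of a Celery-backed Airflow deployment."""
    workers: int             # number of Celery worker nodes
    worker_concurrency: int  # [celery] worker_concurrency, task slots per worker
    parallelism: int         # [core] parallelism, cap on concurrently running tasks


def effective_task_slots(cluster: CeleryCluster) -> int:
    """Tasks that can run at once: the smaller of the scheduler-side cap
    and the total slots offered by all workers (single scheduler assumed)."""
    return min(cluster.parallelism, cluster.workers * cluster.worker_concurrency)


def as_env(cluster: CeleryCluster) -> dict[str, str]:
    """Render the two settings using Airflow's AIRFLOW__SECTION__KEY env-var form."""
    return {
        "AIRFLOW__CORE__PARALLELISM": str(cluster.parallelism),
        "AIRFLOW__CELERY__WORKER_CONCURRENCY": str(cluster.worker_concurrency),
    }


if __name__ == "__main__":
    base = CeleryCluster(workers=3, worker_concurrency=16, parallelism=32)
    print(effective_task_slots(base))    # 32: parallelism is the bottleneck

    # Adding a node alone does not raise throughput in this model.
    bigger = replace(base, workers=4)
    print(effective_task_slots(bigger))  # 32

    # Raising parallelism to match worker capacity unlocks the new node.
    tuned = replace(bigger, parallelism=64)
    print(effective_task_slots(tuned))   # 64
    print(as_env(tuned))
```

In this simplified model, the fourth worker adds capacity only after `parallelism` is raised. Real deployments also have to account for pools, per-DAG concurrency limits, and scheduler and database load.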
Jul 30, 2024