
Don't Use Apache Airflow 

Bryan Cafferky
88K views

Apache Airflow is touted as the answer to all your data movement and transformation problems, but is it? In this video, I explain what Airflow is, why it is not the answer for most data movement and transformation needs, and provide some better options.
Join my Patreon Community and Watch this Video without Ads!
www.patreon.com/bePatron?u=63...
Slides
github.com/bcafferky/shared/b...
Follow me on Twitter
@BryanCafferky
Follow Me on LinkedIn
/ bryancafferky

Science

Published: 26 Jun 2024

Comments: 191
@wexwexexort · 2 years ago
I wasn't using it but after this video I just changed my mind. I'm gonna schedule some jobs using Airflow next sprint.
@Seatek_Ark · 1 year ago
I was recently brought onto a team to convert our ETLs from Apache NiFi over to Airflow, and while your assessment is fine, I think there are a few areas where I would have structured this differently. 1. Airflow is not an ETL tool; you're right in calling it a job scheduler, and it's technically referred to as a task scheduler. In your ETL processes there are really five things that you're trying to do: a. trigger when an event happens (an email is received, x amount of time has passed, someone put a file in your fileshare or S3 bucket, some notification prompts you to start); b. extract your data from one location; c. transform your data, which is where the bulk of your coding comes into play; d. put your data into its appropriate database or storage; e. make sure a-d goes off without an issue. The reason why Airflow is a great ETL tool is that it does a and e by itself really well, and it facilitates b and d. Hooks and sensors are built into Airflow and are fully customizable. If your project is reliant on programs like Glue then you can do all of this in the AWS suite (or Azure or GCP), but Airflow very cleanly packages up your connection points and your custom ETL and runs that sequence of tasks beautifully. Should you default to Airflow? If your data engineers are already experts, it's fine; if not, then no. Is it the magic tool for ETL? No, watch for AWS and fellow tech giants to come out with something like that in the next 5-10 years. Is it the best task scheduler? Due to support it's miles ahead of its competitors, so yes.
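A minimal sketch of the trigger (a) / oversight (e) split described above, assuming Airflow 2.x. Import paths and the `schedule` argument vary by version, and the file path and callables here are hypothetical placeholders, not anything from the video.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def extract_and_load():
    # b + d: your own code goes here; Airflow only sequences and retries it
    print("moving the landed file into the warehouse")


with DAG(
    dag_id="landing_file_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",          # older 2.x releases use schedule_interval
    catchup=False,
) as dag:
    # a: wait for an external event (a file landing) instead of just a clock tick
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/landing/sales.csv",   # hypothetical path
        poke_interval=60,
    )

    # c: the transform itself is ordinary Python you could run anywhere
    load = PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)

    wait_for_file >> load   # e: Airflow tracks success/failure and retries per task
```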
@vasdecabeza2 · 8 months ago
I agree. Furthermore, Airflow is a workflow/orchestration [management] tool/platform; that's why it includes job/task scheduling, monitoring, retries, and other features. On the other hand, there are things I don't like about Airflow, like the lack of a declarative way (via JSON or YAML) to define DAGs and tasks.
@gudata1 · 2 years ago
Airflow is a scheduler and it doesn't care what code you run. The easiest approach is to pack all your Golang/Rust/Python code into Docker containers and scale with that.
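A sketch of that "pack everything in containers" approach, assuming the apache-airflow-providers-docker package is installed; the image name and command are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="containerized_job",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # Airflow only schedules and monitors; the Go/Rust/Python logic lives in the image
    run_job = DockerOperator(
        task_id="run_job",
        image="myregistry/my-etl-job:latest",  # hypothetical image
        command="run --once",
    )
```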
@sanjaybhatikar · 2 years ago
Beautifully explained! I love how you dive into the code without getting lost in the weeds. Very helpful, thank you :)
@BryanCafferky · 2 years ago
Thanks!
@tomhas4442 · 1 year ago
Been using airflow a little over a year now and totally agree with most of your points. Appreciate it for logging, monitoring of pipelines and the visualizations. Also the good K8s integrations and active community. Would recommend it if most of your code to orchestrate is Python or dockerized. It does come with some downsides like the lack of pipeline version management or the complex setup. There are managed versions though, e.g. Cloud Composer
@ariocecchettini1159 · 1 year ago
Dear Bryan, thank you for your informative video! For me personally it is actually great news that Airflow IS NOT a full-fledged ETL tool, this is actually exactly what I need. I honestly don't see mentioned limitations (no ETL functionality) as a disadvantage. ETL as a concept is also becoming outdated, in the wake of new approaches such as data mesh and service mesh solutions. What is definitely a no-no is the amount of code overhead and the strong coupling. Will definitely look into suggested tools.
@BryanCafferky · 1 year ago
Yeah. It is good for orchestration and it can work with Databricks.
@yevgenym9204 · 2 years ago
As someone coming from SSIS, and who literally hates it for being far too much graphical interface, I have to say you did a good job describing the problems with Airflow.
@BryanCafferky · 2 years ago
Thanks
@AP-nq4pe · 2 years ago
The only thing I hate in SSIS is the variables. If you follow the ELT pattern and do minimal/no data transformation in the package, it is nice, scalable and, most importantly, easy to administer/manage without tons of code.
@BryanCafferky · 2 years ago
@@AP-nq4pe Best to do most work in SQL Server T-SQL, but SSIS does orchestrate well. Package parameters are also a nice feature.
@michaeldowd5545 · 10 months ago
Python code is often Keep It Stupid Simple compared to SSIS or other tools, for that matter.
@mirmir1918 · 1 year ago
Very good explanation! It's good that other options (products from AWS or MS, etc.) are mentioned.
@lahvoopatel2661 · 1 year ago
This is amazing. Rarely is anyone so fair in evaluating a popular tool like Airflow.
@BryanCafferky · 1 year ago
Thank you. There are some who disagree but I was trying to be fair.
@paleface_brother · 2 years ago
Thank you, Bryan, for your videos. They are really useful. It would be very kind of you to make lessons about Apache NiFi, especially how to choose processors for the needed actions.
@BryanCafferky · 1 year ago
Possibly. So many tools out there. Thanks for the suggestion.
@oyeyemirafiuowolabi2347 · 1 year ago
I agree with you on this. Thanks, Mr. Bryan.
@-MaCkRage- · 1 year ago
I'm a developer on a data analytics team, and now I'm setting up Apache Airflow for my team. They will create DAGs using JupyterLab and it will be very comfortable.
@bnmeier · 1 year ago
Although I agree with most of what was said in this video, I do have some comments that would likely change someone's mind as it pertains to using Airflow in a real-world business scenario. I agree Airflow is not an ETL/ELT tool. I would agree that it is a scheduler. I disagree that code is not reusable; that's one of the reasons why providers and operators exist. If you want to use the same set of tasks multiple times inside the current project or across multiple projects, create a custom operator and use it where you wish.
If you are running a medium to large business and the company/IT philosophy is to adopt products that have vendor support, then NiFi and Kettle are not going to be for you. There is no one to call for support when your production instance of either of those goes down. With Airflow a business has the ability to go with Astronomer for a fully vendor-supported and highly automated solution which doesn't require the heavy lift of setup. Anyone saying they use AWS Glue and love it has either not used it or is lying to you. Simply put, it has a long way to go to catch up with most orchestrator-type tools like Azure Data Factory. If you are in a situation where your company has chosen AWS as its cloud provider and Snowflake as its cloud data warehouse, your options are limited for orchestration of workflow, which is a major player in a complete data pipeline strategy. Products like Matillion are great for drag-and-drop functionality but are expensive and have a huge deficiency in deployment pipelines and CI/CD implementation.
If you are living in the cloud data space and don't know Python at least at a basic level, there is a good chance you are entry level and will need to learn it at some point, or you are not very effective at putting together data pipelines. One of the most powerful libraries available to someone in the data space is the pandas Python module, and it becomes a very powerful tool in Airflow or any other orchestration engine dealing with data movement. Just my 2 cents. Again, I don't disagree with what was said; I just think there are way more valid use cases and reasons to use Airflow than insinuated.
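A rough sketch of the reuse argument above: wrap a repeated task in a custom operator once, then use it from any DAG. This assumes Airflow 2.x; the table and S3 parameters are hypothetical, and real logic would use hooks and connections instead of a log line.

```python
from airflow.models.baseoperator import BaseOperator


class ExportTableToS3Operator(BaseOperator):
    """Hypothetical reusable task: dump one table to an S3 prefix."""

    def __init__(self, table: str, s3_prefix: str, **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.s3_prefix = s3_prefix

    def execute(self, context):
        # The real export logic (database hook, S3 hook, etc.) would go here.
        self.log.info("Exporting %s to %s", self.table, self.s3_prefix)


# In any DAG file:
#   export = ExportTableToS3Operator(task_id="export_orders",
#                                    table="orders",
#                                    s3_prefix="s3://bucket/orders/")
```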
@MichaelCizmar · 2 years ago
Thanks for this. It is easy to understand things sometimes in the context of when you should not use it rather than what it's for.
@BryanCafferky · 2 years ago
YW
@shutaozhang9827 · 2 years ago
I am studying Apache NiFi now; it looks like a good tool for ETL purposes. Thanks for your comments.
@patrickbateman7665 · 2 years ago
Recently an idiot on Reddit argued with me by saying Airflow is better than Data Factory. This video says it all. Thanks a lot, Bryan 🙏
@BryanCafferky · 2 years ago
Well, Airflow may be better at some things but not data movement/transformations in most use cases. ADF is a solid choice if you are on Azure.
@Theoboeguy · 1 year ago
whether I end up using airflow or not, this is a great video that clearly explains how to use the tool and your perspective. thank you!
@BryanCafferky · 1 year ago
Thanks for your kind words. Glad it is helpful!
@DodaGarcia · 5 months ago
I've been using Airflow for a little over a year and your video really confirmed that a lot of the things that have been bugging me about it are not really a me problem. I really love how powerful it is, but having been using it mostly for ETL, I've often found myself overwhelmed with all the coupling and the little "gotchas" in the form of how specifically things have to be set up. It adds a lot of overhead from the get-go, and importantly, means that no matter how well designed the business code is, whenever something breaks or needs to be changed I always need to re-learn all of the Airflow-specific code. I can see why it's a favorite for specialized data teams whose main job is maintaining data pipelines, but not for use cases like mine in which the data flow management is just a small part of the job. So not really anything wrong with Airflow, just that it might be overkill for users like myself. I'm going to look into some of the ETL tools you mentioned, and one thing I'm very interested in using Airflow for soon is managing 3D rendering pipelines. I think it's going to be fantastic for coordinating render jobs and their individual frames, which are often in the thousands.
@BryanCafferky · 5 months ago
Yeah. After the video, I came to the conclusion that there are job schedulers and orchestrators and often you just need a good job scheduler. When the complexity requires an orchestrator, I recommend you look at Dagster. It is much more extensible, testable, and adds a ton of features over Airflow. I've been studying up on it for months to be sure I liked it. dagster.io/
@igoryurchenko559 · 8 months ago
A main issue with defining a function inside another function is that it's impossible to unit test, and testing is vital for data processing. It looks like all tasks should be written and tested as standalone functions and adapted to Airflow by an additional abstraction layer.
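One way to keep tasks unit-testable, along the lines suggested above: put the business logic in plain functions with no Airflow imports and keep the Airflow layer as a thin wrapper. A sketch assuming the TaskFlow API in Airflow 2.x; the function and pipeline names are hypothetical, and in a real project the three sections below would live in separate files.

```python
# transforms.py -- no Airflow imports, trivially testable
def clean_rows(rows: list) -> list:
    return [r for r in rows if r.get("id") is not None]


# test_transforms.py -- run with pytest, no Airflow needed
def test_clean_rows_drops_missing_ids():
    assert clean_rows([{"id": 1}, {"id": None}]) == [{"id": 1}]


# dag file -- the only Airflow-specific code is this adapter layer
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def cleaning_pipeline():
    @task
    def clean():
        rows = [{"id": 1}, {"id": None}]   # stand-in for a real extract
        return clean_rows(rows)

    clean()


cleaning_pipeline()
```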
@supernova5839 · 2 years ago
That was a good introduction to Apache Airflow and its use cases. If it has such complex code and very limited use cases, then why do most companies look for Airflow skills when hiring a big data engineer?
@BryanCafferky · 2 years ago
Good question. My guess is 1) they adopted it based on the false hype and cool logo, and 2) they did not evaluate other tools for the job. Also, sometimes managers ask for lots of skills that will never be used. I would clarify in an interview whether they really use it.
@f_lyru1304 · 2 years ago
It's not that complicated. I learned the basics in one week and I'm not even a pro with Python. And if you use a DAG factory or something similar that builds DAGs from YAML files, those files can be created through a friendly interface for users who don't know Python. They only need to read the documentation to better understand the operators, their arguments, and how to create connections.
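A hand-rolled sketch of the "DAG factory" idea mentioned above: generate DAGs from declarative config so non-Python users only edit data, not code. This is not the dag-factory package's actual API, just the general pattern; the config entries and script paths are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# In practice this dict would be loaded from a YAML file maintained by analysts.
CONFIG = {
    "sales_daily": {"schedule": "@daily", "command": "python3 /jobs/sales.py"},
    "hr_weekly": {"schedule": "@weekly", "command": "python3 /jobs/hr.py"},
}

for name, spec in CONFIG.items():
    with DAG(
        dag_id=name,
        start_date=datetime(2024, 1, 1),
        schedule=spec["schedule"],
        catchup=False,
    ) as dag:
        BashOperator(task_id="run", bash_command=spec["command"])

    # Airflow discovers DAGs by scanning module globals, so expose each one.
    globals()[name] = dag
```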
@evgeny_web · 2 years ago
Hi, thank you very much for this video. The project where I work plans to replace Apache Oozie with Airflow, so I think it is pretty useful to watch a video like this one. I don't have any prior knowledge of Airflow, and this made it very easy to understand the main ideas behind the framework.
@snehotoshbanerjee1938 · 2 years ago
Very nice video, Bryan! What is your take on Prefect? They highlighted a few shortcomings in Airflow, hence Prefect. But Airflow in its recent versions came up with less boilerplate. Happy to hear back from you on Prefect. Thanks!
@enesteymir · 2 years ago
Thanks for the clear explanations. I haven't used Airflow yet, but it is in nearly all the job posts :) Companies do like to use it.
@BryanCafferky · 2 years ago
YW. Yeah, I wonder if they all use it or just list it in job ads. But it could be. The best tool is often not the one selected for the job. Thanks for watching.
@thiagopdesouza · 1 year ago
Dear Bryan, thank you very much for this video! Very valuable and straight to the point content. Congrats!
@BryanCafferky · 1 year ago
Thank you!
@janHodle · 11 months ago
On most points I agree... Airflow is not an ELT tool. It's an orchestrator, in my opinion the best in the world. In the company where I work I built up a BI platform for online activities. I tried a lot of tools, don't want to mention them all, but they all had a lot of drawbacks and were expensive. I ended up using Airflow and I'm pretty happy with it. Sure, it's all code! That's what you have to keep in mind. Other tools like dbt, Airbyte and so on integrate perfectly into Airflow, so scheduling and monitoring the entire pipeline is absolutely great. On the other hand, I had to struggle with a lot of data sources where out-of-the-box tools had problems understanding the data. In the end I had to program a middleware in Python to make the data compatible with those tools; now it works inside the Airflow environment. Because Airflow delivers a lot of good operators, the code got even smaller. Furthermore, the Docker (Compose) images are great and the Helm charts are good... So yes: it's not a native ELT tool. You have to use code only... but with code only comes a lot of flexibility. I don't want to go back to Kettle, Talend or SAP Data Services. What looks interesting is NiFi...
@BryanCafferky · 11 months ago
Thanks for the feedback. I agree if you need a high degree of control and have a lot of dependency/complexity it can be a good option. It does not fit most of the use cases I have done over several decades of data engineering work though.
@gamsc · 1 year ago
Thanks. Very informative.
@Jeffsdata_0 · 1 year ago
Love the video. Definitely made me think and gave me some good tools to look into. A few notes here (I'm an Airflow noob, but I've at least used it...)
1. It doesn't really work on Windows like it says in the screenshot at the beginning, unless you're using Docker or WSL. It only works on Linux.
2. It does not only support Python. As you mention, there's a BashOperator, which means it can run anything using a bash script (Python, JavaScript, PHP script, Java app, C# console app, etc).
3. I think it's a bit disingenuous to say your DAG code could be more than your actual code running; the DAG definitions are insanely simple... your examples are probably about as complex as 70% of jobs (outside of the actual logic).
4. All the alternate solutions you present also have overhead to learn and their own proprietary outputs (that can't be reused anywhere else, except maybe Data Factory, which might be able to port into SSIS on-prem or whatever). A Python script (or whatever script: PowerShell, C# app, etc.) can run just about anywhere.
5. Instead of putting your Python logic inside the DAG script, you can just use a BashOperator to run the Python script (e.g. "python3 -m path/to/thescript.py"), which means you can decouple and use the script part anywhere, and the DAG definition is the only thing specific to Airflow (which is... trivial most of the time); see the sketch below. This might not work if you have complex dependencies between your scripts; mine were always fairly linear jobs like: move data to cloud, train ML model, run batch model outputs, do something with the outputs, update some API.
I'll just say... if you're currently running C# console and Python script jobs on Windows Server Scheduler (which is where I'm coming from, lol!), Airflow is an awesome tool that's super easy to get started with. We didn't end up using it because it was Linux-only and our infra team is scared of Linux (and Docker... and WSL2...).
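A sketch of point 5 above: keep the business logic in standalone scripts and let the DAG only sequence BashOperator calls. Assumes Airflow 2.x; the script paths are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ml_batch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Each script is reusable outside Airflow; the DAG just wires them in order.
    move = BashOperator(task_id="move_to_cloud", bash_command="python3 /jobs/move_to_cloud.py")
    train = BashOperator(task_id="train_model", bash_command="python3 /jobs/train_model.py")
    score = BashOperator(task_id="batch_score", bash_command="python3 /jobs/batch_score.py")

    move >> train >> score   # the only Airflow-specific part is this wiring
```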
@BryanCafferky · 1 year ago
Thanks. Lots of good comments. My point is about parsimony: do only as much as needed and keep maintenance in mind. To create the DAGs, I believe Python must be used, but from there you can call other languages. Not sure how tightly integrated the other operators are, i.e. they seem to just shell out, but OK. I've used SQL Server Agent for ETL scheduling and it worked great with no coding required. But in the cloud, I need to use other options like Azure Data Factory, etc. Azure also has Azure Automation, but I wish Azure had a good job scheduler.
@HamzaHafeez7292 · 1 year ago
@@BryanCafferky Having worked extensively with Airflow in recent months, on multiple proofs of concept, I will admit it has a fair bit of complexity to it. However, it does provide a lot of operators out of the box, e.g. DockerOperator and KubernetesPodOperator. Working with those in a managed environment like AWS MWAA (Managed Airflow) has made things very straightforward for us. We have been using our pre-cooked Spark Docker images to carry out all the tasks at runtime. It does require a fair bit of training to understand how to use it best. And cost, yes, the cost is expensive. But we were able to get started with Airflow on AWS in a couple of hours and were testing out Spark modules that very day.
@abhinee · 2 years ago
Writing 800 lines of code to schedule a job in Airflow... I totally agree with you, it's a pain in the wrong place.
@tutkal1985 · 1 year ago
Clear and great explanation.
@vilivilhunen3383 · 2 years ago
Thanks! I struggled getting Airflow up and running; it seems like a really complex system. I'll take a look at Apache NiFi instead :)
@MSPalazzuoli · 2 years ago
Thanks, best explanation ever!
@BryanCafferky · 2 years ago
You're welcome!
@H1d3AndSeek1 · 2 years ago
Very interesting video. What would be a suitable orchestrator to use if e.g. our stack for ELT is Fivetran and dbt. While yes, we might be able to hook up these individual tools directly, I feel an overarching orchestrator ("dag job scheduler") is needed. So, I am not interested in using Airflow as ETL/ELT but I always thought it would only be an orchestrator tool. Cheers
@ben.morris · 2 years ago
Thank you for the POV. Take a look at dbt too, from Fishtown Analytics. I think version control needs to be a core requirement for any tool that is responsible for moving data. This might be a problem if the solution isn't code-based.
@BryanCafferky · 2 years ago
Thanks for your feedback. Do you use dbt or work for Fishtown? Source code control can take many forms. SSIS stores its programs as XML which can be placed under SSC. The level of and need for SSC depends on the project requirements. For example, in a small shop where one person maintains the code, ease of use and a GUI may outweigh the need for SSC assuming the ETL object snapshots can be stored.
@AP-nq4pe · 2 years ago
@@BryanCafferky The latest version of SSIS I checked does version control and CI/CD like a pro!
@josuevervideos · 1 year ago
Great video!! Thank you.
@falcon20243 · 1 year ago
Thanks Bryan, this is a good video.
@goutham4678 · 1 year ago
The KubernetesPodOperator can be used to run any Docker image using Airflow.
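A sketch of that point: any image, any language, one pod per task. Assumes the cncf.kubernetes provider package is installed; the import path differs between provider versions, and the image, namespace, and command are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="pod_per_task",
    start_date=datetime(2024, 1, 1),
    schedule=None,          # trigger manually
    catchup=False,
) as dag:
    run_rust_job = KubernetesPodOperator(
        task_id="run_rust_job",
        name="rust-etl",
        namespace="data-jobs",                # hypothetical namespace
        image="myregistry/rust-etl:1.4",      # hypothetical non-Python image
        cmds=["/app/etl"],
        arguments=["--date", "{{ ds }}"],     # Airflow templates the run date in
        get_logs=True,
    )
```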
@ForestFWhite · 1 year ago
Good comparisons. Python has the best/easiest frameworks (pandas, pyspark, et al.) for data transformation so that isn't a limitation.
@SagarSingh-ie8tx · 1 year ago
You are correct 👍
@awadelrahman · 2 years ago
Does AWS StepFunctions Service fit anywhere within those alternative options?
@BryanCafferky · 2 years ago
I have not used them but from the docs, yes, it looks like Step Functions would be a good option.
@ivarec · 2 years ago
Your channel is awesome (and I'm very picky). I've recommended it to my whole team and I'll try to get our company to help you on Patreon as well. Keep it up!
@BryanCafferky · 2 years ago
Thank you so much!
@sakesun · 2 years ago
Agree with the video.
@davidgao4333 · 1 year ago
I use Airflow too, but I totally agree with Bryan's point of view. Airflow is a powerful tool, but the other side of the coin is its steep learning curve, especially for new Airflow users. Most of the time I just need to do simple stuff, and I find using Airflow leads to over-engineering. Lots of people use the Kubernetes operator; the biggest problem I see with it is that a lot of the time I have one common base Docker image, but I need to bundle different code into that image just for the sake of using the Kubernetes operator.
@BryanCafferky · 1 year ago
If you are doing Databricks, the new Workflows are pretty easy to use to create task orchestration. Thanks for your comment.
@davidgao4333 · 1 year ago
@@BryanCafferky Thank you for posting this video. It's THE best video that I've encountered that explains what Airflow is. A lot of people in my company use Airflow for use cases that are not a fit for Airflow.
@BryanCafferky · 1 year ago
@@davidgao4333 Glad you liked it.
@Praveen_Kumar_R_CBE · 2 years ago
Very true..
@brendoaraujo9110 · 2 years ago
Hello, I have Airflow running on my machine with PostgreSQL as the scheduler's backend and the LocalExecutor, but when I put my DAGs to run it consumes a lot of server CPU. How could I solve this high consumption problem?
@BryanCafferky · 2 years ago
If you are running it all on your machine, then it sounds like your machine may not have enough power to support it. You could deploy Airflow to cloud VMs or Kubernetes cluster to get more resources. This Stack Overflow talks about limiting Airflow memory consumption. stackoverflow.com/questions/52140942/airflow-how-to-specify-quantitative-usage-of-a-resource-pool This blog discusses how to configure Airflow with setting for max_threads, worker_concurrency, etc. medium.com/@sdubey.slb/high-performance-airflow-dags-7ad163a9f764
@jenya7united · 1 year ago
Hi, I am working with Pentaho. Can you make a video on it?
@rodrigoloza4263 · 1 year ago
Airflow is great. Coupled with Kubernetes you don't have to stick to Python anymore. The only drawback I saw was that DAGs don't scale when they have huge amounts of tasks. Though it's easy to solve by splitting the DAG.
@BryanCafferky · 1 year ago
Thanks for the comment. You do have to define the DAGs in Python. What do you use Airflow for?
1 year ago
There are different ways to use Airflow; you can rely on Kubernetes pods to run Docker instances. In recent versions, you can scale schedulers to solve task issues. Nowadays, anyone can give an opinion just by reading Wikipedia and some basic examples. It's not an ETL solution. It's just an orchestrator with batteries included.
@IgorLucci · 2 years ago
very good!!
@TheUnderdogr · 1 year ago
I think there's a misunderstanding. Airflow is NOT an ETL tool, and I don't think it was ever meant to be, or marketed as such. It's rather an unfortunate confusion in the minds of many between the workflow management / orchestration (which Airflow DOES) and the ETL tasks that actually implement the data transformations which make up the ETL pipeline (which should not be, and usually are not, Airflow tasks). With the Airflow-on-AWS service we run nightly data ingestion of RDBMS data into AWS S3; all the tables in a given schema are processed in parallel Airflow tasks, but each of these tasks is just calling an Informatica script which actually does the job of ingesting a given DB table. So, again, if people don't understand the meaning of orchestration, don't blame the tool 😁
@ricardorodriguez4180 · 2 years ago
This is a cool vid. I personally love Airflow, I use it mostly as an interface to k8s and run applications on pods. I think in terms of "if I can write a container for the task, then I can orchestrate it in Airflow". You're not wrong though, it did take time to learn the intricacies of Airflow (both in code and UI). Our company-practice is to make reusable functions that generate DAGs, reduces the code for creating workflows per use-case down to just a function call. Thanks for putting this video together. I learned about some good alternatives.
@BryanCafferky · 2 years ago
You're welcome and thanks for your feedback.
@razyuval · 1 year ago
First, thanks for this video and thanks for your insights. I understand why you said some things, but I don't agree with most of it. You're right that Airflow is a great job scheduler, not an ETL/ELT tool. But from my experience, neither is NiFi, not if you want to do some long, complex batch jobs; each block is autonomous and they don't wait for the previous one to complete. (The others I have very little experience with, so apart from them being pretty expensive...) I think the strength of Airflow, the reason I chose to use it, is the level of control you get and the diversity of jobs/tools you can use. It can start with a bash task calling a Talend job that loads your DB and then a dbt job that processes it. You can further split your dbt into tests and loads, and when there's a failure, rerun from the point of failure. These are features I saw in expensive enterprise tools such as Control-M. It does have a steep learning curve, but looking at the trends in the market today and the way teams are being structured (engineers for infrastructure and analysts for the BI part), I think it's a good choice.
@BryanCafferky · 1 year ago
Thanks. NiFi is documented as only an ETL tool and seems to fit that from what I read, though I have not used it. As I discuss later in the video, Airflow can be a good choice as a scheduler if you need the sophistication, i.e. DAGs, it offers. I purposely titled the video to alert people who think Airflow is an ETL tool that it is not. That's what I wanted to use it for, and after reading a book on it, I realized it's not an ETL tool. It is a workflow engine; there's a similar one in Windows that works with C#. It's fine if that's what you need. Airflow seems great for complex ML pipelines. On SQL Server, I have used SQL Server Agent, which worked well for that environment. It had sufficient dependency management and control for most jobs. The best ETL service to use depends on your environment and requirements: Databricks notebooks for Spark, Azure Data Factory for the Azure cloud, Pentaho, Informatica, etc. I appreciate the feedback. Good thoughts.
@najbighouse · 3 months ago
Which tool is recommended for a project where you have to call these jobs every 20 seconds? I suppose this is better for tasks that run once or twice a day and not in a constant loop, right? Only 10% of my tasks are daily or weekly. Any recommendations?
@BryanCafferky · 3 months ago
If the job is constantly running, then an orchestration service seems to be unnecessary. Perhaps you should consider using a streaming source.
@sau002 · 2 years ago
Nicely done. Airflow is being explored by one of our team members. I have a question for you: is it possible to debug the code on my local workstation before running it on Airflow?
@BryanCafferky · 2 years ago
Well, you can run Airflow locally, see airflow.apache.org/docs/apache-airflow/stable/start/local.html To test without Airflow, remember that Airflow just runs Python code in the specified sequence, so you should be able to test that code. Just run it in the order it will run when it is in Airflow.
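A common complementary check, in line with the reply above, is a local "DAG integrity" test: load the DAG folder with DagBag and assert it imports cleanly before deploying. A sketch to run with pytest; the folder path is hypothetical.

```python
from airflow.models import DagBag


def test_dags_import_without_errors():
    # Parses every file in dags/ exactly as the scheduler would.
    bag = DagBag(dag_folder="dags/", include_examples=False)
    assert bag.import_errors == {}, f"DAG import errors: {bag.import_errors}"
    assert len(bag.dags) > 0
```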
@DanielRodriguez-el1gb · 1 year ago
I'm having a problem handling around 50 scripts that generate reports that are sent to the users. I would like to schedule them but also activate them from some microservices. Is there any suggestion for that? :(
@BryanCafferky · 1 year ago
50 scripts generating and sending reports is probably not the ideal solution; a reporting tool would make more sense. However, you could use Azure Automation, which supports Python and PowerShell, to do scheduling and run the scripts. Azure Functions could also be used.
@vitorsilva-or1dj · 2 months ago
thanks
@teenspirit1 · 1 year ago
DAG is a bad name for a task schedule. 1. Directed graphs are obviously necessary if you want to define an execution flow, so duh. 2. The fact that you have a schedule interval means that your DAG isn't really acyclic, because it loops onto its own start node at the end node. 3. Acyclic graphs are good for reducing complexity and dependency between tasks, and that's a great thing. But that's an actual restriction, a lack of functionality, so it isn't really a feature.
@rursus8354 · 2 years ago
Good video! Besides, the singular of "vertices" is "vertex," because it is Latin.
@dragon3602010 · 2 years ago
Hey, can we compare it to n8n, or absolutely not?
@BryanCafferky · 2 years ago
Yes. I had to look up n8n, but it seems to be better focused on ETL work and has many connectors. However, it does not appear to run on Spark, so you would need to configure a Docker/K8s environment or use their cloud service, which is in the Azure Marketplace.
@abc8879 · 2 years ago
"you are only limited to python" -- I don't think this is a bad thing. Python is a stable and versatile language with libraries for everything. "it's complex" --- If the developer already knows Python, imo, airflow isn't difficult to learn. "Requires 100% coding" -- I see this as an advantage. I'm using both Airflow and Pentaho. With Pentaho, code review is just painful because the raw code is in XML which makes it difficult to read and to keep track of. Also, there's not a huge user base like python or airflow. So there isn't much help out there on stackoverflow.
@BryanCafferky · 2 years ago
Thanks for the feedback.
@guyvleugels8507 · 2 years ago
I'm not against visual programming or low code tools, but if your team are experienced developers, they'll be more productive and happier using all code. Nevertheless. ETL tools have their place. Let your teams use the tools that are most suited for the job and use Airflow as a centralized orchestrator. You can orchestrate ADF pipelines for instance.
@chasedoe2594 · 1 year ago
Coming from legacy ETL, I am kind of confused. As you said, a lot of people market it as an ETL tool, and when I look closely, I totally agree with you that it is cron on steroids. I guess it gets marketed as ETL because Python has pandas, which makes it relatively easy to ingest data compared to other frameworks, but doing really heavy ETL in pandas is not a perfect way to do it.
@BryanCafferky · 1 year ago
Depending on your needs and environment, you can use different tools: Azure Data Factory for Azure, AWS Glue, Databricks notebooks for Databricks which runs on Spark, Pentaho, Informatica, etc. Lots of choices.
@yahyaayyoub9959 · 2 years ago
You cannot compare Apache Airflow with Apache NiFi. These two tools aren't mutually exclusive, and they both offer some exciting capabilities and can help with resolving data silos. It's a bit like comparing oranges to apples: they are both (tasty!) fruits but can serve very different purposes.
@msingh1319 · 2 years ago
Hi Bryan, the GCP GUI ETL option is Data Fusion.
@BryanCafferky · 2 years ago
Thanks for that. Good to know.
@v4ldevrr4m47 · 1 year ago
Thanks. It is totally true that to use Airflow as a great ETL tool you need to focus your effort on Python. When you are a developer who uses Python and can prepare SQL queries, it turns out perfect. Anyway, I will consider NiFi because I don't know it. Let me read about it.
@joshi1q2w3e · 2 years ago
Just to make sure I’m understanding correctly: 1. If a company uses Azure or AWS they can just use ADF or AWS Glue instead of Airflow? If so it seems that Airflow is more for companies who do end-to-end python ETL/ELTs and don’t wanna pay for ADF or AWS Glue? 2. I’m a bit confused between what you said because a couple articles on the web and answers on Quora say that Apache NiFi is NOT a replacement for Apache Airflow. So are there things that Apache NiFi can’t do that Airflow can? 3. I really don’t wanna learn Airflow because of the learning curve but some jobs do require it :/ so if Apache NiFi can replace it I’d rather just use that. Do you know of any good resources to learn Apache NiFi or do you plan on making videos on it?
@BryanCafferky · 2 years ago
Thanks for your thoughts. Airflow can be a good solution but my point was that it is not an ETL tool. It is a job scheduler or orchestrator. It is promoted often as an ETL solution which I think is misleading. But yes, I too see jobs that ask for it. For complex workflows, it may make sense, especially streaming or something with complex dependencies. Bear in mind a given workflow cannot run concurrently with itself, i.e. each run must go from start to finish before it can start again. I would google NiFi or check Amazon for books. The documentation online looks pretty good. NiFI videos might be something I'll do in the future. It looks pretty cool.
@halildurmaz7827 · 1 year ago
As I know, Airflow is used for "scheduling" the ETLs; not "creating" the ETLs. So, can you perform both "creating" and "scheduling" operations via AWS Glue?
@BryanCafferky · 1 year ago
I've not used Glue but the docs say you can. "AWS Glue can run your ETL jobs as new data arrives. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs." For time based scheduling see docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html
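A sketch of the Lambda-triggers-Glue pattern quoted above: an S3 event invokes a Lambda function, which kicks off a Glue job. boto3's start_job_run is a real Glue API call; the job name and argument key are hypothetical.

```python
import boto3

glue = boto3.client("glue")


def handler(event, context):
    # S3 put-object events carry the bucket/key of the file that just arrived.
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    glue.start_job_run(
        JobName="nightly-sales-etl",                 # hypothetical Glue job
        Arguments={"--input_path": s3_path},
    )
```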
@halildurmaz7827 · 1 year ago
@@BryanCafferky Thank you so much for your attention. Then, if you are working for a company that uses a Cloud Platform, actually you do not even need Airflow.
@BryanCafferky · 1 year ago
@@halildurmaz7827 YW. If you just need to do ETL work, you don't need Airflow. If you need complex task orchestration, i.e. workflows, Airflow might be a good option.
@bres6486 · 1 year ago
I don't think social media is a good example of a DAG since in general if a is connected to b, b is connected to a, which are bidirectional (non-directed) edges. I suppose if you impose who connected to who first then you could keep it directed but that seems artificial.
@rick-kv1gl · 2 years ago
Your channel is underrated.
@BryanCafferky · 2 years ago
I didn't know it was rated but hope you find it useful.
@rick-kv1gl · 2 years ago
@@BryanCafferky Definitely, it's a hidden gem. Thanks for the content!
@BryanCafferky · 2 years ago
@@rick-kv1gl Thanks. Please let others know about my channel.
@zacharyedwards665 · 1 year ago
Got fucking boomed by the title
@steinofenb3645 · 1 year ago
What does ETL stand for in ETL Service? (at 4:02)
@BryanCafferky · 1 year ago
It stands for Extract, Transform, and Load.
@programminginterviewsprepa7710 · 2 years ago
Many times all-code is much better than no-code: much better version management, code reviews, and readability of existing code.
@BryanCafferky · 2 years ago
But what if you can do no code faster, cheaper, and with less bugs?
@samsal073 · 1 year ago
I agree Apache Airflow is a pain in the butt to learn, install, and figure out the code for... One big limitation is that it doesn't support Windows unless you run it inside a Docker container... I would rather use Apache NiFi since it can run on Windows, supports multiple scripting languages, and is UI-oriented vs. code, which makes it more productive and much easier to use.
@BryanCafferky · 1 year ago
NiFi may be a good option. Databricks Workflows are also a good one. See my video on it: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-tMH3K8Rncmk.html
@janekschleicher9661 · 1 year ago
I think a key characteristic of Airflow is that it is a static tool (which I personally really don't like, but let's try to keep it neutral). If you want to change something, you need to change the underlying code deployed to the server where Airflow is running. That means first going through the whole procedure to change the code, review it, run it through CI/CD in development, and then ship the code to the server (or probably just redeploy the server). That's a long process; you can take some shortcuts, but you will never have an experimental mode or fast prototyping. Even when working with a test instance, it is still a slow process.
For some use cases that's great, because there is always a definite, reliable, and versioned description of what's going on. But if you need to change workflows and aren't sure whether they work fine (e.g. because the production cluster differs in performance from your development cluster, or whatever), development speed goes down drastically. Even if you don't want to try it out live, you either have a lot of latency going to the development cluster or you need a huge machine to run it locally in a K8s setup (for realistic enterprise scenarios). There are benefits to having everything in code and inside GitOps, but it's certainly not fast prototyping.
The comparison to cron is very true. The only way to really check that it runs is to deploy it (like cron, too), but you should only deploy what you are sure runs, so it's a chicken-and-egg problem. You can run tests, but they don't look like the usual tests in Python or ETL or SQL databases or pandas, they are complex to write, and failure modes can be difficult to understand (especially checking all possible trigger rules). I would personally, in most cases, prefer a dynamic tool I could easily change while it is running. (You might still want to block changes on the production system, but at least for the development or staging environment this is what I really missed when working with Airflow.) But yeah, the visualizations are awesome, and explaining the complexity of a system to stakeholders works much more easily. So, in practice, you get a lot of acceptance even if work is slow, which counterbalances it significantly.
@BryanCafferky · 1 year ago
Thanks for the info.
@himanshutech8320 · 21 days ago
Thanks, excellent video. I recently moved to a data engineering project that uses Airflow with dbt (and Cosmos). I'm finding it difficult to understand why to use Airflow, especially with an ELT tool like dbt. For any task there is a dependency on available operators if you want to use Airflow, the Python code is tightly coupled with Airflow, and, as the video says, you have to code everything. It's not that you can't get work done with Airflow and dbt, but with something like Pentaho you would have done it with half the effort.
@BryanCafferky · 18 days ago
Thanks for your comment. I would suggest also looking at Dagster. They address many of my concerns with Airflow. dagster.io/ Not sure how well it works with Databricks clusters though.
@himanshutech8320 · 9 days ago
@@BryanCafferky Thanks. Will check !!
@yoyartube · 7 months ago
With the BashOperator and custom operators I find it hard to understand how it only supports python.
@BryanCafferky · 6 months ago
True, with the BashOperator it can do an OS call-out to run a script, but that's not tight integration. Your DAGs are defined in Python and Airflow is a Python framework. Thanks for your feedback.
@DaveAlbert54 · 2 years ago
I think you are mistakenly comparing Airflow to AWS Glue where AWS Step Functions (maybe also with AWS Glue) are a better representation of what it seems you get from Airflow. I'm not an expert in Airflow, but based on what is shown here in this video, that is the impression I get.
@BryanCafferky · 2 years ago
Thanks for the feedback. My intent was that ETL-focused tools include SSIS, Informatica, Databricks, Azure Data Factory, NiFi, Pentaho, etc. Airflow is a workflow orchestrator. I saw many places where it is promoted as an ETL service. It is not an ETL tool, although it can be used to orchestrate ETL work. However, unless there are many task dependencies, it is probably overkill.
@andalupu6145 · 2 years ago
Hi, please allow me to add that I used Airflow to run complex queries in Impala using .sql files (containing the Impala queries) run inside the DAG tasks in the needed order. This might be useful; for me it was. I agree that NiFi is best and my favorite. Thanks
@BryanCafferky · 2 years ago
Thanks. Yes. Sounds like you had a good use case for Airflow.
@vpn740 · 9 months ago
The entire functionality of Airflow is already available in tools like Control-M, Zeke, AutoSys etc., which have been on the market for more than two decades. What is it that Airflow is doing differently? It seems the programming cult has taken over the data processing and data management world and is rewriting all the tools the way they were in the 1980s. We intentionally moved away from the code-heavy data processing/management model because of its heavy and expensive maintenance costs. Almost 18 years ago, in the early days of my career, I worked on an "ELT" tool called Sunopsis (later acquired by Oracle). Today we are lauding a similar technology called "dbt" which is doing exactly what Sunopsis did 20 years ago. What's going on, folks?
@BryanCafferky · 9 months ago
Good feedback. Not sure about dbt. It seems to offer quite a bit for ETL, less so for scheduling/orchestration.
@ridwantahirhassen197 · 1 year ago
We have used Airflow extensively; it is AMAZING. I think the whole video revolves around "workflow orchestration is not that complicated and is of secondary importance", which is not usually the case. For "complex" workflows, using configurations is not any simpler or neater than writing Python scripts. It is also important that you test your workflow, and Airflow has that functionality. The UI feature is very handy: restarting jobs, clear visibility into what happened, etc. It also scales really well! This video is a little misleading!
@BryanCafferky · 1 year ago
Thanks for your comments. Did you watch the video? That's not what I said.
@bettatheexplorer1480 · 2 years ago
I love airflow.
@BryanCafferky · 2 years ago
So do I.
@paulellicapadilla3421 · 11 months ago
I've worked with the alternatives you mentioned, and you're missing one other product that surpasses all of them. That is: Dagster.
@BryanCafferky · 11 months ago
Yeah. Looks interesting . Do you work for Dagster?
@paulellicapadilla3421 · 10 months ago
@@BryanCafferky Nope, I don't work for Dagster. Just a mild-mannered data engineer trying to weed out all the noise in the tech world and find the right gems so I can focus on exploiting those gems to be productive and stay ahead in the game. Unfortunately, most of my time is spent weeding out noise. I thank you for your service in doing the same. I think the road to take in discovering new tools is to ask "why is this tool bad" rather than "why is this tool good".
@macbeth1910 · 2 years ago
Sorry, but there are many misleading statements here. Firstly, you are not coerced into using Python in your tasks; you can perfectly well orchestrate almost anything if you put your code in an image (so yes, you can use NodeJS, Java, etc). The learning curve is no more complicated than the one for any other framework, like Django (obviously we are in the "data processing" domain here). Most of all, it is a powerful tool to organize your tasks when using a bunch of cron jobs in microservices is not an option.
@BryanCafferky · 2 years ago
Thanks for your comment. Your code to orchestrate must be Python, which is a limitation. Parsimony is key. For a given project, the question is 'Do I need to take on the overhead of creating and maintaining code just to orchestrate work?' Code which can break. Add in the learning curve time and the future skill set needed for employees. It is powerful, but with great power comes great responsibility. I don't think most data movement/transformation cases need Airflow.
@OgnyanDimitrov · 2 years ago
@@BryanCafferky The validity of the reasoning is best observed if you compare and contrast Airflow with Alteryx. Then we really see the difference in the learning curve. Alteryx and Kettle allow non-devs to make ETL pipelines, and the learning curve for non-devs is shorter. Am I correct in my assumptions? Thanks for the video. It was a real time saver.
@BryanCafferky · 2 years ago
@@OgnyanDimitrov Yes. You got it! Thanks
@guyvleugels8507 · 2 years ago
I don't really understand Python being a limitation here. It's just the technology and ecosystem Airflow is using. SSIS, ADF, Pentaho... they all have their limitations in the ecosystems they sit in. As for maintaining code, the same applies to SSIS, ADF... only you build logic using a visual tool instead of all code. Airflow has lots of pre-built provider packages for database actions, ADF, Databricks, non-data-related stuff... which you can use so you don't need to build tasks from scratch. Thanks for the vid btw. Your other points were valid. Airflow is indeed an orchestrator, not an ETL tool. 😊
@jamescaldwell3207 · 2 years ago
I would argue that the useful functions should be called into the airflow context from a separate module. With this methodology python could be used to run code outside airflow support. Am I missing something?
@BryanCafferky · 2 years ago
What are you responding to specifically?
@jamescaldwell3207 · 2 years ago
@@BryanCafferky Reusability of code utilized by Airflow. For context, I landed here while listening to arguments for and against Airflow because I'm trying to figure out whether I'm going to learn it or Prefect. I don't know much about either, hence the question at the bottom of my comment.
@BryanCafferky · 2 years ago
@@jamescaldwell3207 Did you watch the video? I have no issue with reuse. Whichever fits your requirements with the least cost/effort to maintain is probably the best tool.
@jamescaldwell3207 · 2 years ago
@@BryanCafferky Of course I did. My comment was regarding right around 13:50 where you state that generic functions cannot be used anywhere else because of the decorators. I would think non-specific functions would be in a separate module and imported for use inside a task. If that function is specific to airflow but generic within the operational capacity of airflow, then one could create an airflow specific library for use across multiple jobs. As stated, I'm deciding whether to learn one of two tools and my comment was an assumption which posed the question if I was missing something. Having now looked it up in the spirit of ending what is starting to feel like a combative exchange, I've learned my assumption was correct.
@BryanCafferky · 2 years ago
@@jamescaldwell3207 Sorry. No worries. Glad to get the question. I recorded this video 5 months ago so not all the details are still fresh in my mind. The reference time was helpful. Your point is valid. In fact, you could create non airflow generic function libraries too. As I look back at this, I can see that when using the decorator, only the outermost function is decorated. Also, you can write code that does not use the decorators although I think the decorators are intuitive. See this blog for more details. airflow.apache.org/docs/apache-airflow/stable/modules_management.html
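A small sketch of the shared-library pattern discussed in this thread: generic helpers live in a plain module on the scheduler's module path (the modules_management doc linked above covers the mechanics) and get reused by any DAG, with only the decorated wrapper being Airflow-specific. The names and paths here are hypothetical; in a real deployment the helper would sit in its own package, e.g. common/io_helpers.py, and be imported from there.

```python
from datetime import datetime

from airflow.decorators import dag, task


# Generic helper with no Airflow dependency; normally imported from a shared
# package (e.g. `from common.io_helpers import row_count`) by multiple DAGs.
def row_count(path: str) -> int:
    with open(path) as fh:
        return sum(1 for _ in fh)


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def audit():
    @task
    def count():
        return row_count("/data/export.csv")  # hypothetical file

    count()


audit()
```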
@pulanala1421 · 8 months ago
Can it compete with Control M?
@BryanCafferky · 8 months ago
Don't know. Never heard of Control M. Do you work for them?
@pulanala1421 · 8 months ago
@@BryanCafferky Nope, it is a commercial scheduling tool I have used, and based on your presentation, everything you mentioned is exactly what Control-M does: a task or job scheduling tool!
@swapnilpatil6986 · 1 year ago
Wonderful video, myth busted. Can you please shed some light on the dbt tool? It's also being promoted as an ETL tool, but I am not sure of its use case.
@rdean150 · 1 year ago
Surprised you didn't mention Argo as an alternative.
@BryanCafferky · 1 year ago
There are many alternatives. Too many to cover them all. Thanks for the suggestion.
@llorulez · 1 year ago
At the current startup where I work it's good enough, not expensive, and easy to use.
@BryanCafferky · 1 year ago
Thanks for the comment. Yes. It does do a lot. What other Workflow engines were considered?
@llorulez · 1 year ago
@@BryanCafferky Mainly Kubeflow, but our company is not big enough to use fully dedicated Kubernetes clusters. Any tool you would recommend? Interesting video btw.
@BryanCafferky · 1 year ago
@@llorulez Thanks for the info. It all depends on what you need to do. The video was meant to get people to stop and think before jumping in, as Airflow is pretty complex but can be a great solution. For ETL/data movement, if the workflow is sequential, I would use a simpler tool, which I mention in the video. Databricks notebooks/jobs can work well, but it depends on whether you need the scale. Dask looks good for non-Spark loads and is really easy to start with, but gets complex with the scale-out. Each public cloud has its own ETL PaaS services as well. My focus is parsimony, i.e. just enough to do what you need and no more.
@llorulez · 1 year ago
@@BryanCafferky Maybe it was easy for me because we extensively use Docker and it was quick using DockerOperators, but as you mention it can be really challenging.
@cmcmahon1978 · 1 year ago
Dear lord... please don't use ADF over Airflow unless you are doing a deployment pipeline. Unless you enjoy working DEEEP under the covers doing things like spinning up PowerShell jobs to complete tasks in an environment that is not really strongly backed by source control... unless you want to link it to a git repo and stare at JSON blobs to figure out what's wrong with the underlying "code". I do agree with you that there is a finite set of things that Airflow is good at and things that it shouldn't be used for out of the box. I wholeheartedly disagree that the "Python" needed to do many of the simpler DAG use cases is difficult to accomplish, as most of the out-of-the-box operators are pretty thoroughly documented and example code on how to use them lives everywhere on the internet. I would say that even if you want to do something that Airflow doesn't do directly out of the box, there is always the ability to use the numerous Python operators to run custom code, or to spin up Kubernetes pod operators and allow them to scale in the cluster for heavier ML tasks.
"Use Databricks"... yes you can... but Databricks is a potentially expensive way to orchestrate one thing, whereas Airflow can not only orchestrate Spark but do many things that Databricks can't do. Also, at the end of the day Databricks just winds up being a bunch of JSON. I think the ETL code you show is a fairly OK example of example code, but not really an example of how an ETL process would be set up in the real world. Nor are you showing many of the purely built-in operators that let you orchestrate jobs in a tremendous number of services in one centralized place.
Mostly, IMO: yes, if you don't want to write any code, don't use Airflow. If you are OK with some mostly cut/paste code for many basic DAGs and functions but also want the ability to do things that none of the other mentioned tools I have personally looked at can do, I would give Airflow a shot. Or, if you aren't into doing ANY of the management work, look at a managed Airflow service.
@XxXxXboxLivexXxXxX · 1 year ago
I evaluated Airflow and Luigi (which you didn't mention), and I feel that Airflow is the one with enough extensibility to work with my company's compute resources/environment. It seems you just went through the tutorials and didn't implement anything significant in Airflow. The limitations you mention seem a little arbitrary (most people like Python), and I don't understand how they are resolved with the other options or what associated tradeoffs I would be making. Still going to use Airflow; this is clickbait.
@BryanCafferky · 1 year ago
Thanks for your thoughts, but I have asked colleagues who have used Airflow extensively and they agreed with my points. Also, most of the viewers of this video who left comments agreed and confirmed it with their experiences. It's not about Python; it's about the best solution to a problem. Sometimes that will be Airflow, but for most use cases I don't think it is, and I get concerned when people get defensive about a given technology. BTW: it's not clickbait when you follow through with content that is consistent with the title. Live long and prosper.
@kaanmutlu4953 · 1 year ago
My exact thoughts... literally a job scheduler on roids...
@MattCamp · 2 years ago
did you really delete my comment.. wow.. I didn't even say anything bad.. just that I disagreed and thought you were wrong..
@damarh · 1 year ago
I am actually looking for a scheduler to run Python scripts, but if that means I have to write MORE Python... good lord.
@pmsanthosh · 1 year ago
Kettle by Pentaho is slow.
@kelvinsanyaolu4899 · 2 years ago
I'm actually looking at Prefect.
@BryanCafferky · 2 years ago
Do you work for them?
@kelvinsanyaolu4899 · 2 years ago
@@BryanCafferky Nope, not at all, but I have been setting up some light jobs on it, and though it's similar to Airflow, I like the fact that your code is completely decoupled from the cloud environment. Overall it seems easier to use than Airflow.
@BryanCafferky · 2 years ago
@@kelvinsanyaolu4899 Thanks. Good to know.
@luabida · 1 year ago
I think the right title should be "Don't Use Apache Airflow if you are a Data Scientist", because as a junior DevOps engineer, Airflow looks awesome compared to cron scripts, at least in the project I'm in.
@BryanCafferky · 1 year ago
Could be. I am finding that installing and configuring Airflow can be challenging. I only see one SaaS offering on Azure for it and it starts at 45K.
@luabida · 1 year ago
@@BryanCafferky Yeah you are totally right, I'm trying to implement it in a docker project with conda deps, and hell, this is hard
@BryanCafferky · 1 year ago
@@luabida Thanks. I was wondering if it was me. :-) Usually, just to get a basic dev environment for a tool is easy but not this. Python Dask is a piece of cake and for Spark, you can just use Databricks Community Edition.
@luabida · 1 year ago
@@BryanCafferky It works perfectly when you build the system around it, but the thing is that I need to execute Python modules from outside Airflow's container. I think the best way will be to define every single dependency I need in Airflow's Dockerfile so it can run the tasks.
@BryanCafferky · 1 year ago
@@luabida Yeah, I think that makes sense. Reach out on LinkedIn if you would like to connect. I'd be interested in following your progress on this.
@eth6706 · 1 year ago
Azure Data Factory is far superior in my experience. Airflow isn't terrible though.
@rjribeiro · 2 years ago
- I thought it was obvious that Airflow's use case is to be the orchestrator of a data pipeline, not the executor. Whoever uses Airflow for ETL/ELT is using it wrong.
- I don't see a problem with it being a code-oriented tool, as Python is very easy to learn. It's almost low code.
- The comparisons with "best options" were meaningless; the use cases are different. It would have been more logical to have cited Prefect, or perhaps Dagster.
@edpearson5464 · 2 years ago
Python as low code was a good laugh to start my morning, thanks.
@Munchopen · 2 years ago
Seriously, why should it be a downside that it is 100% code and requires DevOps to manage? These days that's mostly what we want, because it gives 100% control over what happens in the flows and how it happens. I see tons of ETL tools (which often are also job schedulers) being used mainly as schedulers, with a lot of workarounds or custom SQL around them, simply because they don't facilitate or support the most basic things needed to do heavy transformations etc. If these out-of-the-box "drag and drop" tools are so good, why do I keep seeing people implement workarounds on them? It doesn't make any sense these days. IMO, I would rather see more code in ETL work than less, and have well-written code do the job for us. Code that I can audit and move around as I want. Also, I think you aren't doing justice to Airflow at all. Saying that it is complicated to use is really hilarious. All things take time to learn, mate. When you know the tool well, it takes no time to do things in it. That's how all things work in this world.
@BryanCafferky · 2 years ago
Thanks for your comments. The gist of my thoughts is parsimony. Use the solution that requires no more work to maintain than necessary while meeting the requirements. If Airflow is needed, great! I think it not needed in most situations.
@pguti778 · 1 year ago
Wow, Airflow is really bad... same as Airbnb. Crazy, the ones that use it.
@qinlingzhou8815 · 10 months ago
Pentaho? Give me a break, please! We are finally getting rid of it. Are you just talking without doing?
@BryanCafferky · 10 months ago
Manners please. I have over 30 years experience and have done tons of ETL work. I was just offering alternatives. How much experience do you have? Just curious.