
18 Understand DAG, Explain Plans & Spark Shuffle with Tasks 

Ease With Data
4.6K subscribers
2.9K views

Published: 22 Aug 2024

Comments: 33
@sureshraina321 · 9 months ago
Omg, this is serious stuff. I'm sure no online tutors are teaching this much depth, and I'm glad that I found your channel 😍
@easewithdata · 9 months ago
Thank you 💗 Make sure to share with your network and tag us.
@sarthaks · 7 months ago
Very thorough, detailed explanation of Spark internals... never seen such content before. Nice job!!
@easewithdata · 7 months ago
Much appreciated!
@sarthaks · 7 months ago
I have two follow-up questions: 1. How do we avoid shuffle, and does shuffle really impact performance in all scenarios? 2. Given a DAG, or say the explain plan, what are the areas or steps one needs to take care of when optimizing performance? @easewithdata
@easewithdata · 7 months ago
1. Yes, shuffle impacts performance, but we cannot avoid shuffle in all cases (consider aggregations). We need to optimize those steps so that they perform well. 2. There are many steps to optimizing a job in Spark. The first step starts with the explain plan/DAG. It only shows the way Spark is going to execute the code, but each step might have different bottlenecks that you need to look out for (e.g. skewness/spillage).
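A minimal PySpark sketch of using the explain plan as that starting point (the DataFrame and app name are placeholders, not the video's code): an aggregation always needs an Exchange, and the formatted plan shows the partial and final aggregate steps around it.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("explain-plan-demo").getOrCreate()

# Placeholder input; any DataFrame with a grouping key behaves the same way.
df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

# The aggregation forces a shuffle: partial HashAggregate -> Exchange -> final HashAggregate.
agg = df.groupBy("key").agg(F.sum("id").alias("total"))
agg.explain(mode="formatted")
```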
@chandrasekharreddy3617 · 9 months ago
Perfect explanation of the DAG, in detail and in an easy-to-understand manner. Thanks, now I have more confidence than before.
@easewithdata · 8 months ago
Glad it was helpful!
@aashisharora3536 · 4 months ago
Why didn't I read your blog earlier and land on your channel? Your explanation is superb, man. Please keep posting videos.
@easewithdata · 4 months ago
Thank you, please make sure to share with your network as well.
@Ravi_Teja_Padala_tAlKs · 7 months ago
Seriously super, bro. Keep going, and thanks for all this 🎉
@easewithdata · 7 months ago
Thanks 👍 Make sure to share with your network as well ☺️
@ravulapallivenkatagurnadha9605 · 7 months ago
Great explanation. Please continue uploading videos.
@easewithdata · 7 months ago
Please check the playlist for the next videos.
@ashishsahu4025 · 4 months ago
Nice work bro
@ateetagrawal9928 · 9 months ago
Very very informative video
@easewithdata · 9 months ago
Thank you, please make sure to share with your network 🛜
@ComedyXRoad · 1 month ago
Thank you. In real-time work, do we use cluster mode or the client mode which you are using now?
@easewithdata · 1 month ago
I am using the client mode.
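For readers who want to check this on their own session, a small sketch (the app name is arbitrary; spark.submit.deployMode is the standard property that spark-submit sets via --deploy-mode):

```python
from pyspark.sql import SparkSession

# --deploy-mode client keeps the driver on the launching machine (typical for
# notebooks/demos); --deploy-mode cluster places it on a worker node.
spark = SparkSession.builder.appName("deploy-mode-check").getOrCreate()
print(spark.sparkContext.getConf().get("spark.submit.deployMode", "client"))
```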
@rakeshpanigrahi577 · 2 months ago
Thanks for the awesome explanation. I ran the exact code in Databricks, but it skipped the repartitioning step somehow. It is not showing the relevant steps for repartitioning in the DAG or in the explain plan.
@easewithdata · 2 months ago
In order to replicate the same behaviour, just disable AQE.
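For reference, a minimal sketch of turning AQE off at runtime (the app name is a placeholder; the config key is the standard Spark switch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("no-aqe-demo").getOrCreate()

# AQE is on by default in Spark 3.x / Databricks; turning it off keeps the
# physical plan exactly as written, so the repartition exchanges show up in
# the DAG and the explain plan again.
spark.conf.set("spark.sql.adaptive.enabled", "false")
```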
@ansumansatpathy3923 · 2 months ago
Why is there a shuffle write for the read stage from files to a dataframe? Does that involve a shuffle? Also, a shuffle write of only KBs worth of data?
@easewithdata · 2 months ago
Shuffle is only involved when the next step is a wide operation. As for the KBs of data, it depends on the next stage: if you have a count, it will first make a local count before shuffling the data (which is what reduces it to KBs).
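A tiny sketch of that local-then-final count behaviour (placeholder data, not the video's files): even for a large input, only one partial count per partition is shuffled, which keeps the shuffle write in the KB range.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partial-count-demo").getOrCreate()

# count() runs a partial count on each partition first; only those tiny
# partial results are exchanged to compute the final total.
df = spark.range(10_000_000)
print(df.count())
```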
@at-cv9ky · 6 months ago
Since the default parallelism is 8, only 8 tasks can run in parallel. So can you explain how the 200 tasks in the join transformation ran in parallel?
@easewithdata · 6 months ago
All 200 tasks didn't run in parallel. Batches of 8 tasks run one after another until all 200 tasks are completed.
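A small sketch of where those two numbers come from (assuming a local session with 8 cores, as in the video setup): the core count fixes how many task slots exist, while spark.sql.shuffle.partitions fixes how many shuffle tasks there are in total.

```python
from pyspark.sql import SparkSession

# 8 cores -> 8 task slots, so the 200 shuffle tasks of a join run in waves of at most 8.
spark = (
    SparkSession.builder
    .master("local[8]")
    .appName("task-waves-demo")
    .getOrCreate()
)

print(spark.sparkContext.defaultParallelism)           # 8 concurrent task slots
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 200 shuffle tasks by default
```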
@at-cv9ky · 6 months ago
@easewithdata Thanks. If possible, kindly make a series on Databricks as well. Just curious to understand how it integrates with Spark.
@anupb.a983 · 7 months ago
Doubt: since the sum also needs shuffling, just like the join, why are 200 partitions not created for the sum?
@easewithdata · 6 months ago
If AQE is enabled, you will not find 200 shuffle partitions. It will coalesce all unnecessary shuffle partitions. Check out - ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-164OKvwW8T8.html
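A minimal sketch of that coalescing behaviour (placeholder data; both config keys are the standard AQE settings and are already on by default in Spark 3.x):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aqe-coalesce-demo").getOrCreate()

# With AQE and partition coalescing enabled, small shuffle partitions are
# merged at runtime, so a groupBy/sum rarely runs with all 200 shuffle tasks.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

df = spark.range(1_000_000).withColumn("key", F.col("id") % 10)
result = df.groupBy("key").sum("id")
result.collect()   # run it, then check the Spark UI for the actual task count
result.explain()   # the plan is wrapped in an AdaptiveSparkPlan node
```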
@dataworksstudio · 8 months ago
Bro, very good explanation, but I have a doubt... I followed all your steps, but instead of 229 tasks, 217 are being created for me, and 4 stages... and in the Spark job I don't see the (7+5) tasks for repartitioning the dataframes, hence 2 stages are also missing... any idea why? Thank you.
@easewithdata · 8 months ago
Hello, please disable AQE and broadcast join to replicate the same behaviour. I did that just to explain how things work in the background. With Spark 3, AQE is enabled by default.
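For anyone replicating the video's numbers, a short sketch of the two settings being turned off (standard Spark config keys; the app name is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("replicate-video-setup").getOrCreate()

# Both are Spark 3.x defaults: AQE coalesces/changes shuffle partitions at
# runtime, and automatic broadcast joins replace the 200-partition
# sort-merge-join shuffle. Disabling them exposes the raw plan.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```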
@alishkumarmanvar7163 · 2 months ago
.
@pawansharma-pz1hz · 5 months ago
Thanks for the detailed explanation. I have one doubt: after the step df_union = df_sum.union(df_4), why is it showing 229 tasks again in the job DAG?
@easewithdata · 5 months ago
Yes, those would be skipped stages.
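Skipped stages simply mean the shuffle output of a stage was already available, so Spark lists the stage in the job's DAG but does not recompute it. A small RDD-level sketch (not the video's code) where this is easy to reproduce in the Spark UI:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skipped-stages-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(range(100)).map(lambda x: (x % 10, x))
sums = pairs.reduceByKey(lambda a, b: a + b)

sums.count()    # Job 1 runs the shuffle map stage plus a result stage.
sums.collect()  # Job 2 finds the shuffle files already written, so the map
                # stage appears in its DAG marked as "skipped".
```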