
18 Understand DAG, Explain Plans & Spark Shuffle with Tasks 

Ease With Data
4.6K subscribers
2.9K views

Published: 22 Aug 2024

Comments: 33
@sureshraina321 · 9 months ago
Omg, this is serious stuff. I'm sure no online tutors are teaching this much depth, and I'm glad that I found your channel 😍
@easewithdata · 9 months ago
Thank you 💗 Make sure to share with your network and tag us.
@sarthaks · 7 months ago
Very thorough, detailed explanation of Spark internals... never seen such content before. Nice job!!
@easewithdata · 7 months ago
Much appreciated!
@sarthaks · 7 months ago
I have two follow-up questions: 1. How do we avoid shuffle, and does shuffle really impact performance in all scenarios? 2. Given a DAG, or say the explain plan, what are the areas or steps one needs to take care of when optimizing performance? @easewithdata
@easewithdata · 7 months ago
1. Yes, shuffle impacts performance, but we cannot avoid shuffle in all cases (consider aggregations). We need to optimize those steps so that they perform well. 2. There are many steps to optimizing a job in Spark. The first step starts with the explain plan/DAG. It only shows the way Spark is going to execute the code, but each step might have different bottlenecks that you need to look out for (e.g. skewness/spillage).
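A minimal PySpark sketch of using the explain plan as that starting point (the DataFrame and app name are placeholders, not the video's code): an aggregation always needs an Exchange, and the formatted plan shows the partial and final aggregate steps around it.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("explain-plan-demo").getOrCreate()

# Placeholder input; any DataFrame with a grouping key behaves the same way.
df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

# The aggregation forces a shuffle: partial HashAggregate -> Exchange -> final HashAggregate.
agg = df.groupBy("key").agg(F.sum("id").alias("total"))
agg.explain(mode="formatted")
```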
@chandrasekharreddy3617 · 9 months ago
Perfect explanation of the DAG, in detail and in an easy-to-understand manner. Thanks, now I have more confidence than before.
@easewithdata · 8 months ago
Glad it was helpful!
@aashisharora3536 · 4 months ago
Why didn't I read your blog earlier and land on your channel? Your explanation is superb, man. Please keep posting videos.
@easewithdata · 4 months ago
Thank you, please make sure to share with your network as well.
@Ravi_Teja_Padala_tAlKs · 7 months ago
Seriously super, bro. Keep going, and thanks for all this 🎉
@easewithdata · 7 months ago
Thanks 👍 Make sure to share with your network as well ☺️
@ravulapallivenkatagurnadha9605 · 7 months ago
Great explanation. Please continue uploading videos.
@easewithdata · 7 months ago
Please check the playlist for the next videos.
@ashishsahu4025 · 4 months ago
Nice work bro
@ateetagrawal9928 · 9 months ago
Very very informative video
@easewithdata · 9 months ago
Thank you, please make sure to share with your network 🛜
@ComedyXRoad · 1 month ago
Thank you. In real-time work, do we use cluster mode or the client mode which you are using now?
@easewithdata · 1 month ago
I am using the client mode.
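For readers who want to check this on their own session, a small sketch (the app name is arbitrary; spark.submit.deployMode is the standard property that spark-submit sets via --deploy-mode):

```python
from pyspark.sql import SparkSession

# --deploy-mode client keeps the driver on the launching machine (typical for
# notebooks/demos); --deploy-mode cluster places it on a worker node.
spark = SparkSession.builder.appName("deploy-mode-check").getOrCreate()
print(spark.sparkContext.getConf().get("spark.submit.deployMode", "client"))
```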
@rakeshpanigrahi577 · 2 months ago
Thanks for the awesome explanation. I ran the exact code in Databricks, but it skipped the repartitioning step somehow. It is not showing the relevant steps for repartitioning in the DAG or in the explain plan.
@easewithdata · 2 months ago
In order to replicate the same behaviour, just disable AQE.
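For reference, a minimal sketch of turning AQE off at runtime (the app name is a placeholder; the config key is the standard Spark switch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("no-aqe-demo").getOrCreate()

# AQE is on by default in Spark 3.x / Databricks; turning it off keeps the
# physical plan exactly as written, so the repartition exchanges show up in
# the DAG and the explain plan again.
spark.conf.set("spark.sql.adaptive.enabled", "false")
```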
@ansumansatpathy3923 · 2 months ago
Why is there a shuffle write for the read stage from files to a dataframe? Does that involve a shuffle? Also, a shuffle write of only KBs worth of data?
@easewithdata · 2 months ago
Shuffle is only involved when the next step is a wide operation. As for the KBs of data, it depends on the next stage: if you have a count, it will first make a local count before shuffling the data (which is what reduces it to KBs).
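A tiny sketch of that local-then-final count behaviour (placeholder data, not the video's files): even for a large input, only one partial count per partition is shuffled, which keeps the shuffle write in the KB range.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partial-count-demo").getOrCreate()

# count() runs a partial count on each partition first; only those tiny
# partial results are exchanged to compute the final total.
df = spark.range(10_000_000)
print(df.count())
```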
@at-cv9ky · 6 months ago
Since the default parallelism is 8, only 8 tasks can run in parallel. So can you explain how the 200 tasks in the join transformation ran in parallel?
@easewithdata · 6 months ago
All 200 tasks didn't run in parallel. Batches of 8 tasks run one after another until all 200 tasks are completed.
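A small sketch of where those two numbers come from (assuming a local session with 8 cores, as in the video setup): the core count fixes how many task slots exist, while spark.sql.shuffle.partitions fixes how many shuffle tasks there are in total.

```python
from pyspark.sql import SparkSession

# 8 cores -> 8 task slots, so the 200 shuffle tasks of a join run in waves of at most 8.
spark = (
    SparkSession.builder
    .master("local[8]")
    .appName("task-waves-demo")
    .getOrCreate()
)

print(spark.sparkContext.defaultParallelism)           # 8 concurrent task slots
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 200 shuffle tasks by default
```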
@at-cv9ky · 6 months ago
@easewithdata Thanks. If possible, kindly make a series on Databricks as well. Just curious to understand how it integrates with Spark.
@anupb.a983 · 7 months ago
Doubt: since the sum also needs shuffling, just like the join, why are 200 partitions not created for the sum?
@easewithdata · 6 months ago
If AQE is enabled, you will not find 200 shuffle partitions. It will coalesce all unnecessary shuffle partitions. Check out - ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-164OKvwW8T8.html
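A minimal sketch of that coalescing behaviour (placeholder data; both config keys are the standard AQE settings and are already on by default in Spark 3.x):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aqe-coalesce-demo").getOrCreate()

# With AQE and partition coalescing enabled, small shuffle partitions are
# merged at runtime, so a groupBy/sum rarely runs with all 200 shuffle tasks.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

df = spark.range(1_000_000).withColumn("key", F.col("id") % 10)
result = df.groupBy("key").sum("id")
result.collect()   # run it, then check the Spark UI for the actual task count
result.explain()   # the plan is wrapped in an AdaptiveSparkPlan node
```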
@dataworksstudio · 8 months ago
Bro, very good explanation, but I have a doubt... I followed all your steps, but instead of 229 tasks, 217 are being created for me, and 4 stages... and in the Spark job I don't see the (7+5) tasks for repartitioning the dataframes, hence 2 stages are also missing... any idea why? Thank you.
@easewithdata · 8 months ago
Hello, please disable AQE and broadcast join to replicate the same behaviour. I did that just to explain how things work in the background. With Spark 3, AQE is enabled by default.
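For anyone replicating the video's numbers, a short sketch of the two settings being turned off (standard Spark config keys; the app name is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("replicate-video-setup").getOrCreate()

# Both are Spark 3.x defaults: AQE coalesces/changes shuffle partitions at
# runtime, and automatic broadcast joins replace the 200-partition
# sort-merge-join shuffle. Disabling them exposes the raw plan.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```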
@alishkumarmanvar7163 · 2 months ago
.
@pawansharma-pz1hz · 5 months ago
Thanks for the detailed explanation. I have one doubt: after the step df_union = df_sum.union(df_4), why is it showing 229 tasks again in the job DAG?
@easewithdata · 5 months ago
Yes, those would be skipped stages.
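Skipped stages simply mean the shuffle output of a stage was already available, so Spark lists the stage in the job's DAG but does not recompute it. A small RDD-level sketch (not the video's code) where this is easy to reproduce in the Spark UI:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skipped-stages-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(range(100)).map(lambda x: (x % 10, x))
sums = pairs.reduceByKey(lambda a, b: a + b)

sums.count()    # Job 1 runs the shuffle map stage plus a result stage.
sums.collect()  # Job 2 finds the shuffle files already written, so the map
                # stage appears in its DAG marked as "skipped".
```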