I don't know how to express my gratitude to you. I went through a lot of other YouTube videos, but none of them explained DAG execution like you do. I have become a fan. Your video is well worth watching, and I recommend it to everyone.
I love your explanations. The DAG is very important for Spark optimization. I am a big follower of your courses on Udemy and other platforms. Please keep sharing such amazing videos with us.
Awesome. You should record other videos about what shuffle read and shuffle write mean (how to interpret them) and a little bit about how to understand the state of Spark (mainly from the DAG). I think this kind of information is missing from the Spark documentation (or at least it's pretty well hidden). Keep making videos like these! Thx
Great video! This is my first visit to your channel, glad that YouTube showed this video in my suggestions. I already know the basics of Spark, having learned a few things here and there, but I suspect I'm missing many things. Do you have a Spark for beginners course??
Awesome explanation of Spark DAGs in layman's terms. I had a follow-up question: how is the number of tasks (partitions) decided for the other stages in the DAG? You talked about the stages that had, say, 7 or 9 tasks, and we could identify those from the transformations, i.e. the repartition(7) or repartition(9) you did. BUT what about the other stages? How are the tasks calculated for those?
There are various defaults that Spark uses. For example, parallelize uses the number of virtual cores as the number of partitions (in the absence of other configuration); Spark SQL operations that involve a shuffle use 200 partitions by default, etc.
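For reference, both of those defaults have explicit configuration keys that can be overridden. A sketch in spark-defaults.conf form (the values here are illustrative, not recommendations — tune them to your cluster):

```properties
# Illustrative values only - adjust to your cluster.
spark.default.parallelism=8        # partition count for RDD ops like parallelize, when not specified
spark.sql.shuffle.partitions=200   # partition count after a shuffle in Spark SQL (the 200 default)
```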
Hi Daniel, before explaining the join step you mentioned that the smaller table is broadcast to the executors. Then why is there an exchange due to the join?
Excellent video and explanations! One comment: at 14:18 the join seems to happen in stage 6 with the broadcast dataframe from the previous job. The shuffle between stages 6 and 7 is for collecting the local sums of each partition and summing them up to get the final result; the 413 bytes of shuffle write in stage 6 is an indication of that. Is that correct?
Hi Daniel, thanks for the great video. I have been following you on Udemy, and I was really looking for the DAG explanation. Thanks a ton!! Also, a request: could you point me to any resource where I can learn more about Spark DAGs?
I'll create some more material in time - for now, the best resources I have are in the Spark Optimization courses on Rock the JVM: rockthejvm.com/p/spark-optimization
Hi Daniel - incredible video. I work with Spark SQL queries and often need to optimize poorly performing SQL queries by analyzing details in the Spark UI. Do any of your resources discuss this in detail? If so, could you please point me in the right direction? Would love to hear more from you. Thanks
Yes - I have long-form, in-depth courses on Spark performance at rockthejvm: rockthejvm.com/p/spark-optimization rockthejvm.com/p/spark-performance-tuning
@@rockthejvm I took a look at the courses, but I think they go into detail using Scala, which I don't know. Do you have a course on performance tuning of SQL queries that is generic in nature, e.g. select, joins, where clauses, etc.?
@@rockthejvm Hi Daniel, I am looking forward to purchasing the "spark-optimization" course. Could you please let me know which of these 2 courses discusses SparkSQL query optimization most extensively: reading query plans, join optimization, interpreting the Spark UI, rewriting queries, etc.? Thanks
Thank you so much, nice explanation! You are doing a "sum" — can you instead do a "count" of records after the aggregation? When we use "count" we see only one task, i.e. one partition, and it's killing performance. How do we handle/repartition the data in such a scenario?
It's not killing performance — that aggregation works in the same way: partial aggregations are computed per partition before being collapsed into one value, so the single-task stage only combines a few small partial results.
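To make the point above concrete, here is a minimal pure-Python sketch (made-up data, not Spark's actual internals) of how a count() is computed in two stages: each partition produces a partial count in parallel, and only those tiny partial results reach the final single task.

```python
# Hypothetical data already split into partitions, as Spark would hold it.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Stage 1: each task counts its own partition locally (runs in parallel
# across executors in real Spark).
partial_counts = [len(p) for p in partitions]  # [3, 2, 4]

# Stage 2: the single final task only sums a handful of integers, so the
# one-partition stage is cheap rather than a bottleneck.
total = sum(partial_counts)  # 9
```

The full rows are never shuffled to one task — only the per-partition counts are, which is why the final single-task stage is not a performance problem.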
Is it possible to capture the DAGs while a job is running, in order to determine the average job completion time? This would be helpful for fine-tuning the job scheduler to reduce the average job completion time.
Hey Daniel, great content!! I was wondering if you could lay out the order in which you would advise someone coming from a Python background to take your courses, and whether there are any supplementary reading materials you would advise reading in between courses. Thanks in advance :)