Thank you, bro! Your videos are very informative and helpful. Could you please make a video explaining how to set up Spark on a local machine? That would be very helpful.
Thanks @jdisunil for the kind words. There's already an in-depth video on AQE. You can refer to it here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-bRjVa7MgsBM.html
Hey @subaruhassufferredenough7892, thank you for the kind words, really appreciate it :) On Spark SQL: the DAGs/execution plans for both Spark SQL and the non-SQL (Python DataFrame) API are the same, as both are compiled and optimized by the same underlying engine, the Catalyst optimizer.
At 16:49, in the AQE plan for the larger dataset, the way I understood it is that 1 skewed partition was split into 12, so we finally had 24 + 12 = 36 partitions. We see the same in Job Id 9 at 13:40, which had 36 tasks. But I heard you say that 36 partitions were reduced to 24. Can you please help clear up the confusion? Thank you.
I think in that AQE step, AQEShuffleRead reads 200 partitions (as per the previous node) from the customers dataset, which are then coalesced to 24; then something happened that made them 36, which is why the right-side node shows "number of partitions: 36". On the left side, for the transactions dataset, "number of partitions: 36" appears as the last value, whereas on the right side, for the customers dataset, it appears as the first value. But I'm not sure what that "something" is???
Hi Afaque Ahmad, at 13:37 you were saying that there is a separate job for each shuffle operation: one job for the transactions dataset's shuffle and one for the customers dataset's. I'm a bit confused about why they need a separate job. As per my understanding, when Spark encounters a shuffle operation, it just creates a new stage within that job, right? When I execute the same code snippet, it creates 5 jobs in total: two for metadata (expected), two for the shuffle operations (not expected), and a final one for the join operation. Many thanks.
Hi Afaque Ahmad, at 7:24 you were saying that a batch is a group of rows and is not the same as a partition. Should we assume it is something like a group of rows read from one or more partitions available on one or more executors (not necessarily all executors) to match that df.show() count?
Hello bro, I have a doubt. At 23:30 in the video, it was mentioned that AQEShuffleRead coalesced the partitions into 1; in that case, will the other worker nodes sit idle? The video also mentions that even after the shuffle, all A's will be in one partition and all B's in another. Can you please explain what you actually mean by "number of coalesced partitions = 1"?
By default, there are 200 shuffle partitions, which is why you see that number in the 'Exchange' step. The reduction (optimization) to fewer partitions takes place in the 'AQEShuffleRead' step below it.
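For reference, these are the Spark configuration properties behind that behaviour; a sketch of a `spark-defaults.conf` fragment, with the values shown being Spark 3.x defaults:

```
spark.sql.shuffle.partitions                     200    # partition count you see in the 'Exchange' step
spark.sql.adaptive.enabled                       true   # AQE, enabled by default since Spark 3.2
spark.sql.adaptive.coalescePartitions.enabled    true   # lets AQEShuffleRead merge small shuffle partitions
```

With the last two enabled, AQE looks at the actual shuffle output sizes at runtime and coalesces the 200 partitions down to fewer, which is the reduction shown in the AQEShuffleRead node.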