These are some of the best optimizations I have seen. This kind of optimization can only come from a deep understanding of Spark internals plus lots of experience. Kudos to the speaker! Much appreciated!
In my program, an anti join (a NOT EXISTS condition against the same table/dataset) is producing a BroadcastHashJoin, and it is doing a nested loop join. I have tried caching and repartitioning, but every time it hits the broadcast threshold of 8 GB. Even disabling the broadcast threshold via Spark's conf does not seem to work. Can you please suggest a solution?
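For reference, a minimal sketch of how auto-broadcast is usually switched off in PySpark (this assumes an existing SparkSession named `spark`; the value `-1` disables the threshold entirely):

```python
# Sketch: disable Spark's automatic broadcast join
# (assumes an existing SparkSession `spark`).
# A threshold of -1 tells the optimizer never to auto-broadcast a join side.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# Caveat: for a non-equi anti join (e.g. NOT EXISTS with an inequality
# predicate), Spark may still fall back to BroadcastNestedLoopJoin, because
# sort-merge and shuffle-hash joins require equality keys. Rewriting the
# condition so the join has at least one equality key is what usually
# removes the nested-loop plan.
```

Note this is only a configuration sketch, not a guaranteed fix: if the plan is a BroadcastNestedLoopJoin driven by a non-equi condition, the threshold setting alone will not change the join strategy.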
Actually, it's tricky to explain the whole scenario, but the takeaway from this video would be enabling CBO and analyzing the table just before the anti join. The program is written in PySpark. Any suggestions on efficiently dealing with anti joins or NOT EXISTS with a correlated subquery (which actually breaks down to a join) would be of great help.
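A hedged sketch of that takeaway in PySpark: enable the cost-based optimizer, collect statistics right before the join, and express the NOT EXISTS as an explicit anti join on an equality key. The table name `my_table`, the DataFrames `df_left`/`df_right`, and the key column `id` are placeholders, not names from the original program.

```python
# Sketch, assuming an existing SparkSession `spark` and a registered table.
# Enable the cost-based optimizer and join reordering.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

# Collect table and column statistics so CBO has fresh estimates
# just before the anti join runs (`my_table` is a placeholder name).
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR ALL COLUMNS")

# A NOT EXISTS with an equality correlation can be written as a left anti
# join; with an equi-key, Spark can pick a sort-merge or shuffle-hash plan
# instead of a nested loop (`df_left`, `df_right`, `id` are placeholders).
missing = df_left.join(df_right, on="id", how="left_anti")
```

The design point: CBO statistics help Spark size the join sides correctly, and an explicit equi-key `left_anti` join gives the planner strategies other than broadcast nested loop.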