Thanks Holden. Brilliant lecture. I'm looking at some inefficiencies with Spark joins, and this video helped me solve some of them. Keep it coming, please.
Vasuki Subbaramu Can you share some of those problems and solutions? I'm doing joins between large RDDs and they are very slow. Throwing more executors at them is not speeding things up.
Thanks a lot! As you explained, if a huge table joins to a small table, we can use broadcast. But what if I have 15 joins/tables one after another, and 3 of them are huge, say the 1st, 5th, and 11th? Going from the 1st to the 4th I can broadcast the 2nd, 3rd, and 4th, but I can't broadcast the 5th (copying the 5th to every node would be a disaster). So how do I handle this? Can it be tuned in a better way?
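For anyone new to the broadcast idea being discussed: the small-table joins avoid a shuffle by shipping the small side to every task as an in-memory lookup table. A plain-Python sketch of that map-side hash join (no Spark here; the table contents and column names are made up for illustration):

```python
# Plain-Python sketch of a broadcast (map-side) hash join.
# The small table becomes an in-memory dict that is "shipped" to
# every task; the large table is streamed past it with no shuffle.

def broadcast_join(large_rows, small_rows, key):
    # Build the lookup table once -- this is what gets broadcast.
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for row in large_rows:            # each task scans its own partition
        match = lookup.get(row[key])
        if match is not None:         # inner-join semantics
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    return joined

large = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 5}]
small = [{"id": 1, "name": "a"}, {"id": 3, "name": "b"}]
print(broadcast_join(large, small, "id"))
```

For the huge-vs-huge joins (the 5th and 11th tables) there is no broadcast shortcut; Spark falls back to a shuffle join, and the usual mitigations are pre-filtering each side or co-partitioning the data before the join.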
Thanks. Is the number of reduce workers/machines usually equal to the number of distinct keys? How are the worker machines selected out of the available set of workers? And what if I want the reduction to be done on a single machine, say the driver?
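Not the lecturer, but as I understand it the number of reduce partitions is a separate setting (often much smaller than the number of distinct keys), and a key picks its reducer by hashing. A rough sketch of the default hash-partitioner idea (numbers made up):

```python
# Sketch of how a key picks its reduce partition: hash the key
# modulo the number of reduce partitions. Many distinct keys land
# on the same reducer, so #reducers != #distinct keys.

def partition_for(key, num_partitions):
    return hash(key) % num_partitions

# The same key always maps to the same partition, so all of its
# values meet on one reducer.
p1 = partition_for("apple", 4)
p2 = partition_for("apple", 4)
assert p1 == p2
```

To force everything onto one machine you'd set the partition count to 1 (e.g. `coalesce(1)` / `repartition(1)`) or collect the data to the driver, though both create an obvious bottleneck.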
Actually, she meant that if there is a big RDD, try filtering it before doing the costlier join operation (per the basic Spark join principle). We can also split a dataframe into many dataframes, e.g. filter the big dataframe by date and loop over the operations, so it will be safer and faster. If there is no way to filter the dataframe, then I guess there is nothing we can do... yet.
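The "split by date and loop" idea above can be sketched in plain Python (no Spark; the data and column names are invented). Each slice of the big side is joined on its own, so no single join ever sees the full dataset:

```python
# Sketch of "filter the big side first, then loop": split the large
# dataset by date and join one slice at a time, so each join works
# on a much smaller input.

from collections import defaultdict

def join_by_date(big_rows, small_rows, key):
    # Group the big side by date. In Spark this would be a filter
    # per date inside a loop, not an in-memory groupby.
    by_date = defaultdict(list)
    for row in big_rows:
        by_date[row["date"]].append(row)

    lookup = {row[key]: row for row in small_rows}
    out = []
    for date in sorted(by_date):      # one small join per slice
        for row in by_date[date]:
            match = lookup.get(row[key])
            if match is not None:
                out.append({**row, **{k: v for k, v in match.items() if k != key}})
    return out

big = [
    {"id": 1, "date": "2016-01-01", "amount": 10},
    {"id": 2, "date": "2016-01-02", "amount": 20},
]
small = [{"id": 1, "region": "CA"}]
print(join_by_date(big, small, "id"))
```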
For the DIY approach to writing to a database: can't I just use df.write.mode(SaveMode.Append).jdbc(...)? Is using Spark's built-in JDBC connection not good practice? What is the best approach to write to a database from a Spark RDD?
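My understanding is that the built-in df.write.jdbc route is fine for plain appends; the "DIY" version usually means a foreachPartition loop with batched inserts, which you'd only write yourself for upserts or custom batching. A plain-Python sketch of that per-partition batching pattern, using stdlib sqlite3 as a stand-in for a real JDBC target (table name and schema are invented):

```python
import sqlite3

def write_partition(conn, rows, batch_size=2):
    # One writer per partition: accumulate rows, flush in batches,
    # commit once at the end (mirrors a foreachPartition JDBC writer).
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
            batch = []
    if batch:                          # flush the final partial batch
        conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
write_partition(conn, [(1, "a"), (2, "b"), (3, "c")])
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 3
```

One thing to watch either way: every partition opens its own connection, so a high partition count can overwhelm the database.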
I'm with +Igor Berman. The "solution" she described is exactly the problem. If I could already filter "all the world" down to just Californians, I wouldn't need the join in the first place. Anyone know what she's talking about with a filter transform here?
I see, my assumption was that the California RDD held a subset of fields, like just the ID, and the goal was to filter the world down to those records. It seems there's no shortcut for that, though.
+Patrick Hulce I watched the part again, and it does seem like she filters the world RDD by looking at the IDs in the CA RDD. Not sure how that would work without a shuffle.
+Patrick Hulce I found a blog post here: fdahms.com/2015/10/04/writing-efficient-spark-jobs/ discussing the whys and hows. The trick is that the key space of the medium-sized RDD fits in memory and can be broadcast. In my case my medium-sized RDD does not fit in memory, and my left outer join has the large RDD on the left, so a filter is useless anyway. I wish Spark had a built-in KV store to support lookup-based joins. I know it's possible to do with an external KV store, but it is annoying and adds maintenance.
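For anyone following along, the trick being described seems to be a broadcast semi-join filter: collect only the *keys* of the medium dataset (much smaller than the full rows), broadcast that set, and filter the huge dataset map-side before the real join. A plain-Python sketch, with invented names and data:

```python
# Sketch of the "broadcast the key space" trick: if the medium
# dataset's keys fit in memory, collect them into a set and filter
# the huge dataset down before joining. The filter runs map-side,
# so no shuffle is needed for this step.

def semi_join_filter(world_rows, ca_ids):
    # ca_ids is the broadcast key set
    return [row for row in world_rows if row["id"] in ca_ids]

world = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}, {"id": 3, "name": "z"}]
ca = [{"id": 2}, {"id": 3}]
ca_ids = {row["id"] for row in ca}   # this is the part that must fit in memory

print(semi_join_filter(world, ca_ids))
```

This only helps when the key set is small enough to broadcast, which matches the limitation described above: if even the keys don't fit in memory, the trick doesn't apply.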