Thanks Holden. Brilliant lecture. I'm looking at some inefficiencies with Spark joins, and this video helped me solve some of them. Keep it coming, please.
Vasuki Subbaramu Can you share some of those problems and solutions? I'm doing joins between large RDDs and they are very slow. Throwing more executors at them is not speeding things up.
Thanks a lot! As you explained, if a huge table joins to a small table, we can use broadcast. But what if I have 15 joins/tables one after another, and 3 of them are huge, say the 1st, 5th, and 11th? Going from the 1st to the 4th I can broadcast the 2nd, 3rd, and 4th, but I can't broadcast the 5th (copying the 5th to every node would be a disaster). So how do I handle this? Can it be tuned in a better way?
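For anyone new to the broadcast idea being discussed: the small-table joins avoid a shuffle by shipping the small side to every task as an in-memory lookup table. A plain-Python sketch of that map-side hash join (no Spark here; the table contents and column names are made up for illustration):

```python
# Plain-Python sketch of a broadcast (map-side) hash join.
# The small table becomes an in-memory dict that is "shipped" to
# every task; the large table is streamed past it with no shuffle.

def broadcast_join(large_rows, small_rows, key):
    # Build the lookup table once -- this is what gets broadcast.
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for row in large_rows:            # each task scans its own partition
        match = lookup.get(row[key])
        if match is not None:         # inner-join semantics
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    return joined

large = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 5}]
small = [{"id": 1, "name": "a"}, {"id": 3, "name": "b"}]
print(broadcast_join(large, small, "id"))
```

For the huge-vs-huge joins (the 5th and 11th tables) there is no broadcast shortcut; Spark falls back to a shuffle join, and the usual mitigations are pre-filtering each side or co-partitioning the data before the join.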
Thanks. Is the number of reduce workers/machines usually equal to the number of distinct keys? How are the worker machines selected out of the available set of workers? And what if I want the reduction to be done on a single machine, say the driver?
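Not the lecturer, but as I understand it the number of reduce partitions is a separate setting (often much smaller than the number of distinct keys), and a key picks its reducer by hashing. A rough sketch of the default hash-partitioner idea (numbers made up):

```python
# Sketch of how a key picks its reduce partition: hash the key
# modulo the number of reduce partitions. Many distinct keys land
# on the same reducer, so #reducers != #distinct keys.

def partition_for(key, num_partitions):
    return hash(key) % num_partitions

# The same key always maps to the same partition, so all of its
# values meet on one reducer.
p1 = partition_for("apple", 4)
p2 = partition_for("apple", 4)
assert p1 == p2
```

To force everything onto one machine you'd set the partition count to 1 (e.g. `coalesce(1)` / `repartition(1)`) or collect the data to the driver, though both create an obvious bottleneck.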
Actually, she meant that if there is a big RDD, try filtering it before doing the costlier join operation (per the basic Spark join principle). We can also split a dataframe into many dataframes, e.g. filter the big dataframe by date and loop over the operations, so it will be safer and faster. If there is no way to filter the dataframe, then I guess there is nothing we can do... yet.
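The "split by date and loop" idea above can be sketched in plain Python (no Spark; the data and column names are invented). Each slice of the big side is joined on its own, so no single join ever sees the full dataset:

```python
# Sketch of "filter the big side first, then loop": split the large
# dataset by date and join one slice at a time, so each join works
# on a much smaller input.

from collections import defaultdict

def join_by_date(big_rows, small_rows, key):
    # Group the big side by date. In Spark this would be a filter
    # per date inside a loop, not an in-memory groupby.
    by_date = defaultdict(list)
    for row in big_rows:
        by_date[row["date"]].append(row)

    lookup = {row[key]: row for row in small_rows}
    out = []
    for date in sorted(by_date):      # one small join per slice
        for row in by_date[date]:
            match = lookup.get(row[key])
            if match is not None:
                out.append({**row, **{k: v for k, v in match.items() if k != key}})
    return out

big = [
    {"id": 1, "date": "2016-01-01", "amount": 10},
    {"id": 2, "date": "2016-01-02", "amount": 20},
]
small = [{"id": 1, "region": "CA"}]
print(join_by_date(big, small, "id"))
```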
For the DIY approach to writing to a database: can't I just use df.write.mode(SaveMode.Append).jdbc(...)? Is using Spark's built-in JDBC connection not good practice? What is the best approach to write to a database from a Spark RDD?
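My understanding is that the built-in df.write.jdbc route is fine for plain appends; the "DIY" version usually means a foreachPartition loop with batched inserts, which you'd only write yourself for upserts or custom batching. A plain-Python sketch of that per-partition batching pattern, using stdlib sqlite3 as a stand-in for a real JDBC target (table name and schema are invented):

```python
import sqlite3

def write_partition(conn, rows, batch_size=2):
    # One writer per partition: accumulate rows, flush in batches,
    # commit once at the end (mirrors a foreachPartition JDBC writer).
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
            batch = []
    if batch:                          # flush the final partial batch
        conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
write_partition(conn, [(1, "a"), (2, "b"), (3, "c")])
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 3
```

One thing to watch either way: every partition opens its own connection, so a high partition count can overwhelm the database.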
I'm with +Igor Berman. The "solution" she described is exactly the problem. If I could already filter "all the world" down to just Californians, I wouldn't need the join in the first place. Anyone know what she's talking about with a filter transform here?
I see, my assumption was that the California RDD held a subset of fields, like just the ID, and the goal was to filter the world down to those records. It seems there's no shortcut for that, though.
+Patrick Hulce I watched the part again, and it does seem like she filters the world RDD by looking at the IDs in the CA RDD. Not sure how that would work without a shuffle.
+Patrick Hulce I found a blog post here: fdahms.com/2015/10/04/writing-efficient-spark-jobs/ discussing the whys and hows. The trick is that the key space of the medium-sized RDD fits in memory and can be broadcast. In my case my medium-sized RDD does not fit in memory, and my left outer join has the large RDD on the left, so a filter is useless anyway. I wish Spark had a built-in KV store to support lookup-based joins. I know it's possible to do with an external KV store, but it is annoying and adds maintenance.
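For anyone following along, the trick being described seems to be a broadcast semi-join filter: collect only the *keys* of the medium dataset (much smaller than the full rows), broadcast that set, and filter the huge dataset map-side before the real join. A plain-Python sketch, with invented names and data:

```python
# Sketch of the "broadcast the key space" trick: if the medium
# dataset's keys fit in memory, collect them into a set and filter
# the huge dataset down before joining. The filter runs map-side,
# so no shuffle is needed for this step.

def semi_join_filter(world_rows, ca_ids):
    # ca_ids is the broadcast key set
    return [row for row in world_rows if row["id"] in ca_ids]

world = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}, {"id": 3, "name": "z"}]
ca = [{"id": 2}, {"id": 3}]
ca_ids = {row["id"] for row in ca}   # this is the part that must fit in memory

print(semi_join_filter(world, ca_ids))
```

This only helps when the key set is small enough to broadcast, which matches the limitation described above: if even the keys don't fit in memory, the trick doesn't apply.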