Spark Join Without Shuffle | Spark Interview Question

21K views

#Spark #Join #Internals #Performance #optimization #DeepDive #Shuffle: In this video, we discuss how to perform a join without a shuffle.
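The idea behind the title can be sketched in plain Python rather than Spark. This is only a simulation under stated assumptions: `hash_partition`, `local_join`, `copartitioned_join`, and the toy `large`/`small` datasets are hypothetical names invented for illustration, not the code shown in the video. It shows that once both sides are partitioned by the same hash function into the same number of partitions, matching keys already sit in the same partition index, so the join itself can run partition by partition without moving data again.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Simulate one shuffle: route each (key, value) pair to partition hash(key) % n."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def local_join(left_part, right_part):
    """Inner-join two partitions that sit at the same index; no data moves."""
    index = defaultdict(list)
    for key, value in left_part:
        index[key].append(value)
    return [(key, (lv, rv)) for key, rv in right_part for lv in index.get(key, [])]


def copartitioned_join(left_parts, right_parts):
    """Join partition i of the left side only with partition i of the right side."""
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(local_join(left_part, right_part))
    return joined


# Hypothetical toy data: integer keys keep hash() deterministic across runs.
large = [(i % 5, f"large-{i}") for i in range(20)]
small = [(k, f"small-{k}") for k in range(3)]

NUM_PARTITIONS = 2
large_parts = hash_partition(large, NUM_PARTITIONS)  # the one-time shuffle of each side
small_parts = hash_partition(small, NUM_PARTITIONS)

result = copartitioned_join(large_parts, small_parts)

# Reference result: a naive nested-loop join over the unpartitioned data.
naive = [(k, (lv, sv)) for k, lv in large for sk, sv in small if k == sk]
```

Note that `hash_partition` is itself the shuffle: the data still moves once per side. In Spark, the saving comes from paying that cost a single time, persisting the partitioned data, and then reusing it across several joins that share the same partitioner.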
Please join my channel as a member to get additional benefits such as materials on Big Data and Data Science, live streams for members, and much more.
Click here to subscribe: / @techwithviresh
About us:
We are a technology consulting and training provider specializing in technology areas such as Machine Learning, AI, Spark, Big Data, NoSQL, graph DBs, Cassandra, and the Hadoop ecosystem.
Mastering Spark : • Spark Scenario Based I...
Mastering Hive : • Mastering Hive Tutoria...
Spark Interview Questions : • Cache vs Persist | Spa...
Mastering Hadoop : • Hadoop Tutorial | Map ...
Visit us :
Email: techwithviresh@gmail.com
Facebook : / tech-greens
Twitter :
Thanks for watching
Please Subscribe!!! Like, share and comment!!!!

Published: 21 Aug 2024

Comments: 29

@shivrajsingh5559 3 years ago

That's what I was looking for. It's a great help, Viresh.

@mrkrish501 4 years ago

I'm really happy with your deep dive into Spark. Thank you.

@gemini_537 3 years ago

small2 is not defined. Also, why is the shuffle cost of partitioning the 2 RDDs separately lower than the shuffle cost of joining them directly? They are basically doing the same thing: moving data with the same join key to the same executor.

@gemini_537 3 years ago

I feel the title is misleading; repartitioning the 2 RDDs involves a shuffle.

@MohitKumar-st3ms 4 years ago

Let's say you have two large DataFrames. How would you optimize the join? And why are you using RDDs, when they are very slow compared to DataFrames?

@monku1821 3 years ago

I have been following the series and it's pretty good, but this video is not at all clear. You should make another one on the same question.

@naveenkumar-tb1de 4 years ago

I was asked this: if I have 2 tables with the same volume of data, but one has 10 columns and the other has 3 columns, how do I optimize the join?

@keyaar3393 3 years ago

Shuffling during the join, or repartitioning before the join... you are saying that the second one is better, right? What's the difference? You have not mentioned why it is better. Something has to take care of the repartitioning: either the join will shuffle or we have to repartition. That's fine, but please let us know why this approach is better.

@gemini_537 3 years ago

What's the benefit of persisting the 2 RDDs?

@Trip-Train 1 year ago

Why are you converting the DataFrame to an RDD? It is very bad practice in terms of performance.

@SpiritOfIndiaaa 4 years ago

Thanks, Viresh. RDDs have been used here; how do I do the same using Dataset/DataFrame? And where did you get "small2" from?

@shankargs7685 4 years ago

partition.get is returning None for largeRDD at line no. 14.

@adamantnams 4 years ago

Any suggestions for dataframes?

@IndianCoupleinUKBLR 4 years ago

Where did small2 come from? There are typos in the code. Can you please update it?

@SpiritOfIndiaaa 4 years ago

Really nice, thanks bro. In line 14, it should be "small.partition.get" instead of "small2.partition.get", right? And why is shuffle.partitions set to only 2?

@TechWithViresh 4 years ago

Otherwise the remaining 198 partitions would be empty.

@SpiritOfIndiaaa 4 years ago

@TechWithViresh is it "otherwise" or "in other words"? Do you want to keep 198 partitions empty?
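
The "198 empty partitions" remark in this thread can be illustrated with a small plain-Python sketch. It assumes, purely for illustration, that the demo data has only two distinct join keys (the video's actual data is not reproduced here), and `non_empty_partitions` is a hypothetical helper. The value 200 is the default of `spark.sql.shuffle.partitions`. With hash partitioning each key lands in exactly one partition, so the number of non-empty partitions can never exceed the number of distinct keys.

```python
def non_empty_partitions(keys, num_partitions):
    """Count the partitions that receive at least one key under hash partitioning."""
    return len({hash(key) % num_partitions for key in keys})


# Hypothetical toy data: 100 records but only two distinct integer keys.
keys = [1, 2] * 50

used_with_default = non_empty_partitions(keys, 200)  # default shuffle partition count
used_with_two = non_empty_partitions(keys, 2)        # the setting used in the video

empty_with_default = 200 - used_with_default
empty_with_two = 2 - used_with_two
```

Under this assumption, 198 of the 200 default partitions stay empty, while setting the partition count to 2 leaves none empty.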

@rohinirithe1522 4 years ago

I am getting an error for line number 14: "error: value partitioner is not a member of org.apache.spark.sql.DataFrame". Kindly suggest a fix.

@sagarrawal7740 7 months ago

The video recommendations at the end are blocking the content...

@rishigc 4 years ago

Even with repartitioning, we have to move data to different partitions, which causes a shuffle, don't we?

@Mryajivramuk 3 years ago

The concept is really worth testing. The code is incomplete in places; I took time to fill the gaps. The last line, display()... will it work in Scala Spark? 🙄

@TechWithViresh 3 years ago

This code will run fine on Azure Databricks.

@saurabhgarud6690 3 years ago

Very nice content on this channel, thanks for that. Q: Can range partitioning work here?

@gemini_537 3 years ago

What's the book/picture in the video?

@dipanjansaha6824 4 years ago

How to connect with you?

@TechWithViresh 4 years ago

TechWithViresh@gmail.com

@dheerendrakumarjain6672 2 years ago

Your example is not up to the mark. What you describe in your lecture is not understandable; it seems you did this only for the sake of creating a video. I did not get your point about how the join happens and what happens during it. Please describe it in a much more understandable manner.