
repartition vs coalesce | Lec-12 

MANISH KUMAR
21K subscribers
18K views

In this video I talk about repartition vs coalesce in Spark. If you want to optimize your Spark jobs, you should have a solid understanding of this concept.
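To make the distinction concrete, here is a minimal plain-Python simulation (not actual Spark code, just an illustrative sketch): repartition does a full shuffle and balances rows evenly across the new partitions, while coalesce only merges existing partitions, so skew can survive it.

```python
# Illustrative simulation only (plain Python, not Spark): repartition
# redistributes every row via a full shuffle, coalesce merely merges
# existing partitions without moving individual rows around.

def repartition(partitions, n):
    """Round-robin all rows into n new partitions (simulates a full shuffle)."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out

def coalesce(partitions, n):
    """Merge whole existing partitions into n groups (no shuffle)."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]      # imbalanced input partitions
print([len(p) for p in repartition(skewed, 2)])   # [5, 4] -- evenly balanced
print([len(p) for p in coalesce(skewed, 2)])      # [7, 2] -- still skewed
```

Real Spark's coalesce groups partitions by executor locality rather than by index, but the key property shown here is the same: no row-level shuffle, so imbalance can remain.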
Directly connect with me on:- topmate.io/man...
Flight Data link:- github.com/dat...
For more queries reach out to me on my below social media handle.
Follow me on LinkedIn:- / manish-kumar-373b86176
Follow Me On Instagram:- / competitive_gyan1
Follow me on Facebook:- / manish12340
My Second Channel -- / @competitivegyan1
Interview series Playlist:- • Interview Questions an...
My Gear:-
Rode Mic:-- amzn.to/3RekC7a
Boya M1 Mic-- amzn.to/3uW0nnn
Wireless Mic:-- amzn.to/3TqLRhE
Tripod1 -- amzn.to/4avjyF4
Tripod2:-- amzn.to/46Y3QPu
camera1:-- amzn.to/3GIQlsE
camera2:-- amzn.to/46X190P
Pentab (Medium size):-- amzn.to/3RgMszQ (Recommended)
Pentab (Small size):-- amzn.to/3RpmIS0
Mobile:-- amzn.to/47Y8oa4 (You absolutely should not buy this one)
Laptop -- amzn.to/3Ns5Okj
Mouse+keyboard combo -- amzn.to/3Ro6GYl
21 inch Monitor-- amzn.to/3TvCE7E
27 inch Monitor-- amzn.to/47QzXlA
iPad Pencil:-- amzn.to/4aiJxiG
iPad 9th Generation:-- amzn.to/470I11X
Boom Arm/Swing Arm:-- amzn.to/48eH2we
My PC Components:-
intel i7 Processor:-- amzn.to/47Svdfe
G.Skill RAM:-- amzn.to/47VFffI
Samsung SSD:-- amzn.to/3uVSE8W
WD blue HDD:-- amzn.to/47Y91QY
RTX 3060Ti Graphic card:- amzn.to/3tdLDjn
Gigabyte Motherboard:-- amzn.to/3RFUTGl
O11 Dynamic Cabinet:-- amzn.to/4avkgSK
Liquid cooler:-- amzn.to/472S8mS
Antec Prizm FAN:-- amzn.to/48ey4Pj

Published: 22 Aug 2024
Comments: 56
@dataman17
@dataman17 4 months ago
You make the subject interesting. Best channel in Data engineering. Thank you for the videos. Looking forward to more!
@vishaljare1284
@vishaljare1284 3 months ago
After purchasing a 25k course, I recently came to your YouTube channel and realised that your free course is worth more than it.
@AkshayBaishander
@AkshayBaishander 1 month ago
Which course did you take?
@user-qn6ud4hs3b
@user-qn6ud4hs3b 4 months ago
I got so many of my doubts cleared from your videos; they are put together so well and are very easy to understand.
@vishaljare1284
@vishaljare1284 3 months ago
I like your teaching style. The way you explain is excellent. Keep going, bro.
@ankitachauhan6084
@ankitachauhan6084 2 months ago
Very good teaching style with clarity, thanks!
@rahuljain8001
@rahuljain8001 1 year ago
Hi Manish, I have tried the same; the partitions are actually not removed. We should use partitioned_on_column.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").count().orderBy("partition_id").show(300) to check the partitions.
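The check in this comment can be mirrored in plain Python (a sketch only; the real PySpark version tags rows with pyspark.sql.functions.spark_partition_id() and groups on that column): count the rows per partition index, keeping empty partitions visible.

```python
# Plain-Python sketch of the partition check above (not Spark itself):
# map each partition id to its row count, so empty partitions show up too.

def rows_per_partition(partitions):
    """Return {partition_id: row_count}, including empty partitions."""
    return {pid: len(rows) for pid, rows in enumerate(partitions)}

parts = [[10, 20, 30], [], [40]]     # one partition ended up empty
print(rows_per_partition(parts))     # {0: 3, 1: 0, 2: 1}
```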
@abhishekrajput4012
@abhishekrajput4012 6 days ago
Thank you Manish Bhaiya
@prabhatgupta6415
@prabhatgupta6415 1 year ago
Godfather of SPARK
@muskangupta735
@muskangupta735 3 months ago
Great explanation
@ajaypatil1881
@ajaypatil1881 9 months ago
Another great video Bhaiya♥
@omkarm7865
@omkarm7865 1 year ago
Great work
@sakshijain5503
@sakshijain5503 3 months ago
Hello Sir, What is the difference between repartition and BucketBy? Thank You!
@ManishSharma-fi2vr
@ManishSharma-fi2vr 3 months ago
❤ Thanks Manish Bhaiya!!
@vishenraja
@vishenraja 2 months ago
Hi Manish, as you mentioned, with repartition the data will be evenly distributed. So if the best-selling product is distributed among multiple partitions, how will a join work, since for a join the same key should be on the same partition? Could you please explain this?
@kartikjaiswal8923
@kartikjaiswal8923 1 month ago
Nice explanation.
@SqlMastery-fq8rq
@SqlMastery-fq8rq 5 months ago
Well explained, Sir.
@raghavsisters
@raghavsisters 1 year ago
I think there should be a method to find the optimal number of partitions. If I have a large data set, it's difficult to try out partition sizes and time each partitioning.
@DpIndia
@DpIndia 11 months ago
Nice tutorial, all clear
@engineerbaaniya4846
@engineerbaaniya4846 1 year ago
Awesome content
@alokkumarmohanty8454
@alokkumarmohanty8454 1 year ago
Hi Manish, thanks for all your videos. I personally got to know so many things from them. I have a doubt here: for any given instance, how do we decide the number of partitions for both repartition and coalesce? I mean, in repartition(10), how do we decide on the number 10, for example?
@user-lp3qe9jj3m
@user-lp3qe9jj3m 1 year ago
Can you please explain repartition and coalesce with a real-time dataframe joining example, so we can see the real optimization of the joining process?
@TaherAhmed16
@TaherAhmed16 11 months ago
Well explained! One question: as we discussed in the earlier sessions, RDDs are immutable. So when we do a repartition or coalesce, does the old RDD with the imbalanced data still exist on the executor nodes along with the new repartitioned data? If yes, at what point does it get cleared, as it will keep increasing disk usage on the executor nodes? Should we do that manually in the code?
@TaherAhmed16
@TaherAhmed16 11 months ago
I was looking at your executor out-of-memory video; it looks like all RDDs will be there on the executor, but they will be spilled to disk based on LRU.
@manish_kumar_1
@manish_kumar_1 11 months ago
Your understanding is slightly wrong. The same data is referenced every time; it is not copied again. When you hit the action, the same data is referenced, and Spark knows from the DAG which one to pick. LRU only keeps evicting cached data.
@TaherAhmed16
@TaherAhmed16 11 months ago
@@manish_kumar_1 Got it, thanks.
@rushikesh6496
@rushikesh6496 1 year ago
We use coalesce/repartition to decrease the number of partitions. If a newly created partition's size is more than the executor memory, what will happen? Will it spill data to disk or cause OOM?
@Useracwqbrazy
@Useracwqbrazy 1 year ago
Repartitioning happens in primary memory, i.e., RAM... so I think it should give OOM.
@sanooosai
@sanooosai 5 months ago
Great, sir, thank you.
@user-rh1hr5cc1r
@user-rh1hr5cc1r 4 months ago
great
@pde.joggiri
@pde.joggiri 10 months ago
Doubt 1: Is there any difference between repartition(1) and coalesce(1)? Which one should we use when writing a single file? Doubt 2: I was reading multiple CSV files (6) into a dataframe, then writing with coalesce(1) and again overwriting with coalesce(10). It gives 6 partitions. Why did the partition count increase with coalesce()?
@surabhisasmal882
@surabhisasmal882 10 months ago
Great video Manish, very informative. Recently I was asked: if we have 200 partitions, would we prefer repartition(1) or coalesce(1)? Any insights, please?
@TheSmartTrendTrader
@TheSmartTrendTrader 10 months ago
Repartition(1) and coalesce(1) both output a single partition; however, from a performance point of view I would prefer coalesce, because it simply merges all input partitions into one without shuffling, while repartition does exchange partitioning under the hood and takes slightly more time. You can look at the explain plan of both and it will be visible there.
@akumar2575.
@akumar2575. 4 months ago
day 2 done 👍
@saumyasingh9620
@saumyasingh9620 1 year ago
Well explained! Thanks, keep posting. I was asked about reduceByKey in an interview; please explain that in some session as well. I am not clear whether we can use it with a dataframe or only with an RDD. Please comment.
@soumyaranjanrout2843
@soumyaranjanrout2843 7 months ago
We can't perform reduceByKey directly on a DataFrame. We need to convert it to an RDD first; then we can apply it: df.rdd.reduceByKey(anonymous function...)
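As a sketch of what reduceByKey does, here is a plain-Python version of its semantics (illustrative only; in real Spark the merge function runs per partition first and then across partitions, which is why it must be associative):

```python
# Plain-Python sketch of reduceByKey semantics (illustrative, not Spark):
# fold all values that share a key with a binary, associative function.

def reduce_by_key(pairs, fn):
    """Merge values per key using fn, like rdd.reduceByKey(fn)."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out

sales = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
print(reduce_by_key(sales, lambda x, y: x + y))  # {'a': 4, 'b': 6}
```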
@vishalmane3139
@vishalmane3139 1 year ago
Bro, about the interview questions from different companies that you have posted: if we follow your DE roadmap, will we be able to answer those interview questions?
@udittiwari8420
@udittiwari8420 6 months ago
Thank you, sir.
@raghavsisters
@raghavsisters 1 year ago
Do you have a git page for the code?
@akhilesh2186
@akhilesh2186 1 year ago
You are the second Khan Sir of Bihar. Just a suggestion: add some jokes in between to keep it lighter. Unfortunately your views are not that high, as there isn't a big audience yet, but I bet no one can explain better than you. I guess if you make the same videos in English as well, you will get more subscribers and viewers. I am from the Hindi/Bhojpuri region (Varanasi) and like your videos a lot, but whenever I look at your views and subscribers I think about whether I can increase them in any way, which is why I am giving this suggestion.
@poojajoshi871
@poojajoshi871 1 year ago
Hi Sir, in the withColumn line we are adding partition_id as a column, but how are we putting the value in that column when no literal is being introduced? Also, can you please explain spark_partition_id() and why we are using it?
@user-ks7wl9pc2f
@user-ks7wl9pc2f 1 year ago
With withColumn we add a new column, and spark_partition_id() is a built-in method available in pyspark.sql.functions.
@RahulPatil-iu2sp
@RahulPatil-iu2sp 1 year ago
Hi Manish sir, if we are processing a 1 TB file on a 10-node cluster (64 GB RAM each), will it get processed or throw an OOM error? Could you please explain this?
@manish_kumar_1
@manish_kumar_1 1 year ago
In short, it will run. Memory management is a vast topic, and explaining OOM is not simple, at least not in a comment. I will make a dedicated video on this topic. Stay connected with the channel.
@RahulPatil-iu2sp
@RahulPatil-iu2sp 1 year ago
Thanks for the clarification @Manish Sir. I'll stay tuned👍👍
@kalyanreddy496
@kalyanreddy496 1 year ago
Consider that you have read a 1 GB file into a dataframe and the max partition bytes configuration is set to 128 MB. If you apply repartition(4) or coalesce(4) on the dataframe, either method will decrease the number of partitions, and each partition's size becomes greater than 128 MB even though maxPartitionBytes is configured to 128 MB. Does it throw an error or not? If it throws an error, what error will we get when we execute the program? If not, what is Spark's behaviour in this scenario? Could you tell me the answer to this question, sir? I recently faced it in an interview. Requesting you, sir, please.
@manish_kumar_1
@manish_kumar_1 1 year ago
Your partitions will be of bigger size, around 250 MB each in deserialized form. But when you write as Parquet, snappy compression will reduce the size. Say each file becomes 150 MB after compression; then the total size will be 150*4 = 600 MB. If you read it again after writing, it will be read into 600/128 = 5 partitions.
@soumyaranjanrout2843
@soumyaranjanrout2843 7 months ago
Spark will not throw an error; it will go against the configuration. It will create 4 partitions regardless of maxPartitionBytes, and each partition will hold 1GB/4 = 250 MB (approximately), as Manish sir said. But it can lead to performance degradation. Moreover, maxPartitionBytes is configurable, so if we work with a larger dataset we can configure it as per the use case. By the way, thanks for your question; because of it I got to know some stuff beyond the video.
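The arithmetic in this thread can be checked directly. Illustrative numbers only: a 1 GB input (using 1024 MB, so 256 MB per partition rather than the rounder 250 MB quoted above), four output partitions, roughly 150 MB per Parquet file after snappy compression, and the default 128 MB for spark.sql.files.maxPartitionBytes.

```python
# Sanity-checking the partition-size arithmetic from the thread above.
import math

input_mb = 1024                       # ~1 GB read into the dataframe
per_partition_mb = input_mb / 4       # after repartition(4) or coalesce(4)
print(per_partition_mb)               # 256.0 MB each, above 128 MB; no error thrown

compressed_total_mb = 150 * 4         # four Parquet files of ~150 MB each
max_partition_bytes_mb = 128          # spark.sql.files.maxPartitionBytes default
reread_partitions = math.ceil(compressed_total_mb / max_partition_bytes_mb)
print(reread_partitions)              # ceil(600 / 128) = 5 partitions on re-read
```

Note that maxPartitionBytes only governs how files are split when *reading*; it does not cap the size of partitions produced by repartition or coalesce, which is why no error occurs.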
@user-ks7wl9pc2f
@user-ks7wl9pc2f 1 year ago
If we partition correctly according to our data, does our execution time decrease?
@manish_kumar_1
@manish_kumar_1 1 year ago
Yes
@user-ks7wl9pc2f
@user-ks7wl9pc2f 1 year ago
​@@manish_kumar_1 Thank you for the confirmation. Manish, I have watched all your videos, theory and practical; they are really awesome. You have explained things in a simple way that makes them very worthwhile 🙏❤
@rp-zf3ci
@rp-zf3ci 1 year ago
@manish please explain the bucketing concept in Spark.
@manish_kumar_1
@manish_kumar_1 1 year ago
Already did
@rp-zf3ci
@rp-zf3ci 1 year ago
@@manish_kumar_1 As per that video, you mentioned 5 buckets will be created after repartition(5). But I think it should be 5*5 = 25 buckets, 5 buckets for each task. Please correct me if I'm wrong. Thanks.
@mrunmaygosavi3062
@mrunmaygosavi3062 5 months ago
Please stop saying "pros and corn"... it makes us hungry.