Spark Out of Memory Issue | Spark Memory Tuning | Spark Memory Management | Part 1

This video is part of the Spark Interview Questions Series.
Spark memory issues are among the most common problems developers face, so they come up very often in Spark interviews. This video covers the following:
What is a memory issue in Spark
Which components can face an out of memory issue in Spark
Out of memory issue in the driver
Out of memory issue in the executor
How Spark's performance is impacted by Dynamic Partition Pruning
Here are a few useful links:
Git Repo: github.com/harjeet88/
Spark Interview Questions: • Spark Interview Questions
If you are interested in joining our community, please join the following groups:
Telegram: t.me/bigdata_hkr
Whatsapp: chat.whatsapp.com/KKUmcOGNiix...
You can drop me an email for any queries at
aforalgo@gmail.com
#apachespark #sparktutorial #bigdata
#spark #hadoop #spark3

Published: 15 Jul 2024

Comments: 94

@Nonamaee 3 years ago

So well explained, even the images were very useful. Thank you very much!

@rijwanmohammed1309 3 years ago

Great, please don't stop uploading new content!!

@minalmoon4605 3 years ago

It is a great video. Content is very useful. Keep it up man 👍🏻👍🏻👍🏻

@ravikumarkumashi7065 3 years ago

Very well explained, thank you

@vijeandran 3 years ago

Neatly explained, thank you....

@PrasadNadiger456 2 years ago

Great video.. perfect explanation

@nikhilmishra7572 3 years ago

Recently discovered this channel. This is gold

@DataSavvy 3 years ago

Thanks Nikhil :)

@nivedita5639 3 years ago

True

@kaladharnaidusompalyam851 3 years ago

Thank you so much. I have been facing this question many times in recent days. 👍

@DataSavvy 3 years ago

Thanks :)

@prasadadsul8703 1 year ago

Great information...... 👏👏👏

@saivarunkolluru6792 3 years ago

Lots of respect for ur content ❤️

@DataSavvy 3 years ago

Thanks mate

@lxkakkarot3689 2 years ago

Can you please also show code that repartitions and increases executors on a dummy process, changing the values so that we can see the impact on the run time of the jobs? That would really help in understanding the concepts.

@bhuvaneshkumarsrivastava906 3 years ago

Is the 2nd Part not there yet? Your videos are AWSUUMMM !!! :D

@NishaKumari-op2ek 3 years ago

Very useful videos. Thank you :)

@DataSavvy 3 years ago

Thanks Nisha

@rohithsaivemula3200 2 years ago

Very helpful

@RamRam-jp2kc 3 years ago

Your videos on troubleshooting are pretty good.

@DataSavvy 3 years ago

Thanks Sree Ram... :)

@nikhilgupta110 2 years ago

Pure content, great topic, informative, interactive and simple. Thank you!!

@viraajsivaraju2329 3 years ago

Very useful. Please keep making more such videos

@DataSavvy 3 years ago

Thanks Viraaj :)

@sarfarazhussain6883 3 years ago

Waiting for Part 2 :)

@DataSavvy 3 years ago

Will come soon :)

@ajaykiranchundi9979 2 years ago

Very nice video!! thank you

@DataSavvy 2 years ago

Thanks Ajay

@praptijoshi9102 3 months ago

amazing

@riyasmohammad9234 2 years ago

Great video. Can you share the source of information for further reading?

@anuragamit727 2 years ago

Hi Sir, could you please make a video on the factors that decide the number of tasks, stages, and jobs created after we submit our application?

@bhatiaparesh89 3 years ago

Waiting for part 2! 🙈

@DataSavvy 3 years ago

Working on it... Will post in a few weeks. I need to explain one related concept first before that video

@RAKESHKUMAR-tp8zj 3 years ago

U are one of the best mentors I have ever seen on YouTube. The way you explain is awesome, and the questions are all real-world ones. If my cluster memory is 10 GB and the data we want to process is 20 GB, will it process the data? Sir, can you please explain this topic?

@medotop330 3 years ago

No you can not process it

@medotop330 3 years ago

You can do it using MapReduce if it is in the batch layer and does not use iterative algorithms such as machine learning algos

@sambitkumardash9585 3 years ago

Nice video, Sir. This is a frequently asked interview question. Could you please make a video on the other issues we face in Spark?

@DataSavvy 3 years ago

Sure Sambit... Do u have any other suggestions for questions?

@sambitkumardash9585 3 years ago

@@DataSavvy could you please explain how to deal with semi-structured data, from ingestion to computation?

@nakkaeswaraoeswar2140 1 year ago

Thank you. Can you make a video about what Azure SQL is?

@PrasadChallagondla 3 years ago

Is there any real-time Spark project? Please upload a video on it. It would be helpful.

@sundarkris1320 3 years ago

Can you explain the difference between YARN memory overhead and Spark reserved and user memory?

@arvind30 3 years ago

Great video! I had a question regarding the YARN memory overhead. When a PySpark job runs, my understanding is that Python worker processes are started within the memory allocated to the executor. The JVM then sends data back and forth to these Python processes. Won't the allocated Python objects use the memory of these Python processes instead of the YARN memory overhead?

@Fresh-sh2gc 2 years ago

The worker nodes run on resources from YARN memory. YARN normally runs on a shared cluster, so there is always a tug of war between the tenants of the cluster for memory. As a result, one cannot always use too much memory. However, when there is ample YARN memory there is a process called preemption which gets more memory for the executor memory.

@krupab3388 2 years ago

Can you please give an example of each OOM you have explained here? Lots of blogs give the same explanations. What extra is here? Please provide examples. It would be great.

@ravikirantuduru1061 3 years ago

Good videos

@DataSavvy 3 years ago

Thanks Ravi :)

@RAVIC3200 3 years ago

Nice video again, Harjeet :) Hey, can you make videos on test cases in Spark/Scala as well? I have seen no one talk about it.

@DataSavvy 3 years ago

Hi Ravi, test cases are generally functional and use-case specific...

@rajlakshmipatil4415 3 years ago

Ravishankar, maybe you can try using holdenkarau

@DataSavvy 3 years ago

Thanks for suggesting... Looks like a good resource... I will go through this: github.com/holdenk/spark-testing-base

@suresh.suthar.24 2 months ago

I have one doubt: are reserved memory and YARN overhead memory the same? Because reserved memory also stores Spark internals. Thank you for your time.

@Fresh-sh2gc 2 years ago

Spark on Kubernetes works completely differently. This works only for Spark on Hadoop.

@aashishraina2831 3 years ago

Recruiters say that you don't have production experience and that a working Spark POC will not help. How can we convince them despite having a good understanding of PySpark? Please suggest.

@naveena2226 2 months ago

Hi @all, I just got to know about the wonderful videos on the Data Savvy channel. In the executor OOM "big partitions" slide: in Spark every partition is only of block size (128 MB), right? Then how come a big partition will cause an issue? Can someone please explain this? I am a little confused here. Even if there is a 10 GB file, when Spark reads the file it creates around 80 partitions of 128 MB. Even if one of the partitions is large, it cannot exceed 128 MB, right? Then how come OOM occurs??
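
The arithmetic in the question above checks out: a 10 GB file read in 128 MB splits gives 80 input partitions. That bound applies only to the initial read, though. After a shuffle (a join, a group-by, or a repartition by key), records are typically assigned to partitions by hashing the key, so a single hot key can make one partition far larger than the rest. The sketch below is plain Python rather than Spark; the function names, the 90% skew, and the use of `zlib.crc32` as a stand-in for a hash partitioner are assumptions made for illustration.

```python
import math
import zlib
from collections import Counter


def input_partitions(file_size_mb: int, split_size_mb: int = 128) -> int:
    """Number of input splits when a file is read in fixed-size chunks."""
    return math.ceil(file_size_mb / split_size_mb)


def shuffle_partition_sizes(keys, num_partitions: int) -> Counter:
    """Count records per partition when each key is hashed to a partition.

    zlib.crc32 is a deterministic stand-in for a hash partitioner (assumption).
    """
    sizes = Counter()
    for key in keys:
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return sizes


# 10 GB read in 128 MB splits gives 80 evenly sized input partitions.
read_partitions = input_partitions(10 * 1024)

# Hypothetical skewed dataset: 90% of the records share one key.
keys = ["hot_customer"] * 9_000 + [f"customer_{i}" for i in range(1_000)]
sizes = shuffle_partition_sizes(keys, num_partitions=8)
largest = max(sizes.values())

print(read_partitions)   # 80
print(largest)           # at least 9000: every "hot_customer" record lands together
```

Raising the number of shuffle partitions does not help in this situation, because all records with the same key still hash to the same partition.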

@subimalkhatua2886 3 years ago

Issue: container killed by YARN. Spark application exited with status 1. This is the most common issue in AWS Glue or any Spark job. Increasing spark.yarn.executor.memoryOverhead and spark.executor.memory will help, but make sure the total does not exceed the total yarn.nodemanager memory, or else there will be a configuration issue.
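
The sizing rule in the comment above can be checked with simple arithmetic. On YARN, the container requested for each executor is roughly the executor heap plus the memory overhead, and when the overhead is not set explicitly Spark defaults it to 10% of executor memory with a 384 MiB floor. The sketch below is plain Python, not a Spark API; the function names and the example node size are hypothetical, and current Spark releases name the overhead setting `spark.executor.memoryOverhead`.

```python
from typing import Optional

MIN_OVERHEAD_MB = 384    # Spark's documented minimum overhead
OVERHEAD_FACTOR = 0.10   # Spark's documented default overhead factor


def executor_container_mb(executor_memory_mb: int,
                          overhead_mb: Optional[int] = None) -> int:
    """Approximate YARN container size for one executor, in MiB."""
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    return executor_memory_mb + overhead_mb


def fits_on_node(executor_memory_mb: int, node_memory_mb: int,
                 overhead_mb: Optional[int] = None) -> bool:
    """True if a single executor container fits within the node's YARN memory."""
    return executor_container_mb(executor_memory_mb, overhead_mb) <= node_memory_mb


# Hypothetical node with yarn.nodemanager.resource.memory-mb = 16384.
print(executor_container_mb(8192))   # 9011 (8192 + 819)
print(executor_container_mb(2048))   # 2432 (2048 + 384, the floor applies)
print(fits_on_node(15360, 16384))    # False (15360 + 1536 exceeds 16384)
```

This is only an approximation: YARN rounds requests up to its allocation increment, and settings such as off-heap or PySpark memory add to the container size.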

@ANUKARTHIM 3 years ago

Dear Data Savvy, could you please clarify: if we go for a broadcast join, it copies the small file into the memory of all available executors, right? How does it cause a driver out of memory exception?

@DataSavvy 3 years ago

That file is first brought to the driver and merged (if it has multiple partitions), then it is sent to the executors

@ANUKARTHIM 3 years ago

@@DataSavvy Thanks for the answer

@DataSavvy 3 years ago

Thanks

@svsvikky 3 years ago

@@DataSavvy Isn't broadcast done executor-to-executor, similar to BitTorrent? Please correct me if I am wrong

@saisravankumar6020 2 years ago

When loading a file into a DataFrame you get an OOM error; how will u rectify it? Can we get a demo?

@kiranmudradi26 3 years ago

Nice video. Question: when we call coalesce(1), does it cause any OOM issues in either the driver or the executor? If calling this operation does not throw any OOM, what could be the reason? Please clarify.

@DataSavvy 3 years ago

U are right... Coalesce can also cause a memory breach in a few situations...

@kiranmudradi26 3 years ago

@@DataSavvy Thanks. In that case OOM will happen on the executor side, not on the driver side. Is my understanding correct?

@DataSavvy 3 years ago

Yes...

@DataSavvy 3 years ago

Wait... A correction here... repartition(1) can cause an issue, not coalesce(1), as coalesce will not cause a shuffle and the data will stay on the same machines...

@kiranmudradi26 3 years ago

@@DataSavvy Thanks. I was about to ask the same question. U replied in time. Kudos

@carlosllerena3922 3 years ago

Question: if I use PySpark, do I still get those errors?? Another question: instead of collect, what other command can we use?

@vikaschavan6118 2 years ago

SaveAsFile instead of collect

@touristplaces7837 10 months ago

Hello. I have 16 crore records on which I want to use a window function. But the order by is taking a huge amount of time and causing memory issues. Is there any alternative approach?

@user-dl3ck6ym4r 6 months ago

How would we know which file is small and which file is larger? An interviewer asked me this question.

@DataSavvy 5 months ago

You can list the files in the folder and see the size of each file... hdfs dfs -ls... This is the command

@user-dl3ck6ym4r 5 months ago

Thank you @@DataSavvy

@user-dl3ck6ym4r 5 months ago

But I am using an S3 bucket, so @@DataSavvy

@vishalmishra863 3 years ago

Where is the second part?

@divit00 10 months ago

Part 2??

@k.saibhargav8072 6 months ago

How to avoid the collect operation?

@DataSavvy 6 months ago

You usually don't need collect... Can you give an example where you are using it? I can suggest how to avoid it and code it correctly

@krunalgoswami4654 2 years ago

Why use RDDs in all the questions?? Why not DataFrames?

@sreenivasmekala6198 3 years ago

Is groupByKey also a cause of out of memory, right?

@DataSavvy 3 years ago

U are right... If there is skew in the data, then with groupByKey we can end up facing a memory issue
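
To see why skew hurts `groupByKey` in particular: grouping has to hold every value for a key together before anything is computed, whereas a combining aggregation (the approach `reduceByKey` takes in Spark) keeps a single running value per key. The sketch below mimics both in plain Python on a small skewed dataset; it is not Spark code, and the helper names and data are hypothetical.

```python
from collections import defaultdict


def group_then_sum(pairs):
    """groupByKey-style: buffer every value per key, then aggregate."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    largest_buffer = max(len(values) for values in groups.values())
    totals = {key: sum(values) for key, values in groups.items()}
    return totals, largest_buffer


def combine_as_you_go(pairs):
    """reduceByKey-style: keep one running total per key, no value buffers."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical skewed input: one key owns 95% of the records.
pairs = [("hot", 1)] * 9_500 + [(f"k{i}", 1) for i in range(500)]

grouped_totals, largest_buffer = group_then_sum(pairs)
combined_totals = combine_as_you_go(pairs)

print(largest_buffer)                      # 9500 values buffered for the hot key
print(grouped_totals == combined_totals)   # True
```

Both functions return the same totals, but only the grouping version ever holds 9,500 values for one key at once, which is the pattern that exhausts executor memory when the hot key is large.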

@amitpadhi2717 3 years ago

I am not able to join your WhatsApp group. I am facing an issue on my local machine while setting up Spark; please let me know where to post my query

@DataSavvy 3 years ago

Please join the Telegram group and send your query there... We have moved to Telegram... Http://t.me/bigdata_hkr

@amitpadhi2717 3 years ago

@@DataSavvy aforalgo@gmail.com I have already sent an email, could you please check the issue I faced?

@rahulpandit9082 3 years ago

Who is the person who dislikes this video... I think.. frustrated with life or wife... 😀😀😀

@DataSavvy 3 years ago

Ha ha ha 😀

@midhileshmomidi2434 3 years ago

I am learning the concepts, but without real-time experience I am not able to practice data collection from various sources. I am able to clean data well using PySpark and can do ML using Spark's MLlib library. But please suggest some sources to practice data collection from various sources. Thank you

@DataSavvy 3 years ago

Sure, let me look into this and I will share some links... You can join our document library and Data Savvy group... U will get a lot of relevant information there