Say we want to process 1 TB of data with a given cluster capacity. In your example: 1. when may we get an OOM (executor) issue? 2. when will we not get an OOM issue? 3. how can Spark do a sort-merge shuffle join (500 GB per DF, 2 DFs)? 4. briefly explain how Spark handles big data without OOM issues, and when it may still get OOM, with examples along with code.
The total number of executors is calculated using the formula: total cores / number of cores per executor. That number is then divided by the total nodes in the cluster to get the number of executors per node. Here, 15 is the total number of executors.
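The formula above can be sketched as a quick calculation. The cluster numbers here (75 usable cores across 5 nodes, 5 cores per executor) are illustrative, not from any specific cluster:

```python
# Executor-count formula described above, with hypothetical cluster numbers.
total_cores = 75          # usable cores across the whole cluster
cores_per_executor = 5    # common rule of thumb for executor cores
num_nodes = 5

total_executors = total_cores // cores_per_executor   # 75 / 5 = 15
executors_per_node = total_executors // num_nodes     # 15 / 5 = 3

print(total_executors, executors_per_node)  # 15 3
```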
Hi Mahesh, the memory parameters are set at the cluster level. Please check this video on processing large files: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-rBxRBk2ZkVk.html
Spark's internal mechanism performs the memory optimizations automatically. We can also update the configuration to apply memory settings explicitly, which in turn drives those optimizations.
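As a minimal sketch, explicit memory settings can be passed when building a SparkSession. The config keys below are standard Spark properties; the values are purely illustrative and must be tuned for your cluster (this is a config fragment, not something runnable without a Spark installation):

```python
from pyspark.sql import SparkSession

# Illustrative values only -- tune per your cluster. Note that in client
# mode, spark.driver.memory must be set via spark-submit (--driver-memory),
# before the driver JVM starts.
spark = (
    SparkSession.builder
    .appName("memory-config-example")
    .config("spark.executor.memory", "22g")         # heap per executor
    .config("spark.executor.cores", "5")            # concurrent tasks per executor
    .config("spark.executor.memoryOverhead", "2g")  # off-heap overhead
    .config("spark.memory.fraction", "0.6")         # unified (execution + storage)
    .getOrCreate()
)
```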
Hi, in any version, Spark is capable of calculating memory as per the task list. If we want to customize based on the data, then we need to provide these custom calculated values. Even though Spark 3 provides default values, it is always better to check the UI and analyse the jobs to confirm there is no data skew, no delays, etc., and then use custom configuration in spark-submit.
Thank you for watching the video. The driver is the Spark program where you created the SparkContext or SparkSession. The Application Master is the process that negotiates executor resources with the Resource Manager. The driver memory was mentioned only as an example; you can set the driver memory as per your requirement. In some cases, the driver memory is kept the same as the executor memory. It depends on the cluster memory and the resource sharing.
If we request a container of 4 GB, then we are actually requesting 4 GB (heap memory) + max(384 MB, 10% of 4 GB) as off-heap overhead memory. Out of the 4 GB heap, 300 MB is reserved for the running executor: 4096 − 300 = 3796 MB (~3.7 GB). Of this 3.7 GB, 60% goes to unified memory (storage + execution), i.e. ~2.3 GB, and the remaining 40% of 3.7 GB goes to user memory (i.e. ~1.5 GB). I am not able to relate this to your calculation, kindly help me. I have checked multiple videos but still cannot understand how to calculate cluster memory. Kindly help.
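The breakdown in this question can be checked numerically. This sketch assumes Spark's default unified memory model: 300 MB reserved, `spark.memory.fraction = 0.6`, and an overhead of max(384 MB, 10% of the heap) requested on top of the heap (the overhead is additional, not carved out of the 4 GB):

```python
# Executor memory breakdown under Spark's unified memory model defaults.
heap_mb = 4096                                 # --executor-memory 4g
overhead_mb = max(384, int(0.10 * heap_mb))    # requested ON TOP of the heap
reserved_mb = 300                              # fixed reserved memory

usable_mb = heap_mb - reserved_mb              # 3796 MB
unified_mb = 0.6 * usable_mb                   # execution + storage
user_mb = 0.4 * usable_mb                      # user data structures, UDF objects

container_request_mb = heap_mb + overhead_mb   # what YARN actually allocates
print(overhead_mb, usable_mb, round(unified_mb), round(user_mb), container_request_mb)
# 409 3796 2278 1518 4505
```

So the total container request is ~4.5 GB, of which roughly 2.2 GB is unified memory and 1.5 GB is user memory.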
Yes, the 4 GB will be divided between execution, cache, and overhead memory. As 4 GB is a very small number, we see less memory allocated. Think from a production-cluster perspective, where we will have memory in TBs; in that case, enough memory is available for job execution and for caching intermediate results.
No problem, thank you for asking the questions. When you look at it based on the input data, you need to think about parallelism: the HDFS block size and the number of partitions have to be counted. For calculating the executor memory, leave 1 core and ~2 GB per node to YARN. From the remaining memory, calculate the per-executor memory. Executor cores are typically 5 per executor. So calculate how much memory you can give per core, and then multiply that by the number of cores per executor. Example: if you have 5 nodes with 15 cores each, you will have 75 cores in total, and say 64 GB memory per node, so the total memory for 5 nodes is 64*5 = 320 GB. Leave 1 core and 2 GB per node to YARN, which leaves 70 cores and 310 GB (across all 5 nodes). The memory per core is then 310/70 = 4.43 GB. With 5 executor cores, that gives 5*4.43 ≈ 22.1 GB executor memory. The input data will be divided into partitions, so we can calculate the executor memory in this way.
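The sizing walkthrough above can be written as a short calculation. The numbers (5 nodes, 15 cores and 64 GB per node, 5 cores per executor, 1 core and 2 GB per node left to YARN) come from the example itself:

```python
# Per-executor memory sizing for the worked example above.
nodes = 5
cores_per_node = 15
mem_per_node_gb = 64
cores_per_executor = 5

usable_cores = nodes * (cores_per_node - 1)          # leave 1 core/node -> 70
usable_mem_gb = nodes * (mem_per_node_gb - 2)        # leave 2 GB/node -> 310

mem_per_core = usable_mem_gb / usable_cores          # ~4.43 GB per core
executor_mem_gb = cores_per_executor * mem_per_core  # ~22.1 GB per executor
num_executors = usable_cores // cores_per_executor   # 14 executors

print(round(mem_per_core, 2), round(executor_mem_gb, 1), num_executors)
# 4.43 22.1 14
```

Note that 70 cores at 5 cores per executor also tells you the cluster can run about 14 executors in total.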