Say we want to process 1 TB of data with a given cluster capacity. In your example: 1. when may we get an OOM (executor) issue? 2. when will we not get an OOM issue? 3. how can Spark do a sort-merge shuffle join (500 GB per DF, 2 DFs)? 4. briefly explain how Spark handles big data without OOM issues, and when it may still get OOM, with examples along with code.
The total number of executors is calculated using the formula: total cores / number of cores per executor. That number is then divided by the total nodes in the cluster to get the number of executors per node. Here, 15 is the total number of executors.
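The formula above can be sketched as a quick calculation. The cluster numbers here (75 usable cores across 5 nodes, 5 cores per executor) are illustrative, not from any specific cluster:

```python
# Executor-count formula described above, with hypothetical cluster numbers.
total_cores = 75          # usable cores across the whole cluster
cores_per_executor = 5    # common rule of thumb for executor cores
num_nodes = 5

total_executors = total_cores // cores_per_executor   # 75 / 5 = 15
executors_per_node = total_executors // num_nodes     # 15 / 5 = 3

print(total_executors, executors_per_node)  # 15 3
```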
Hi Mahesh, the memory parameters are set at the cluster level. Please check this video on processing large files: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-rBxRBk2ZkVk.html
Spark's internal mechanism performs the memory optimizations automatically. We can also update the configuration to apply memory settings explicitly, which in turn drives those optimizations.
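As a minimal sketch, explicit memory settings can be passed when building a SparkSession. The config keys below are standard Spark properties; the values are purely illustrative and must be tuned for your cluster (this is a config fragment, not something runnable without a Spark installation):

```python
from pyspark.sql import SparkSession

# Illustrative values only -- tune per your cluster. Note that in client
# mode, spark.driver.memory must be set via spark-submit (--driver-memory),
# before the driver JVM starts.
spark = (
    SparkSession.builder
    .appName("memory-config-example")
    .config("spark.executor.memory", "22g")         # heap per executor
    .config("spark.executor.cores", "5")            # concurrent tasks per executor
    .config("spark.executor.memoryOverhead", "2g")  # off-heap overhead
    .config("spark.memory.fraction", "0.6")         # unified (execution + storage)
    .getOrCreate()
)
```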
Hi, in any version, Spark is capable of calculating memory as per the task list. If we want to customize based on the data, then we need to provide these custom calculated values. Even though Spark 3 provides default values, it is always better to check the UI and analyse the jobs to confirm there is no data skew, no delays, etc., and then use custom configuration in spark-submit.
Thank you for watching the video. The driver is the Spark program where you created the SparkContext or SparkSession. The Application Master is the process that negotiates executor resources with the Resource Manager. The driver memory was mentioned only as an example; you can set the driver memory as per your requirement. In some cases, the driver memory is kept the same as the executor memory. It depends on the cluster memory and the resource sharing.
If we request a container of 4 GB, then we are actually requesting 4 GB (heap memory) + max(384 MB, 10% of 4 GB) as off-heap overhead memory. Out of the 4 GB heap, 300 MB is reserved for the running executor: 4096 − 300 = 3796 MB (~3.7 GB). Of this 3.7 GB, 60% goes to unified memory (storage + execution), i.e. ~2.3 GB, and the remaining 40% of 3.7 GB goes to user memory (i.e. ~1.5 GB). I am not able to relate this to your calculation, kindly help me. I have checked multiple videos but still cannot understand how to calculate cluster memory. Kindly help.
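The breakdown in this question can be checked numerically. This sketch assumes Spark's default unified memory model: 300 MB reserved, `spark.memory.fraction = 0.6`, and an overhead of max(384 MB, 10% of the heap) requested on top of the heap (the overhead is additional, not carved out of the 4 GB):

```python
# Executor memory breakdown under Spark's unified memory model defaults.
heap_mb = 4096                                 # --executor-memory 4g
overhead_mb = max(384, int(0.10 * heap_mb))    # requested ON TOP of the heap
reserved_mb = 300                              # fixed reserved memory

usable_mb = heap_mb - reserved_mb              # 3796 MB
unified_mb = 0.6 * usable_mb                   # execution + storage
user_mb = 0.4 * usable_mb                      # user data structures, UDF objects

container_request_mb = heap_mb + overhead_mb   # what YARN actually allocates
print(overhead_mb, usable_mb, round(unified_mb), round(user_mb), container_request_mb)
# 409 3796 2278 1518 4505
```

So the total container request is ~4.5 GB, of which roughly 2.2 GB is unified memory and 1.5 GB is user memory.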
Yes, the 4 GB will be divided between execution, cache, and overhead memory. As 4 GB is a very small number, we see less memory allocated. Think from a production-cluster perspective, where we will have memory in TBs; in that case, enough memory is available for job execution and for caching intermediate results.
No problem, thank you for asking the questions. When you look at it based on the input data, you need to think about parallelism: the HDFS block size and the number of partitions have to be counted. For calculating the executor memory, leave 1 core and ~2 GB per node to YARN. From the remaining memory, calculate the per-executor memory. Executor cores are typically 5 per executor. So calculate how much memory you can give per core, and then multiply that by the number of cores per executor. Example: if you have 5 nodes with 15 cores each, you will have 75 cores in total, and say 64 GB memory per node, so the total memory for 5 nodes is 64*5 = 320 GB. Leave 1 core and 2 GB per node to YARN, which leaves 70 cores and 310 GB (across all 5 nodes). The memory per core is then 310/70 = 4.43 GB. With 5 executor cores, that gives 5*4.43 ≈ 22.1 GB executor memory. The input data will be divided into partitions, so we can calculate the executor memory in this way.
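The sizing walkthrough above can be written as a short calculation. The numbers (5 nodes, 15 cores and 64 GB per node, 5 cores per executor, 1 core and 2 GB per node left to YARN) come from the example itself:

```python
# Per-executor memory sizing for the worked example above.
nodes = 5
cores_per_node = 15
mem_per_node_gb = 64
cores_per_executor = 5

usable_cores = nodes * (cores_per_node - 1)          # leave 1 core/node -> 70
usable_mem_gb = nodes * (mem_per_node_gb - 2)        # leave 2 GB/node -> 310

mem_per_core = usable_mem_gb / usable_cores          # ~4.43 GB per core
executor_mem_gb = cores_per_executor * mem_per_core  # ~22.1 GB per executor
num_executors = usable_cores // cores_per_executor   # 14 executors

print(round(mem_per_core, 2), round(executor_mem_gb, 1), num_executors)
# 4.43 22.1 14
```

Note that 70 cores at 5 cores per executor also tells you the cluster can run about 14 executors in total.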