Good explanation, liked it... I have a couple of questions. When you bundle the code (say, into a jar file), isn't the db map (the one being broadcast) also part of the bundle? In that case the db map would be available on every worker node wherever the bundle is. If I don't want to use broadcast, what code change would I need to make? Also, I'm a bit unclear on the concept: if someone is broadcasting something, there should be some receiver, right? Can you explain how the worker node receives it, and how it gets assigned to the broadcast variable on the worker node?
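A minimal sketch of the mechanics, for anyone landing here later (the map contents and the name dbMap are illustrative): the value lives on the driver at runtime, so it is not shipped inside the jar as data; the jar only carries the code that builds it. And there is no receiver you write yourself: the first access of .value inside a task makes the executor fetch and cache the broadcast blocks.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-demo").getOrCreate()
val sc = spark.sparkContext

// Built at runtime on the driver; the jar contains this code, not the data.
val dbMap = Map(1 -> "dept-a", 2 -> "dept-b")

val bc = sc.broadcast(dbMap) // driver registers the value for distribution

val result = sc.parallelize(Seq(1, 2, 2)).map { id =>
  // The first call to bc.value on an executor fetches the broadcast blocks
  // (from the driver and from peer executors) and caches them locally --
  // that fetch-and-cache step is the implicit "receiver".
  bc.value.getOrElse(id, "unknown")
}.collect()
```

Without a broadcast variable you could reference dbMap directly inside the closure; Spark would then serialize a copy of it into every task rather than once per executor.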
Can you please make a video on how to solve the group-by out-of-memory issue? An interviewer asked me how to solve an out-of-memory issue without code changes. Please explain it both with and without code changes.
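In the meantime, a hedged sketch of the usual answer (assuming an existing SparkContext sc; the data is made up): on the code-change side, groupByKey pulls every value for a key into memory, while reduceByKey combines values map-side before the shuffle; without code changes, the common levers are configuration, e.g. more shuffle partitions or more executor memory.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// Risky on large/skewed data: all values for a key are materialized in memory.
val sumsRisky = pairs.groupByKey().mapValues(_.sum)

// Safer: partial sums are computed map-side, so far less data is shuffled.
val sumsSafe = pairs.reduceByKey(_ + _)

// Config-only options (no code changes), e.g. via spark-submit:
//   --conf spark.sql.shuffle.partitions=400
//   --conf spark.executor.memory=8g
```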
Hi, can we use a broadcast variable for a DB connection string? I'm asking because I suppose the DB connection is only used by the driver and not by the executors, but I'm not sure about this...
@DataSavvy Okay, but one of the review comments I got was: "Broadcast Variables in Spark should only be used to disseminate data to be used by Executors during processing, such as lookup tables, hash maps, etc. In some cases Broadcast Variables were being used incorrectly (to store variables only used by the Driver process, such as configuration properties or connection strings)." So this means all the executors' DB connection requests go to the driver, and hence it is not a good idea to use a broadcast variable for database connection strings.
I did not get your statement that all DB requests go to the driver... I have used this approach several times to broadcast a DB connection to executors, and it works perfectly for me... Can you highlight a scenario I am missing here, and what kind of problem it would cause?
@DataSavvy Even I am not sure; the code review comment was given by an external agency consultant. I have quoted the exact statement above and am investigating how valid it is. Is there any official document that says where DB connection strings should be stored, based on Spark's internal architecture?
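For what it's worth, a hedged sketch of the pattern this thread seems to be circling (assuming an existing SparkContext sc; the JDBC URL, credentials, and table name are made up): the connection string is plain data and can safely be broadcast or simply captured in the closure, but the live Connection object itself must be created on the executor, usually once per partition, so database traffic flows executor-to-database rather than through the driver.

```scala
import java.sql.DriverManager

// Only the string is broadcast -- never a live (non-serializable) connection.
val bcUrl = sc.broadcast("jdbc:postgresql://db-host:5432/appdb")

sc.parallelize(1 to 100).foreachPartition { rows =>
  // The connection is opened ON the executor, one per partition;
  // requests go straight from executor to database, not via the driver.
  val conn = DriverManager.getConnection(bcUrl.value, "app_user", "secret")
  val stmt = conn.createStatement()
  try {
    rows.foreach { r =>
      // illustrative write; the "events" table is hypothetical
      stmt.executeUpdate(s"INSERT INTO events(id) VALUES ($r)")
    }
  } finally {
    stmt.close()
    conn.close()
  }
}
```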
Hi, how can I read the file (the one I am going to broadcast) from HDFS into Spark and create key-value pairs? When you read it using textFile, you get an RDD[String]... how can I convert this to a Map[Key, Value]?
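One hedged way to do it (assuming an existing SparkContext sc; the HDFS path and the comma delimiter are made up, so adjust to your file's format): split each line into a pair, collect the pairs back to the driver as a Map, and broadcast that. This only works if the file is small enough to fit in driver memory, which is also the precondition for broadcasting it.

```scala
// RDD[String], one line per record, e.g. "42,some-value"
val lines = sc.textFile("hdfs:///data/lookup.csv")

val kvMap: Map[String, String] = lines
  .map { line =>
    val Array(k, v) = line.split(",", 2) // split each line into key and value
    k -> v
  }
  .collectAsMap() // gather the pairs on the driver
  .toMap          // convert to an immutable Map

val lookup = sc.broadcast(kvMap) // ship the map to executors once
```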
It depends on experience and role... In Bangalore, someone with 5-6 years of Spark experience is able to get around 25 LPA. If they are from a good college, the number goes up... A few folks whose base package is lower may get around 18 LPA.
@DataSavvy Thank you. I am a Java developer looking to upgrade my skills, especially in the field of big data. The whole big data ecosystem is kind of overwhelming. How long can it take to master if one dedicates a few hours every day?
I am getting an exception due to the increased size of a DataFrame in the PySpark code of my project. The exception is: "org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 35421:6 was 269173286 bytes, which exceeds max allowed: spark.rpc.message.maxSize (268435456 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values". The DataFrame gets created, but any operation on it, such as show() or saveAsTable(), throws the exception. I solved it by increasing spark.rpc.message.maxSize; however, is it possible to solve the issue by using broadcast variables in some way, as the exception message suggests?
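A hedged sketch of what that hint usually means (shown in Scala, though the same idea applies in PySpark; assuming an existing SparkContext sc, with bigLookup as a stand-in for whatever large driver-side object your tasks reference): if a big local object is captured in a task closure, it is serialized into every task and can exceed spark.rpc.message.maxSize, whereas wrapping it in a broadcast variable ships it once per executor. Note that if the DataFrame itself is built from a large local collection on the driver, broadcasting will not help; writing the data to storage and reading it back is the usual fix in that case.

```scala
// Stand-in for a large object loaded at runtime (kept tiny here so it runs).
val bigLookup: Map[String, String] = Map("1" -> "one", "2" -> "two")

// Without broadcast, using bigLookup inside the closure below would serialize
// it into every single task. With broadcast, each executor fetches it once:
val bc = sc.broadcast(bigLookup)

val result = sc.parallelize(1L to 100L).map { id =>
  bc.value.getOrElse(id.toString, "missing") // read-only access on executors
}
result.take(5)
```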
Thanks for the suggestion, Naveen... Unfortunately, YouTube does not allow me to edit a video to improve its audio quality, so I am stuck with the same video... This is improved in the newer videos.