We cover all the topics in the Big Data world. As part of this channel, we intend to bring videos related to big data, Hadoop, Spark, NoSQL databases, HBase, Cassandra, machine learning, deep learning, etc.
We also help students with free interview preparation. Please reach out to us without any hesitation.
You can join our WhatsApp group and Telegram group to discuss and prepare interview questions. Stay updated with all material on GitHub. If you want to connect with me, please connect on LinkedIn.
In the lambda architecture, so far no one has explained how deduplication is handled when the batch and stream processing data are combined in the serving layer. Whatever data is processed by the streaming layer eventually gets processed in the batch layer, right? If this is true, then the data previously processed by the streaming layer is no longer required, so do we need to remove the data processed by the streaming layer?
To whom it may concern, when to use groupByKey over reduceByKey: groupByKey() can be used for non-associative operations, where the order of application of the operation matters. For example, if we want to calculate the median of the set of values for each key, we cannot use reduceByKey(), since median is not an associative operation.
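A minimal sketch of that median example, assuming a local SparkContext named sc (not part of the original comment):

```python
import statistics
from pyspark import SparkContext

sc = SparkContext("local[*]", "median-per-key")

pairs = sc.parallelize([("a", 1), ("a", 5), ("a", 3), ("b", 2), ("b", 8)])

# groupByKey() collects every value for a key so the non-associative median
# can be applied to the full group; reduceByKey() could not express this.
medians = pairs.groupByKey().mapValues(lambda vals: statistics.median(vals))
print(medians.collect())  # [('a', 3), ('b', 5.0)]
```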
Hi sir, please help me with the following requirement:
+---+-----+
| id|count|
+---+-----+
|  a|    3|
|  b|    2|
|  c|    4|
+---+-----+
I need the following output using Spark: a a a b b c c c c
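One possible approach (a sketch, assuming a SparkSession named spark and recreating the small DataFrame shown above): repeat each id count times with array_repeat and flatten with explode.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("repeat-rows").getOrCreate()

df = spark.createDataFrame([("a", 3), ("b", 2), ("c", 4)], ["id", "count"])

# array_repeat builds an array holding `count` copies of id; explode turns it into rows.
result = df.select(expr("explode(array_repeat(id, CAST(count AS INT)))").alias("id"))
result.show()  # a, a, a, b, b, c, c, c, c -- one value per row
```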
How do I join a small table with a big table when I want to keep all the data in the small table? The small table has 100k records and the large table has 1 million records. df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer') runs out of memory, and I can't broadcast the small df, I don't know why. What is the best approach here? Please help.
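A hedged sketch, assuming the smalldf and largedf DataFrames from the question. Note the valid join type string is 'left_outer' (or 'left'), not 'left_outerjoin'. Also note that in an outer join Spark can only broadcast the non-preserved side, so when every small-table row must be kept, the small table cannot be the broadcast side of a broadcast hash join.

```python
from pyspark.sql.functions import broadcast

# Option 1: keep all small-table rows with a plain left outer join; 100k x 1M
# rows is normally fine for an ordinary shuffle/sort-merge join.
joined = smalldf.join(largedf, smalldf.id == largedf.id, how="left_outer")

# Option 2: if an inner join is acceptable (only matching rows are needed),
# broadcasting the small side avoids the shuffle entirely.
joined_inner = largedf.join(broadcast(smalldf), largedf.id == smalldf.id, how="inner")
```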
Hi all, I just got to know about the wonderful videos on the datasavvy channel. In the executor OOM - big partitions slide: in Spark, every partition is of block size only, right (128 MB)? Then how can a big partition cause an issue? Can someone please explain this? A little confused here. Even if there is a 10 GB file, when Spark reads the file it creates around 80 partitions of 128 MB. Even if one of the partitions is large, it cannot exceed 128 MB, right? So how does OOM occur?
If we maintain replicas of data in three different racks in Hadoop and we submit a job, we get results, right? Why don't we get duplicate copies of the output, one per replica? How, or by what mechanism, does Hadoop ensure that only one replica of a block is processed when two more duplicates exist?
1. The cache method is used to persist the DataFrame or RDD in memory by default. It is shorthand for calling persist() with the default storage level, which is MEMORY_ONLY for RDDs (MEMORY_AND_DISK for DataFrames). 2. The persist method allows you to specify a storage level for persisting the DataFrame or RDD. This storage level can include options such as MEMORY_ONLY, MEMORY_ONLY_SER, DISK_ONLY, MEMORY_AND_DISK, etc.
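A minimal sketch, assuming a running Spark application and an existing DataFrame named df:

```python
from pyspark import StorageLevel

df.cache()                          # shorthand for persist() with the default storage level
df.count()                          # an action materializes the cache
df.unpersist()

df.persist(StorageLevel.DISK_ONLY)  # persist() lets the caller choose the storage level
df.count()
df.unpersist()                      # release the cached data when done
```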
map() - 1. Supported on RDDs, not on DataFrames. 2. The operation runs on every row (record). 3. mapPartitions() - heavy initialization executes only once per partition instead of for every record (row).
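A hedged sketch, assuming a local SparkContext named sc, contrasting per-record map() with per-partition mapPartitions():

```python
rdd = sc.parallelize(range(10), numSlices=2)

# map(): the function is applied to every record individually.
doubled = rdd.map(lambda x: x * 2)

def double_partition(rows):
    # any heavy setup (e.g. opening a DB connection) would go here,
    # running once per partition rather than once per record
    for x in rows:
        yield x * 2

# mapPartitions(): the function receives an iterator over a whole partition.
doubled_by_partition = rdd.mapPartitions(double_partition)
print(doubled_by_partition.collect())
```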
1. Parsing - create an abstract syntax tree.
2. Analysis - the Catalyst analyzer performs semantic analysis on the tree. This includes resolving references, type checking, and creating a logical plan. The analyzer also infers data types.
3. Logical optimisation - rewrite the plan into a more efficient form. This includes predicate pushdown and constant folding.
4. Physical planning - the logical plan is translated into one or more physical plans made of executable physical operators.
5. Physical optimisation - the plan is optimised further by considering factors like data partitioning, join order, and choosing the most efficient physical operators.
6. Code generation - generates Java bytecode for the optimized physical plan.
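A small sketch, assuming a SparkSession named spark: explain() shows the plans Catalyst produced at these stages.

```python
df = spark.range(100).filter("id > 10").groupBy().count()

# mode="extended" (Spark 3.x) prints the parsed, analyzed, and optimized
# logical plans plus the final physical plan.
df.explain(mode="extended")
```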
groupBy in PySpark: 1. A transformation that groups the elements of a DataFrame or RDD based on the specified key or keys. 2. Does not perform any aggregation on the grouped data by itself. 3. Works on non-key-value-pair data - use groupBy in that case. reduceByKey in PySpark: 1. The operation is specifically designed for key-value pair RDDs; it groups the data by key. 2. Performs a reduction or aggregation operation while combining. 3. For key-value pair data - use reduceByKey.
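A minimal sketch with a local SparkContext named sc (assumed): word counts via reduceByKey() versus groupByKey() plus a manual sum.

```python
pairs = sc.parallelize([("spark", 1), ("hadoop", 1), ("spark", 1)])

# reduceByKey combines values map-side before the shuffle - preferred for associative ops.
counts_reduce = pairs.reduceByKey(lambda a, b: a + b)

# groupByKey shuffles every value; the aggregation only happens afterwards.
counts_group = pairs.groupByKey().mapValues(lambda vals: sum(vals))

print(counts_reduce.collect())  # [('spark', 2), ('hadoop', 1)]
print(counts_group.collect())
```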
repartition: 1. Used to increase or decrease the number of RDD/DataFrame partitions. 2. Involves a full shuffle. coalesce: 1. Can only reduce the number of partitions. 2. Avoids a full shuffle by merging existing partitions. 3. Less expensive.
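A minimal sketch, assuming an existing DataFrame named df:

```python
df_200 = df.repartition(200)         # full shuffle; partition count can go up or down
df_10 = df_200.coalesce(10)          # only reduces; merges partitions without a full shuffle
print(df_10.rdd.getNumPartitions())  # 10
```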
1. Transformations are operations on RDDs or DataFrames that create a new RDD or DataFrame from an existing one. 2. Lazily evaluated, meaning the execution is deferred until an action is called. 3. They return a new RDD/DataFrame rather than a result to the driver. Actions: 1. Actions are operations that return a value to the driver program (or write to external storage) and trigger execution of the pending transformations.
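A small sketch with a SparkContext named sc (assumed): the transformations stay lazy until count() runs.

```python
rdd = sc.parallelize(range(100))
doubled = rdd.map(lambda x: x * 2)            # transformation - nothing executes yet
evens = doubled.filter(lambda x: x % 4 == 0)  # still lazy
print(evens.count())                          # action - job runs, value returned to the driver
```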
1. pyspark.SparkContext is the entry point to PySpark functionality, used to communicate with the cluster and to create an RDD. 2. There can be only one SparkContext per JVM. 3. In order to create another, you first need to stop the existing one using the stop() method. 4. The PySpark shell creates and provides the sc object, which is an instance of the SparkContext class. 5. Creating a SparkSession creates a SparkContext internally and exposes the sparkContext variable to use.
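A minimal sketch of that last point:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("context-demo").getOrCreate()
sc = spark.sparkContext       # the SparkContext created internally by the session
print(sc.applicationId)

spark.stop()                  # stop it before creating a new SparkContext in the same JVM
```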
Great explanation. Excellent teaching. Please do a deeper dive into the coalesce and repartition commands and several scenarios (e.g. coalesce after repartition, repartition after coalesce, coalesce after coalesce, etc.). I know that some of these may not be meaningful, but I have seen several of your videos and you are a GREAT teacher.
How do you decide/calculate the number of executors and the memory when multiple jobs are run by multiple processes on the cluster? What is the best way to tune our process? Is it a good approach to give 50 executors with 16 GB each? Could you please explain with an example?
In this scenario-based question, can we create an end-to-end pipeline using Kafka and a Power BI dashboard? For example, we connect to your database with a source connector, use ksqlDB for the business-level transformations, store the result in a Kafka topic, and then connect it to Power BI for the dashboard. @dataSavvy or someone, can you check whether my thinking is right?
Why is it called lambda architecture? Isn't it the same as an ETL/ELT workflow? Is it only the nature of the source data (stream/event) that makes it a lambda architecture? My question is more about the etymology of this architecture. Can an architecture on the Databricks platform, like the lakehouse, work for the discussed use cases?