Data Savvy
We cover all the topics in the Big Data world. As part of this channel, we intend to bring you videos related to big data, Hadoop, Spark, NoSQL databases, HBase, Cassandra, machine learning, deep learning, etc.

We also help students with free interview preparation. Please reach out to us without any hesitation.

You can join our WhatsApp group and Telegram group to discuss and prepare interview questions.
Stay updated with all material on GitHub.
If you want to connect with me, please connect on LinkedIn.
Comments
@jayantmeshram7370
@jayantmeshram7370 9 days ago
I am trying to run your code on an Ubuntu system but am not able to. Could you please make a video on how to run your code on Ubuntu/Linux? Thanks.
@anudeepk7390
@anudeepk7390 11 days ago
Did the participant consent to this being posted online? If not, you should blur his face.
@DataSavvy
@DataSavvy 11 days ago
Yes, it was agreed with the participants.
@roopashastri9908
@roopashastri9908 11 days ago
How can we orchestrate this in Airflow? What should the schedule interval be?
@DeepakNanaware-ze8pq
@DeepakNanaware-ze8pq 20 days ago
Hey bro, your videos are good, but if you create a video on how to handle the small file issue in a practical way, it would be very helpful for us.
@khrest-lt6gj
@khrest-lt6gj 21 days ago
I really like the pace of the videos. Great job! And thank you!
@biswadeeppatra1726
@biswadeeppatra1726 A month ago
Please share the doc that you are using in this video.
@VivekKBangaru
@VivekKBangaru A month ago
Very informative. Thanks, buddy.
@akashhudge5735
@akashhudge5735 A month ago
In the Lambda architecture, no one has explained so far how deduplication is handled when the batch and stream processing data are combined in the serving layer. Whatever data is processed by the streaming layer will eventually be processed by the batch layer, right? If that is true, the data previously processed by the streaming layer is no longer required, so do we need to remove the data processed by the streaming layer?
@jasbirkumar7770
@jasbirkumar7770 A month ago
Sir, can you tell me something about housekeeping executive Spark data? I don't understand the word "Spark". The facility company JLL requires Spark experience.
@deepanshuaggarwal7042
@deepanshuaggarwal7042 A month ago
Is "flatMapGroupsWithState" a stateful operation? Do you have any tutorial on it?
@thelazitech
@thelazitech 2 months ago
To whom it may concern, on when to use groupByKey() over reduceByKey(): groupByKey() can be used for non-associative operations, where the order in which the operation is applied matters. For example, if we want to calculate the median of a set of values for each key, we cannot use reduceByKey(), since median is not an associative operation.
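A minimal PySpark sketch of that point (example data assumed, not from the video): sum is associative, so reduceByKey() can combine partial results, while median needs all of a key's values at once, hence groupByKey().

```python
import statistics

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("median-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(
    [("a", 1), ("a", 5), ("a", 2), ("b", 10), ("b", 20)]
)

# Associative operation: safe to combine partial results with reduceByKey().
sums = rdd.reduceByKey(lambda x, y: x + y)

# Non-associative operation: gather all values per key, then take the median.
medians = rdd.groupByKey().mapValues(lambda vs: statistics.median(list(vs)))

print(sums.collect())     # e.g. [('a', 8), ('b', 30)]
print(medians.collect())  # e.g. [('a', 2), ('b', 15)]
```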
@sreekantha2010
@sreekantha2010 2 months ago
Awesome!! Wonderful explanation. Before this, I had seen so many videos, but none of them explained the steps with such clarity. Thank you for sharing.
@BishalKarki-pe8hs
@BishalKarki-pe8hs 2 months ago
vak mugi
@ldk6853
@ldk6853 2 months ago
Terrible accent… 😮
@maturinagababu98
@maturinagababu98 3 months ago
Hi sir, please help me with the following requirement:
+---+-----+
| id|count|
+---+-----+
|  a|    3|
|  b|    2|
|  c|    4|
+---+-----+
I need the following output using Spark: a a a b b c c c c
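One possible approach (a sketch, not an answer from the thread): build an array of `count` copies of each id with array_repeat, then explode it into rows.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repeat-rows").getOrCreate()
df = spark.createDataFrame([("a", 3), ("b", 2), ("c", 4)], ["id", "count"])

# array_repeat(id, count) makes `count` copies of `id`; explode flattens them.
result = df.select(
    F.explode(F.expr("array_repeat(id, cast(count as int))")).alias("id")
)
result.show()  # rows: a, a, a, b, b, c, c, c, c
```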
@ramyajyothi8697
@ramyajyothi8697 3 months ago
What do you mean by an application needing a lot of joins? Can you please clarify how the joins affect the architecture decision?
@suresh.suthar.24
@suresh.suthar.24 3 months ago
I have one doubt: are reserved memory and YARN overhead memory the same? Because reserved memory also stores Spark internals. Thank you for your time.
@ahmedaly6999
@ahmedaly6999 3 months ago
How do I join a small table with a big table while fetching all the data in the small table? E.g. the small table has 100k records and the large table has 1 million records:
df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer')
It runs out of memory, and I can't broadcast the small df; I don't know why. What is the best approach here? Please help.
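A hedged sketch of one workaround (not from the thread; table paths are hypothetical). For an outer join, Spark can only broadcast the non-preserved side, which may be why broadcasting smalldf fails here; an inner broadcast join plus the unmatched small-table rows reproduces the left-outer result.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("small-big-join").getOrCreate()
smalldf = spark.read.parquet("/data/small_table")  # hypothetical paths
largedf = spark.read.parquet("/data/large_table")

# An inner join can broadcast either side, so the 100k-row table is safe here.
matched = largedf.join(broadcast(smalldf), on="id", how="inner")

# Small-table rows that found no match in the large table.
unmatched = smalldf.join(matched.select("id").distinct(), on="id", how="left_anti")

# Union reproduces smalldf LEFT OUTER largedf; missing columns become null.
result = matched.unionByName(unmatched, allowMissingColumns=True)
```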
@naveena2226
@naveena2226 4 months ago
Hi all, I just got to know about the wonderful videos on the Data Savvy channel. On the "executor OOM - big partitions" slide: in Spark, every partition is of block size only, right (128 MB)? Then how can a big partition cause an issue? Can someone please explain? I'm a little confused here. Even if there is a 10 GB file, when Spark reads the file it creates around 80 partitions of 128 MB. Even if one of the partitions is big, it cannot exceed 128 MB, right? So how does OOM occur?
@sheshkumar8502
@sheshkumar8502 4 months ago
Hi how are you
@praptijoshi9102
@praptijoshi9102 4 months ago
Amazing!
@adityakvs3529
@adityakvs3529 4 months ago
I have a doubt: who takes care of task scheduling?
@kaladharnaidusompalyam851
@kaladharnaidusompalyam851 4 months ago
If we maintain replicas of the data in three different racks in Hadoop and we submit a job, we get results, right? Why don't we get duplicate executions over the copies of the data? What is the mechanism in Hadoop by which only one block of data needs to be processed when we have two more duplicates?
@prathapganesh7021
@prathapganesh7021 4 months ago
Great content, thank you!
@RakeshYadav-cw3zf
@RakeshYadav-cw3zf 5 months ago
Very well explained.
@ayushigupta542
@ayushigupta542 5 months ago
Great content! Are you on Topmate or any other platform where I can connect with you? I need some career advice/guidance from you.
@Pratik0917
@Pratik0917 5 months ago
Then why aren't people using Dataset everywhere?
@user-du9wb1oe7t
@user-du9wb1oe7t 5 months ago
Hi Harjeet, I'm getting a "KafkaUtils not found" error while creating a DStream.
@harshitsingh9842
@harshitsingh9842 5 months ago
Where is the volume?
@harshitsingh9842
@harshitsingh9842 5 months ago
Having a comparison table at the end of the video would be appreciated.
@pandurangbhadange25
@pandurangbhadange25 5 months ago
1. The cache() method persists the DataFrame or RDD in memory by default. It is shorthand for calling persist() with the default storage level (MEMORY_ONLY for RDDs; for DataFrames, recent Spark versions default to MEMORY_AND_DISK).
2. The persist() method allows you to specify a storage level for persisting the DataFrame or RDD, such as MEMORY_ONLY, MEMORY_ONLY_SER, DISK_ONLY, MEMORY_AND_DISK, etc.
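A minimal sketch of the difference (example data and app name assumed):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000))
rdd.cache()                         # shorthand for persist(StorageLevel.MEMORY_ONLY)

df = spark.range(1000)
df.persist(StorageLevel.DISK_ONLY)  # persist() takes an explicit storage level
df.count()                          # an action materializes the persisted data
df.unpersist()                      # release the storage when done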
@pandurangbhadange25
@pandurangbhadange25 5 months ago
map():
1. Supported on RDDs, not on (PySpark) DataFrames.
2. Operates on every row individually.
mapPartitions():
1. Heavy initialization executes only once per partition instead of once per record (row).
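A small sketch of the mapPartitions() point; the setup dict is a hypothetical stand-in for any heavy initialization such as opening a connection:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapPartitions-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(10), numSlices=2)

def process_partition(rows):
    conn = {"initialized": True}  # hypothetical heavy setup: runs once per partition
    for r in rows:
        yield r * 2               # per-record work that would reuse `conn`

print(rdd.mapPartitions(process_partition).collect())
```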
@pandurangbhadange25
@pandurangbhadange25 5 months ago
1. Parsing - create an abstract syntax tree.
2. Analysis - the Catalyst analyzer performs semantic analysis on the tree. This includes resolving references, type checking, and creating a logical plan. The analyzer also infers data types.
3. Logical optimisation - rewrite the plan into a more efficient form. This includes predicate pushdown and constant folding.
4. Physical planning - Spark stages and tasks are created.
5. Physical optimisation - the plan is optimized further by considering factors like data partitioning, join order, and choosing the most efficient physical operators.
6. Code generation - generate Java bytecode for the optimized physical plan.
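These phases can be inspected with explain(); a quick sketch, assuming Spark 3.x:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
df = (
    spark.range(100)
    .filter(F.col("id") > 10)
    .select((F.col("id") * 2).alias("doubled"))
)

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.explain(mode="extended")
```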
@pandurangbhadange25
@pandurangbhadange25 5 months ago
groupBy in PySpark:
1. A transformation that groups the elements of a DataFrame or RDD based on the specified key or keys.
2. Does not itself perform any aggregation on the grouped data.
3. Works on non-key-value-pair data.
reduceByKey in PySpark:
1. Specifically designed for key-value-pair RDDs; it groups the data by key.
2. Performs a reduction or aggregation operation as part of the grouping.
3. Requires key-value pairs.
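A compact contrast of the two (example data made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("groupBy-vs-reduceByKey").getOrCreate()

# DataFrame groupBy: grouping only; aggregation is added explicitly via agg().
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["k", "v"])
df.groupBy("k").agg(F.sum("v").alias("total")).show()

# RDD reduceByKey: the aggregation is built into the operation itself.
rdd = spark.sparkContext.parallelize([("a", 1), ("a", 2), ("b", 3)])
print(rdd.reduceByKey(lambda x, y: x + y).collect())  # [('a', 3), ('b', 3)]
```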
@pandurangbhadange25
@pandurangbhadange25 5 months ago
repartition:
1. Used to increase or decrease the number of RDD/DataFrame partitions.
2. Involves a full shuffle.
coalesce:
1. Only reduces the number of partitions.
2. Avoids a full shuffle.
3. Less expensive.
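A quick sketch of that difference (partition counts assumed for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
df = spark.range(1_000_000)

wider = df.repartition(16)    # full shuffle; can increase or decrease partitions
narrower = wider.coalesce(4)  # merges existing partitions; avoids a full shuffle

print(wider.rdd.getNumPartitions(), narrower.rdd.getNumPartitions())  # 16 4
```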
@pandurangbhadange25
@pandurangbhadange25 5 months ago
1. Spark uses a master-slave architecture: the master is called the "Driver" and the slaves are called "Workers".
2. The context is the entry point to your application.
@pandurangbhadange25
@pandurangbhadange25 5 months ago
Transformations:
1. Operations on RDDs or DataFrames that create a new RDD or DataFrame from an existing one.
2. Lazily evaluated, meaning execution is deferred until an action is called.
3. Do not return a result to the driver.
Actions:
1. Operations that return a value to the driver program.
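A tiny illustration of lazy transformations vs. eager actions (example data assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(10))

doubled = rdd.map(lambda x: x * 2)          # transformation: nothing executes yet
filtered = doubled.filter(lambda x: x > 5)  # still nothing executes

print(filtered.count())                     # action: the whole chain runs now
```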
@pandurangbhadange25
@pandurangbhadange25 5 months ago
1. pyspark.SparkContext is the entry point to PySpark functionality; it is used to communicate with the cluster and to create RDDs.
2. Only one SparkContext is allowed per JVM.
3. To create another, you first need to stop the existing one using the stop() method.
4. The PySpark shell creates and provides the sc object, which is an instance of the SparkContext class.
5. Creating a SparkSession creates a SparkContext internally and exposes it via the sparkContext variable.
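A short sketch of points 2-5 (app name assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("context-demo").getOrCreate()
sc = spark.sparkContext      # the SparkContext created internally by SparkSession

print(sc.parallelize([1, 2, 3]).sum())  # 6

spark.stop()                 # stop it before creating another (one per JVM)
```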
@pandurangbhadange25
@pandurangbhadange25 5 months ago
Lineage - logical plan.
DAG - physical plan.
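A sketch of how to see this: the lineage can be printed with toDebugString(), while the corresponding DAG of stages shows up in the Spark UI.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
rdd = (
    spark.sparkContext.parallelize(range(10))
    .map(lambda x: x + 1)
    .filter(lambda x: x % 2 == 0)
)

# Textual lineage: the recursive chain of dependencies of the RDD.
print(rdd.toDebugString().decode("utf-8"))
```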
@pankajchikhalwale8769
@pankajchikhalwale8769 5 months ago
Great explanation. Excellent teaching. Please do a deeper dive into coalesce and repartition: commands and several scenarios (e.g. coalesce after repartition, repartition after coalesce, coalesce after coalesce, etc.). I know that some of these may not be meaningful, but I have seen several of your videos and you are a GREAT teacher.
@rakeshchaudhary8255
@rakeshchaudhary8255 5 months ago
Still relevant as of today and frequently asked. The practical on Databricks made things crystal clear.
@RAVIKUMAR74
@RAVIKUMAR74 5 months ago
How do we decide/calculate the number of executors and the memory when multiple jobs are run by multiple processes in the cluster? What is the best way to tune our process? Is giving 50 executors with 16 GB each a good approach? Could you please explain with an example?
@neerajbhanottt
@neerajbhanottt 5 months ago
Does watermarking solve late-arriving records, along with flatMapGroupsWithState, for complex scenarios in Databricks?
@clydeschultz1687
@clydeschultz1687 6 months ago
💕 'promosm'
@prajaktadange6916
@prajaktadange6916 6 months ago
Please add a video on optimization in Spark, and how to monitor performance in the Spark UI.
@anshusharaf2019
@anshusharaf2019 6 months ago
On this scenario-based question: can we create an end-to-end pipeline using Kafka and a Power BI dashboard? E.g., connect to the database with a source connector, use ksqlDB for the business-level transformations, store the results in a Kafka topic, and then connect Power BI for the dashboard. @DataSavvy or someone, can you check whether I'm thinking along the right lines?
@pramodp8161
@pramodp8161 6 months ago
That music is good and painful....
@DataSavvy
@DataSavvy 6 months ago
Feedback noted... This has been changed in all subsequent videos.
@mydreamprisha
@mydreamprisha 6 months ago
This is how "Parquet" is spelled...
@shaleenarora2894
@shaleenarora2894 6 months ago
Very clearly explained. Thank you!
@DataSavvy
@DataSavvy 6 months ago
Thanks @shaleen
@Adi300594
@Adi300594 6 months ago
Why is it called lambda architecture? Isn't it the same as an ETL/ELT workflow? Is it only the nature of the source data (stream/event) that makes it a lambda architecture? My question is more about the etymology of this architecture. Could an architecture on the Databricks platform, like the lakehouse, work for the discussed use cases?