Hadoop is not mandatory for a Spark setup. You can run a standalone Spark setup using Docker. But if you still want Hadoop as well, you can clone and follow the steps from this GitHub repo: github.com/Marcel-Jan/docker-hadoop-spark
The standalone cluster used in this tutorial runs on Docker, and you can set it up yourself. For the notebook: hub.docker.com/r/jupyter/pyspark-notebook. You can use the Docker files below to set up the cluster: github.com/subhamkharwal/docker-images/tree/master/spark-cluster-new
Can accumulator variables be used to calculate an average as well? When we calculate a sum, each executor can contribute its part, but an average won't work the same way.
Hello Sushant, To calculate the average, the simplest approach is to use two accumulators, one for the sum and another for the count. Afterwards you can divide the sum by the count on the driver to get the average. If you like the content, please make sure to share it with your network 🛜
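A minimal sketch of that idea, assuming a hypothetical RDD of numbers (names and values here are only for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-avg").getOrCreate()
sc = spark.sparkContext

# Hypothetical RDD of numeric values used only for illustration
numbers_rdd = sc.parallelize([10, 20, 30, 40, 50])

# Two accumulators: one for the running sum, one for the count
sum_acc = sc.accumulator(0)
count_acc = sc.accumulator(0)

def track(value):
    # Each executor adds to both accumulators; Spark merges them on the driver
    sum_acc.add(value)
    count_acc.add(1)

numbers_rdd.foreach(track)

# The division happens on the driver, after the action has completed
avg = sum_acc.value / count_acc.value
print(avg)  # 30.0
```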
Why did you set 32 shuffle partitions if you have 8 cores? If one partition is processed on a single core, where will the remaining 24 partitions get cores from?
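For context, shuffle partitions beyond the core count are not a problem: the extra tasks simply queue and run in waves (8 at a time on 8 cores). A tiny config sketch, with the value 32 taken only from the question above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-partitions").getOrCreate()

# With 8 cores and 32 shuffle partitions, each shuffle stage runs its
# 32 tasks in 4 waves of 8; extra partitions wait rather than needing extra cores.
spark.conf.set("spark.sql.shuffle.partitions", "32")
print(spark.conf.get("spark.sql.shuffle.partitions"))
```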
When I refresh my Spark UI it gives an error. The master is "spark://6b16b66805db:7077" and "localhost:4040" is not working either. I set .master("spark://6b16b66805db:7077"). How can this be solved, please?
From the video: select, where, group by etc. are transformations. We have narrow transformations and wide transformations. Wide transformations are those where data has to move to, or interact with, data in other partitions in the next stages.
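A small sketch of the contrast, using a hypothetical DataFrame made up only to illustrate the two transformation types:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()

# Hypothetical data, only to show the difference in the physical plan
df = spark.createDataFrame(
    [("Delhi", 100), ("Mumbai", 200), ("Delhi", 300)],
    ["city", "amount"],
)

# Narrow: each partition can be filtered independently, no shuffle needed
narrow_df = df.where(F.col("amount") > 150)

# Wide: rows with the same city must meet in one partition, so Spark shuffles
wide_df = df.groupBy("city").agg(F.sum("amount").alias("total"))

# explain() shows an Exchange (shuffle) only in the wide plan
narrow_df.explain()
wide_df.explain()
```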
@easewithdata Thanks! I've been following you for more than a month and it has been a great learning experience. We would like you to make an end-to-end project in PySpark.
One of the best in-depth explanations, thanks :) Could you please make a video on an end-to-end data engineering project, from requirement gathering to deployment?
Hello, and thanks for sharing these useful videos. How do you handle writing to Delta tables? The best practice is for each Parquet file to be between 128 MB and 1 GB, but each micro-batch is far smaller than that. How do you accumulate enough batches to reach the recommended size before finally writing to the Delta lake?
Micro-batch execution in Spark usually writes many small files. This requires a later stage that reads all those files and writes a compacted file (say, one per day) of bigger size to avoid the small-file issue. You can then use this compacted output to read data in your downstream systems.
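A rough sketch of such a compaction step, assuming the streaming job lands small Parquet files under a hypothetical date-partitioned path; all paths, column names and the file count are placeholders to tune for your data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-compaction").getOrCreate()

# Hypothetical paths and partition column; adjust to your layout
source_path = "/data/events"              # many small files from the streaming job
compacted_path = "/data/events_compacted"
run_date = "2024-01-15"

# Read one day's worth of small files
daily_df = spark.read.parquet(source_path).where(f"event_date = '{run_date}'")

# Rewrite as a handful of larger files; pick the count so each file lands
# roughly in the 128 MB - 1 GB range mentioned above
daily_df.coalesce(4).write.mode("overwrite").parquet(
    f"{compacted_path}/event_date={run_date}"
)
```

If the sink is a Delta table, Delta Lake's built-in OPTIMIZE command performs the same kind of compaction for you.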
Hi, and thanks. I have a question: what is the best practice in distributed mode for Spark? Is it good for the number of Kafka partitions and Spark executors to be equal?
For me, clearSource and sourceArchive are not working: files are not getting archived and the archive folder is not getting created. What could be the issue?
Please check this link to verify you are setting all parameters as per the requirements: spark.apache.org/docs/latest/structured-streaming-programming-guide.html
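In particular, if the option names in the question were typed as-is, note that the file-source options in that guide are spelled cleanSource and sourceArchiveDir, and the archive directory must sit outside the source path. A minimal sketch with placeholder schema and paths:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("file-source-archive").getOrCreate()

# Placeholder schema and paths, only for illustration
schema = StructType().add("id", StringType()).add("payload", StringType())

stream_df = (
    spark.readStream
    .format("json")
    .schema(schema)
    # Option is spelled cleanSource; valid values are "archive", "delete", "off"
    .option("cleanSource", "archive")
    # Archive dir must be outside the source directory
    .option("sourceArchiveDir", "/data/archive/input_json")
    .load("/data/input_json")
)

query = (
    stream_df.writeStream
    .format("console")
    .option("checkpointLocation", "/data/checkpoints/file_archive")
    .start()
)
```

Also keep in mind that archiving is done lazily after micro-batches complete, so the archive folder may only appear once a few batches have finished.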
How does PySpark handle Kafka consumer groups and consumers? Generally, by changing the consumer group we can start consuming data from the beginning of the topic. How do we start consuming from the beginning here? Will startingOffsets set to earliest always read from the start?
Spark Structured Streaming uses the checkpoint directory to track offsets, so that it doesn't re-consume offsets that have already been processed. You can find more details on setting the starting offset with Kafka at this link - spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
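A minimal sketch, assuming the Kafka connector package is on the classpath; the broker, topic and checkpoint path are placeholders. The key point is that startingOffsets is only honoured when the query starts with a fresh (empty) checkpoint directory; after that, the offsets stored in the checkpoint win:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-earliest").getOrCreate()

kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "orders")                         # placeholder topic
    # Only applies when no checkpoint exists yet; otherwise Spark resumes
    # from the offsets recorded in the checkpoint directory
    .option("startingOffsets", "earliest")
    .load()
)

query = (
    kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("console")
    .option("checkpointLocation", "/data/checkpoints/kafka_earliest")
    .start()
)
```

So to re-read a topic from the beginning, point the query at a new, empty checkpoint location (or use a new query) rather than relying on the consumer group.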
I have implemented watermarking with a 5-minute tumbling window to aggregate values, a 5-minute processing-time trigger, and update output mode. The problem is that I am still getting late data. How should this be fixed?

from pyspark.sql.functions import window, expr

df = df.withWatermark("open_time", "5 minutes")

# Group data into 5-minute candlesticks by symbol
candlestick_df = df \
    .groupBy("symbol", window("open_time", "5 minutes", startTime=0)) \
    .agg(
        expr("first(open_price) as open_price"),
        expr("max(high_price) as high_price"),
        expr("min(low_price) as low_price"),
        expr("last(close_price) as close_price"),
        expr("sum(volume) as volume")
    )

query_kafka = candlestick_df \
    .selectExpr("CAST(symbol AS STRING) AS key", "to_json(struct(*)) AS value") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("topic", output_topic) \
    .option("checkpointLocation", checkpoint_dir) \
    .outputMode("update") \
    .option("truncate", "false") \
    .trigger(processingTime='5 minutes') \
    .start()
Shuffle is only involved when the next step is a wide operation. And for data in the KBs, it depends on the next stage: if you have a count, Spark first makes a local count on each partition before shuffling the (already reduced) data.
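You can see this local pre-aggregation in the physical plan. A small sketch with a made-up DataFrame; the plan printed by explain() shows a partial HashAggregate before the Exchange (shuffle) and a final one after it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partial-agg").getOrCreate()

# Hypothetical data, only to inspect the plan
df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3)],
    ["key", "value"],
)

# Plan: HashAggregate (partial count) -> Exchange -> HashAggregate (final),
# i.e. each partition pre-aggregates locally before the shuffle
df.groupBy("key").count().explain()
```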
Yes, definitely it is. But it's impossible to cover every function on YouTube. You can definitely check out the Spark documentation for more information. This series was meant to enable people with zero or basic knowledge to get started.
Great explanation and effort. I encountered a Py4JJavaError while executing the df_agg.writeStream.format("console").outputMode("complete").start().awaitTermination command. Could you explain why?