Ease With Data
02 How Spark Streaming Works
6:46
5 months ago
01 Spark Streaming with PySpark - Agenda
1:32
5 months ago
26 Spark SQL, Hints, Spark Catalog and Metastore
19:20
6 months ago
25 AQE aka Adaptive Query Execution in Spark
11:52
6 months ago
21 Broadcast Variable and Accumulators in Spark
12:35
6 months ago
20 Data Caching in Spark
13:19
6 months ago
19 Understand and Optimize Shuffle in Spark
15:14
6 months ago
Comments
@vipulsarode2722 6 hours ago
Hello, what if I wanted to do this in VS Code instead of the Jupyter Notebook with Docker shown in the video?
@IswaryaPydimarri 15 hours ago
What does "invited to apply" mean on Naukri, and how should I reply?
@Kevin-nt4eb 22 hours ago
So in deployment mode the driver program is submitted inside an executor which is present inside the cluster. Am I right?
@anveshkonda8334 1 day ago
Thanks a lot for sharing. It would be very helpful if you added the data directory to the GitHub repo.
@SharadSonwane-xk1ht 1 day ago
Great 👍
@easewithdata 1 day ago
Thank you ❤️ Please make sure to share with your network over LinkedIn 👍
@mohammadaftab7002 3 days ago
Thanks for this valuable insight. Expecting similar videos on Apache Iceberg and Hudi in the future.
@easewithdata 1 day ago
Sure, and thank you ❤️ Please make sure to share with your network over LinkedIn 👍
@SonuKumar-fn1gn 4 days ago
Very nice video
@easewithdata 1 day ago
Thank you ❤️ Please make sure to share with your network over LinkedIn 👍
@irannamented9296 6 days ago
Need to understand one thing: why are yyyy and dd not in capital letters? Is there any reason for that?
@easewithdata 5 days ago
Spark follows the datetime pattern format documented here (it mostly resembles Unix formats, and the pattern letters are case-sensitive): spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
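A minimal PySpark sketch of why the case matters (the session name and sample date are made up): lowercase y, M, and d parse year, month, and day-of-month, while uppercase D means day-of-year and lowercase m means minute-of-hour.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, to_date

spark = SparkSession.builder.appName("datetime-patterns").getOrCreate()

df = spark.createDataFrame([("2024-02-05",)], ["raw"])

# Pattern letters are case-sensitive: yyyy-MM-dd = year, month, day-of-month.
parsed = df.select(to_date(col("raw"), "yyyy-MM-dd").alias("dt"))

# "D" (uppercase) is day-of-year, a different field from "d".
parsed.select("dt", date_format(col("dt"), "D").alias("day_of_year")).show()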
@ruantenorio4442 7 days ago
Thanks for your lessons! You covered all the gaps between the main concepts.
@easewithdata 5 days ago
Thank you so much 😊 Please make sure to share with your network over LinkedIn ❤️
@RxLocum 7 days ago
Thanks for such detailed work. You're a hero.
@easewithdata 5 days ago
Thank you so much 😊 Please make sure to share with your network over LinkedIn ❤️
@sumitrawall 7 days ago
Sir, can you make a video on setting up Spark and Hadoop with Docker?
@easewithdata 5 days ago
For a Spark setup, Hadoop is not mandatory; you can run a standalone Spark setup using Docker. But if you still want both, you can clone and follow the steps from this GitHub repo: github.com/Marcel-Jan/docker-hadoop-spark
@sumitrawall 7 days ago
Can a fresher become a data engineer?
@easewithdata 5 days ago
Yes, absolutely. Please start with SQL, at least one programming language, and Spark.
@Aman-lv2ee 7 days ago
Please make a video on creating a resume for senior data engineers, and please share the template. Thanks!
@easewithdata 5 days ago
Will definitely try. Thanks.
@Learn2Share786 8 days ago
Thanks, please share the senior data engineer resume template; it will help.
@easewithdata 8 days ago
Sure, will try to share the same.
@shivakant4698 9 days ago
Where is Spark's standalone cluster running, on Docker or somewhere else? Please tell me, because my cluster execution code is not running. Why?
@easewithdata 8 days ago
The standalone cluster used in this tutorial runs on Docker, and you can set it up yourself. For the notebook: hub.docker.com/r/jupyter/pyspark-notebook. You can use this Docker file to set up the cluster: github.com/subhamkharwal/docker-images/tree/master/spark-cluster-new
@rakeshpanigrahi577 9 days ago
Thanks bro :)
@easewithdata 8 days ago
Thanks, please make sure to share with your network over LinkedIn ❤️
@user-br6oe3kf9k 11 days ago
How can I contact you, sir?
@easewithdata 10 days ago
You can connect with me over Topmate: topmate.io/subham_khandelwal/
@user-br6oe3kf9k 6 days ago
@@easewithdata Hi sir, will you be available today, please? I don't have time until the weekend.
@akshaykadam1260 11 days ago
Great work
@easewithdata 10 days ago
Thank you for your feedback 💓 Please make sure to share it with your network over LinkedIn 👍
@ComedyXRoad 12 days ago
Thank you for your efforts
@easewithdata 10 days ago
Thank you for your feedback 💓 Please make sure to share it with your network over LinkedIn 👍
@ComedyXRoad 13 days ago
Thanks for your efforts, it helps a lot
@easewithdata 12 days ago
Thanks ❤️ Please make sure to share with your network over LinkedIn 🛜
@sushantashow000 13 days ago
Can accumulator variables be used to calculate an average as well? When we calculate a sum, each executor can do its part, but an average won't work the same way.
@easewithdata 12 days ago
Hello Sushant, to calculate the average the simplest approach is to use two variables: one for the sum and another for the count. Later you can divide the sum by the count to get the average. If you like the content, please make sure to share it with your network 🛜
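A minimal sketch of that sum-and-count approach (the values and variable names are illustrative); accumulator updates should happen inside an action such as foreach.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-avg").getOrCreate()
sc = spark.sparkContext

total = sc.accumulator(0.0)   # running sum across all executors
count = sc.accumulator(0)     # running count across all executors

def track(value):
    total.add(value)
    count.add(1)

# foreach is an action, so Spark applies each successful task's updates once.
sc.parallelize([10, 20, 30, 40]).foreach(track)

avg = total.value / count.value if count.value else None
print(avg)  # 25.0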
@ComedyXRoad 14 days ago
Thank you. In real time, do we use cluster mode or the client mode which you are using now?
@easewithdata 12 days ago
I am using the client mode.
@shivakant4698 16 days ago
localhost:4040 is not working after I set .master("spark://e75727ddf432:7077"). How can this be solved?
@Amarjeet-fb3lk 16 days ago
Why did you make 32 shuffle partitions if you have 8 cores? If one partition is processed on a single core, where will it get the remaining 24 cores from?
@easewithdata 12 days ago
The 8 cores will process all 32 partitions in 4 rounds of 8 tasks each (8 x 4 = 32).
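A small sketch of that setup (the local[8] master and the partition count are assumptions for illustration): with 8 cores and 32 shuffle partitions, each core picks up a new task as soon as it finishes one, so the shuffle stage completes in roughly 4 waves of 8 tasks.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-waves")
    .master("local[8]")                            # 8 cores, assumed for the example
    .config("spark.sql.shuffle.partitions", "32")  # 32 tasks after a wide operation
    .getOrCreate()
)

df = spark.range(1_000_000)
counts = df.groupBy((df.id % 100).alias("bucket")).count()   # wide transformation -> shuffle
counts.write.format("noop").mode("overwrite").save()         # run the job; the UI shows 32 shuffle tasks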
@shivakant4698 17 days ago
When I refresh, my Spark UI gives an error. I am using "spark://6b16b66805db:7077", and "localhost:4040" is also not working; I set .master("spark://6b16b66805db:7077"). How can this be solved, please?
@SanthoshKumar-sl7zc 18 days ago
Thanks for the explanation, very useful
@easewithdata 12 days ago
Glad it was helpful! Please make sure to share with your network over LinkedIn ❤️
@vaibhavkumar38 19 days ago
From the video: select, where, group by, etc. are transformations. We have narrow transformations and wide transformations. Wide transformations are those where data has to move or interact with data in other partitions in the next stages.
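A minimal PySpark illustration of that distinction (the column math is arbitrary):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()

df = spark.range(1_000)

# Narrow: each output partition depends on a single input partition, so no data moves.
narrow = df.select("id").where(col("id") % 2 == 0)

# Wide: rows with the same key must be brought together, which forces a shuffle.
wide = narrow.groupBy((col("id") % 10).alias("bucket")).count()

wide.explain()   # the physical plan shows an Exchange (shuffle) introduced by the groupBy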
@vaibhavkumar38 19 days ago
Again from the video itself: executors are JVM processes, and 1 core can do 1 task at a time; in the picture above we have 6 cores, so 6 tasks were possible.
@vaibhavkumar38 19 days ago
Shuffle is the boundary which divides a job into stages.
@vaibhavkumar38 19 days ago
Great explanation. Liked the illustration that 2 counts happened, and the fact that after the local count and before the global count some shuffling happened.
@easewithdata 19 days ago
Thanks 👍 Please make sure to share with your network over LinkedIn ❤️
@Bijuthtt 21 days ago
Awesome. Super explanation. I love it.
@easewithdata 21 days ago
Thanks, please share with your network over LinkedIn ❤️
@sovikguhabiswas8838 21 days ago
Why are we not using withColumn instead of expr?
@easewithdata 21 days ago
Just to show all possible options. withColumn is also used in later videos.
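For reference, a small sketch showing that the two styles produce the same column (the sample data and the 1.18 multiplier are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

spark = SparkSession.builder.appName("withcolumn-vs-expr").getOrCreate()

df = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["id", "amount"])

# Column API via withColumn ...
with_column = df.withColumn("amount_with_tax", col("amount") * 1.18)

# ... versus the same expression written as SQL through expr().
with_expr = df.select("*", expr("amount * 1.18 AS amount_with_tax"))

with_column.show()
with_expr.show()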
@anveshkonda8334 22 days ago
You are awesome, bro. Thanks a lot.
@easewithdata 22 days ago
Glad to hear that ☺️ Please make sure to share with your network over LinkedIn ❤️
@deepanshuaggarwal7042 22 days ago
What is the use case of slide duration in a streaming app? Any real-world example?
@easewithdata 21 days ago
It's basically used for cumulative aggregations over sliding windows.
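A small sketch using the built-in rate source as a stand-in for a real stream: a 10-minute window that slides every 5 minutes gives a rolling aggregate such as "events in the last 10 minutes, refreshed every 5 minutes".

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("sliding-window").getOrCreate()

events = (
    spark.readStream
    .format("rate")                  # test source producing (timestamp, value) rows
    .option("rowsPerSecond", 10)
    .load()
)

# window(eventTime, windowDuration, slideDuration): with a 5-minute slide,
# each event falls into two overlapping 10-minute windows.
agg = events.groupBy(window(col("timestamp"), "10 minutes", "5 minutes")).count()

query = agg.writeStream.outputMode("update").format("console").start()
query.awaitTermination()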
@Bijuthtt 23 days ago
You are awesome, man. I had been trying to set up Spark and Kafka in Docker for a long time and got it done today. Thank you very much.
@easewithdata 23 days ago
Glad I could help. Please make sure to share with your network over LinkedIn ❤️
@deepanshuaggarwal7042 23 days ago
Hi, I have a doubt: how can we check from the Spark UI whether a stream has multiple sinks?
@easewithdata 23 days ago
Allow me some time to find the exact screenshot for you.
@DataEngineerPratik 23 days ago
What if both tables are very small, like one is 5 MB and the other is 9 MB? Then which DataFrame is broadcast across the executors?
@easewithdata 23 days ago
In that case it doesn't matter; however, AQE always prefers to broadcast the smaller table.
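A short sketch of both options (the table names and pretend sizes are hypothetical): either rely on the optimizer, which broadcasts any side below spark.sql.autoBroadcastJoinThreshold (10 MB by default), or force the choice with an explicit broadcast hint.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

orders = spark.createDataFrame([(1, "A"), (2, "B")], ["order_id", "cust_id"])           # pretend ~9 MB
customers = spark.createDataFrame([("A", "Alice"), ("B", "Bob")], ["cust_id", "name"])  # pretend ~5 MB

# Explicit hint: broadcast the smaller side regardless of statistics.
joined = orders.join(broadcast(customers), on="cust_id", how="inner")

# Without the hint, both tables fit under the default 10 MB threshold, so the
# planner still picks a broadcast join and prefers the smaller side.
joined.explain()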
@DataEngineerPratik 22 days ago
@@easewithdata Thanks! I have been following you for more than a month and it has been a great learning experience. We would like you to make an end-to-end project in PySpark.
@nishantsoni9330 24 days ago
One of the best in-depth explanations, thanks :) Could you please make a video on an end-to-end data engineering project, from requirement gathering to deployment?
@easewithdata 24 days ago
Thanks ❤️ Please make sure to share with your network on LinkedIn 🛜
@hamedtamadon6520 26 days ago
Hello, and thanks for sharing these useful videos. How do you handle writing to Delta tables? The best practice is for each Parquet file to be between 128 MB and 1 GB, but each micro-batch is far smaller than that. How do you handle this, or how do you collect batches until they reach the mentioned size and only then write to the Delta Lake?
@easewithdata 24 days ago
Usually micro-batch execution in Spark writes multiple small files. This requires a later stage that reads all those files and writes a compacted file of bigger size (say, one per day) to avoid the small-file issue. You can then use this compacted file to read data in your downstream systems.
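A minimal sketch of such a compaction job, assuming hypothetical paths and a date-partitioned Parquet layout; the repartition count is an assumption you would tune so each output file lands in the 128 MB to 1 GB range.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Hypothetical location where the streaming job wrote many small files for one day.
day_path = "s3://my-bucket/events/date=2024-05-01"

small_files = spark.read.parquet(day_path)

# Rewrite the day's many small files into a few larger ones.
(
    small_files
    .repartition(4)   # assumed count; tune to reach the target file size
    .write
    .mode("overwrite")
    .parquet("s3://my-bucket/events_compacted/date=2024-05-01")   # hypothetical compacted path
)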
@hamedtamadon6520 26 days ago
Hi, and thanks. I have a question: what is the best practice in distributed mode for Spark? Is it good for the number of Kafka partitions and Spark executors to be equal?
@easewithdata 24 days ago
Ideally yes, but you can also let your Spark cluster run in auto-scaling mode, which will adjust this as per the requirement.
@hamedtamadon6520 24 days ago
@@easewithdata Thanks for your swift response ❤🙏
@revathinp4551 27 days ago
For me, clearSource and sourceArchive are not working: files are not getting archived and the archive folder is not getting created. What could be the issue?
@easewithdata 24 days ago
Please check this link to confirm you are setting all the parameters as required: spark.apache.org/docs/latest/structured-streaming-programming-guide.html
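For reference, a sketch of the file-source options involved, with hypothetical paths and schema. Note that the documented option names are cleanSource and sourceArchiveDir, and the archive directory has to sit outside the input path.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-source-archive").getOrCreate()

stream = (
    spark.readStream
    .format("json")
    .schema("id INT, name STRING")                 # file streams require an explicit schema
    .option("cleanSource", "archive")              # archive (or delete) files after processing
    .option("sourceArchiveDir", "/data/archive")   # hypothetical path outside /data/input
    .load("/data/input")                           # hypothetical input directory
)

# Archiving only happens after the micro-batch that consumed a file completes,
# and it may lag by a few batches, so let several batches run before checking.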
@priyachaturvedi1164 28 days ago
How does PySpark handle Kafka consumer groups and consumers? Generally, by changing the consumer group we can start consuming data from the beginning of the topic. How do we start consuming from the beginning? Will startingOffsets set to earliest always read from the start?
@easewithdata 24 days ago
Spark Streaming uses the checkpoint directory to track offsets, so that it doesn't consume offsets that were already consumed. You can find more details on setting the starting offset with Kafka at this link: spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
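A minimal sketch (broker, topic, and checkpoint path are hypothetical): startingOffsets only applies the first time a query starts with an empty checkpoint directory; afterwards the offsets stored in the checkpoint take precedence.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-starting-offsets").getOrCreate()

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # hypothetical broker
    .option("subscribe", "orders")                          # hypothetical topic
    .option("startingOffsets", "earliest")                  # read from the beginning on first start
    .load()
)

query = (
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/orders")  # offsets are tracked here afterwards
    .start()
)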
@gulfrazahmed684 1 month ago
I have implemented watermarking using a 5-minute tumbling window to aggregate values, a 5-minute processing-time trigger, and update output mode. The problem is that I still got late data; how should this be fixed?

df = df.withWatermark("open_time", "5 minutes")

# Group data into 5-minute candlesticks by symbol
candlestick_df = df \
    .groupBy("symbol", window("open_time", "5 minutes", startTime=0)) \
    .agg(
        expr("first(open_price) as open_price"),
        expr("max(high_price) as high_price"),
        expr("min(low_price) as low_price"),
        expr("last(close_price) as close_price"),
        expr("sum(volume) as volume")
    )

query_kafka = candlestick_df \
    .selectExpr("CAST(symbol AS STRING) AS key", "to_json(struct(*)) AS value") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("topic", output_topic) \
    .option("checkpointLocation", checkpoint_dir) \
    .outputMode("update") \
    .option("truncate", "false") \
    .trigger(processingTime='5 minutes') \
    .start()
@tonee84 1 month ago
Thanks for the video, but can you shed some light on the cases where we prefer window functions to a "traditional" expr?
@easewithdata 1 month ago
Hello, both do the same work; you can use either as per your convenience.
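A small sketch of that equivalence (the data and the ranking are made up): the same window function can be written through the Window API or as a SQL OVER clause inside expr().

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, expr, row_number

spark = SparkSession.builder.appName("window-vs-expr").getOrCreate()

df = spark.createDataFrame([("A", 10), ("A", 30), ("B", 20)], ["dept", "salary"])

# Window API ...
w = Window.partitionBy("dept").orderBy(col("salary").desc())
api_style = df.withColumn("rn", row_number().over(w))

# ... versus the same window function written as a SQL expression.
expr_style = df.withColumn("rn", expr("row_number() OVER (PARTITION BY dept ORDER BY salary DESC)"))

api_style.show()
expr_style.show()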
@nikschuetz4112 1 month ago
If you can cast it to a string, why can't you cast it to JSON, or map it out and use a JSON parser?
@easewithdata 1 month ago
Yes, you can definitely cast it as you wish.
@nikschuetz4112 1 month ago
You can also just alias the struct and then select * from it, for example exploded_df.select("data_devices.*"), and it gives all the columns.
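A tiny self-contained example of that pattern (the struct and field names here are hypothetical stand-ins for the exploded column discussed above):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("struct-expand").getOrCreate()

df = spark.createDataFrame([Row(data_devices=Row(device_id=1, temp=21.5))])

# Selecting "struct.*" expands every field of the struct into its own top-level column.
df.select("data_devices.*").show()   # -> columns: device_id, temp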
@ansumansatpathy3923 1 month ago
Why is there a shuffle write for the read stage, from files to a DataFrame? Does that involve a shuffle? And why a shuffle write for only KBs worth of data?
@easewithdata 1 month ago
Shuffle is only involved when the next step is a wide operation. As for the KBs of data, it depends on the next stage: if you have a count, it will first make a local count before shuffling the data (which reduces it to KBs).
@rakeshpanigrahi577 1 month ago
Thanks, buddy. explode_outer is also an important concept.
@easewithdata 1 month ago
Yes, it definitely is. But it's impossible to cover every function on YouTube. You can definitely check out the Spark documentation for more information. This series was meant to enable people with zero or basic knowledge to get started.
@irenebinoy4002 1 month ago
Great explanations and effort. I have encountered a Py4JJavaError while executing the df_agg.writeStream.format("console").outputMode("complete").start().awaitTermination() command. Could you explain why?
@easewithdata 1 month ago
Please check the logs to debug the issue.
@souravnandy8385 1 month ago
It's really great. I do have a question: what if the JSON is a 5- or 6-level nested struct? How can we convert it to a tabular format?