
Spark Job, Stages, Tasks | Lec-11 

MANISH KUMAR
21K subscribers
32K views

Published: 29 Aug 2024

Comments: 165
@manish_kumar_1
@manish_kumar_1 1 year ago
Directly connect with me on: topmate.io/manish_kumar25
@roshankumargupta46
@roshankumargupta46 2 months ago
If you don't explicitly provide a schema, Spark will read a portion of the data to infer it. This triggers a job to read data for schema inference. If you disable schema inference and provide your own schema, you can avoid the job triggered by schema inference.
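To avoid that extra job in practice, here is a minimal PySpark sketch of passing an explicit schema (the file path and column names are hypothetical; assumes an active SparkSession named spark, as in Databricks):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema for illustration; adjust to your data
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True),
])

# With an explicit schema and no inferSchema, Spark does not need to
# sample the file, so no schema-inference job is triggered at read time
df = spark.read.csv("path/to/employees.csv", header=True, schema=schema)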
@arpanscreations6954
@arpanscreations6954 1 month ago
Thanks for your clarification
@fury00713
@fury00713 7 months ago
In Apache Spark, the spark.read.csv() method is neither a transformation nor an action: it initiates the reading of CSV data into a Spark DataFrame as part of the data-loading phase of Spark's processing model. The actual reading and processing of the data occur later, driven by Spark's lazy evaluation model.
@ChetanSharma-oy4ge
@ChetanSharma-oy4ge 6 months ago
Then how are jobs being created from actions? I mean, if actions equal jobs, what is a better way to find out?
@roshankumargupta46
@roshankumargupta46 2 months ago
If you don't explicitly provide a schema, Spark will read a portion of the data to infer it. This triggers a job to read data for schema inference.
@user-tb8ry2jl7s
@user-tb8ry2jl7s 10 months ago
1 job for read, 1 job for print 1, 1 job for print 2, 1 job for count, 1 job for collect: 5 jobs in total according to me, but I have not run the code, so I'm not sure.
@shorakhutte1887
@shorakhutte1887 9 months ago
Bro, that was a next-level explanation... thanks for sharing your great knowledge. Keep up the good work. Thanks.
@satyammeena-bu7kp
@satyammeena-bu7kp 2 months ago
Really awesome explanation! You won't find an explanation like this anywhere else. Thank you so much.
@devjyotipattnaik8588
@devjyotipattnaik8588 1 month ago
Great explanation Manish!! As per my understanding from the video, a total of 2 jobs, 4 stages, and 204 tasks will be created:
Job 1 (read): 1 stage and 1 task.
Job 2 (everything from after job 1 until the action): 3 stages and 203 tasks.
Stage 2: repartition (wide dependency transformation), 1 task.
Stage 3: select and filter (narrow dependency transformations), 2 tasks, 1 for each transformation.
Stage 4: groupBy (wide dependency transformation), 200 tasks.
Please correct me if I am wrong.
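For readers following the numbers above, here is a minimal sketch of the kind of pipeline the thread is discussing (a reconstruction from the comments, not the video's exact code; the path and column names are hypothetical, and an active SparkSession named spark is assumed):

from pyspark.sql.functions import col

employee_df = spark.read.csv("path/to/employees.csv", header=True, inferSchema=True)  # job 1: read + schema inference
employee_df = employee_df.repartition(3)                    # wide dependency: starts a new stage
employee_df = employee_df.select("name", "age", "salary")   # narrow dependency
employee_df = employee_df.filter(col("salary") > 90000)     # narrow dependency, pipelined into the same stage
result = employee_df.groupby("age").count()                 # wide dependency: shuffle, 200 partitions by default
result.collect()                                            # action: triggers job 2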
@mrinalraj4801
@mrinalraj4801 4 months ago
Great Manish. I am grateful to you for making such rare content with so much depth. You are doing a tremendous job by contributing towards the Community. Please keep up the good work and stay motivated. We are always here to support you.
@ChetanSharma-oy4ge
@ChetanSharma-oy4ge 6 months ago
I had a question: what should the order of writing be? I mean, if we are doing filter/select/partition/groupBy/distinct/count or anything else, what should we write first...
@yashkumarjha5733
@yashkumarjha5733 6 days ago
Bro, if you want to write it in an optimized way, then apply the filter first and then apply the other transformations. For example: suppose I have data for 100 employees and I only want the employees with a salary greater than 90000, and I have to give all of those employees a promotion, i.e. increase everyone's salary. In this case, if you apply the salary-increase transformation first and then filter, the data of all 100 employees gets scanned. But if you apply the filter first, and suppose only 2 people have a salary above 90000, then we only need to scan the data of those 2 employees. Sorry, the example was a bit rough, but hopefully the concept is clear. By the way, even if you apply the filter after the transformation, there is no problem, because Spark is internally designed to run in an optimized way: once you have built the whole operation and trigger the job, it will first do the filter and then apply the other transformations on top of it. Spark is very intelligent and will execute in an optimized way. I hope this answers your question.
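That optimization can be seen in the query plan. A small sketch (hypothetical path and column names; assumes an active SparkSession named spark): the filter is written last, yet Catalyst pushes it down toward the scan.

from pyspark.sql.functions import col

df = spark.read.csv("path/to/employees.csv", header=True, inferSchema=True)

# The filter is written after the select, but the optimizer pushes it toward the file scan
high_paid = df.select("name", "age", "salary").filter(col("salary") > 90000)
high_paid.explain(True)  # look for PushedFilters in the FileScan node of the physical plan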
@nishasreedharan6175
@nishasreedharan6175 4 months ago
One of the best videos ever. Thank you for this. Really helpful.
@naturehealingandpeace2658
@naturehealingandpeace2658 8 months ago
Wow, what a clear explanation! First time I understood in one go.
@user-ru2fy8nx1z
@user-ru2fy8nx1z 6 months ago
I have one doubt: there are 3 actions, namely read, collect, and count, so why is it creating only 2 jobs?
@tejathunder
@tejathunder 4 months ago
In Apache Spark, the read operation is not considered an action; it is a transformation.
@stuti5700
@stuti5700 11 months ago
Very good content. Please make detailed videos on Spark job optimization.
@shreyaspurankar9736
@shreyaspurankar9736 1 month ago
Great explanation :)
@AnandVerma
@AnandVerma 5 months ago
Num Jobs = 2
Num Stages = 4 (job 1 = 1, job 2 = 3)
Num Tasks = 204 (job 1 = 1, job 2 = 203)
@rohitbhawle8658
@rohitbhawle8658 10 months ago
Nicely explained; you make each and every concept clear. Keep it up.
@asif50786
@asif50786 1 year ago
Start a playlist with guided projects, so that we can apply these things in real life.
@akhiladevangamath1277
@akhiladevangamath1277 3 months ago
Thank you so much Manish
@deepaksharma-xr1ih
@deepaksharma-xr1ih 1 month ago
Good job, man.
@Rafian1924
@Rafian1924 1 year ago
Your channel will grow immensely, bro. Keep it up ❤
@rahuljain8001
@rahuljain8001 1 year ago
I haven't found such a detailed explanation anywhere else. Kudos.
@Rajeshkumbhkar-x6v
@Rajeshkumbhkar-x6v 1 month ago
Amazing explanation, sir ji.
@Food_panda-hu6sj
@Food_panda-hu6sj 1 year ago
One question: after groupBy, 200 partitions are created by default, where each partition holds the data for an individual key. What happens if there are fewer keys, say 100? Will that lead to only 100 partitions instead of 200? And what happens if there are more than 200 individual keys? Will it create more than 200 partitions?
@Useracwqbrazy
@Useracwqbrazy 1 year ago
I really liked this video... nobody has explained it at this level.
@KapilKumar-hk9xk
@KapilKumar-hk9xk 1 month ago
Excellent explanation. One question: with groupBy, 200 tasks are created, but most of these tasks are useless, right? How do we avoid such scenarios? Because it takes extra effort for Spark to schedule such empty-partition tasks, right...
@rushikeshsalunkhe8892
@rushikeshsalunkhe8892 17 days ago
You can repartition it to a smaller number of partitions, or you can tweak the spark.sql.shuffle.partitions config by setting it to a desirable number.
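A quick sketch of both options (values are illustrative; df stands for the DataFrame from the video's snippet, and coalesce is used for the shrink step since it avoids another full shuffle):

# Option 1: lower the shuffle-partition config (default 200) before the wide transformation
spark.conf.set("spark.sql.shuffle.partitions", "8")
counts = df.groupby("age").count()  # the shuffle now produces 8 partitions, i.e. 8 tasks

# Option 2: shrink the result afterwards; coalesce merges partitions without a full shuffle
counts = counts.coalesce(4)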
@utkarshkumar6703
@utkarshkumar6703 6 months ago
Bhai, you explain things really well.
@mantukumar-qn9pv
@mantukumar-qn9pv 1 year ago
Thanks Manish bhai... please keep your videos coming.
@ShrinchaRani
@ShrinchaRani 1 year ago
Explained so well, and bit by bit too 👏🏻
@kyransingh8209
@kyransingh8209 8 months ago
@manish_kumar_1 Correction: job 2, stage 2 runs till the groupBy, and job 2, stage 3 runs till collect.
@sharma-vasundhara
@sharma-vasundhara 6 months ago
Sir, in line 14 we have .groupby and .count. .count is an action, right? Not sure if you missed it by mistake, or if it doesn't count as an action? 🙁
@tanmayagarwal3481
@tanmayagarwal3481 5 months ago
I had the same doubt. Did you get the answer to this question? As per the UI, it also mentions only 2 jobs, whereas count should be an action :(
@AnkitNub
@AnkitNub 3 months ago
I have a question: if one job has 2 consecutive wide dependency transformations, then 1 narrow dependency, and then again 1 wide dependency, how many stages will be created? Suppose repartition, after that groupBy, then filter, and then join: how many stages will this create?
@jaisahota4062
@jaisahota4062 1 month ago
Same question.
@arshadmohammed1090
@arshadmohammed1090 3 months ago
Great job bro, you are doing well.
@sugandharaghav2609
@sugandharaghav2609 1 year ago
I remember you explained wide dependency in the video on shuffling.
@princekumar-li6cm
@princekumar-li6cm 4 months ago
Count is also an action.
@manish_kumar_1
@manish_kumar_1 4 months ago
And a transformation too.
@sudipmukherjee6878
@sudipmukherjee6878 1 year ago
Bhai, I was eagerly waiting for your videos.
@ADESHKUMAR-yz2el
@ADESHKUMAR-yz2el 3 months ago
Bhaiya, you are great.
@kumarankit4479
@kumarankit4479 2 months ago
Hi bhaiya. Why haven't we considered collect() as a job creator in the program you discussed?
@VenkataJaswanthParla
@VenkataJaswanthParla 2 months ago
Hi Manish, count() is also an action, right? If not, can you please explain what count() is?
@saumyasingh9620
@saumyasingh9620 1 year ago
This was so beautifully explained.
@rohanchoudhary672
@rohanchoudhary672 3 months ago
df.rdd.getNumPartitions()  # Output: 1
df.repartition(5)
df.rdd.getNumPartitions()  # Output: 1
Using Databricks Community Edition, sir.
@manish_kumar_1
@manish_kumar_1 3 months ago
Yes, I don't see any issue. If you don't assign the repartitioned df to a variable, you will get the same result.
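In other words, repartition returns a new DataFrame rather than modifying the existing one, so the result has to be assigned back. A minimal sketch with the df from the comment above:

df.repartition(5)                 # result discarded; df itself is unchanged
print(df.rdd.getNumPartitions())  # still 1

df = df.repartition(5)            # assign the result back
print(df.rdd.getNumPartitions())  # now 5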
@samirdeshmukh9886
@samirdeshmukh9886 1 year ago
Thank you sir.
@nikhilhimanshu9758
@nikhilhimanshu9758 7 months ago
What would happen if there were one more narrow transformation after filter, like filter --> flatMap --> select? How many tasks would that create?
@shekharraghuvanshi2267
@shekharraghuvanshi2267 11 months ago
Hi Manish, in the second job there were 203 tasks and in the first job there was 1, so in total there are 204 tasks in the complete application? I am a bit confused between 203 and 204. Kindly clarify.
@parthagrawal5516
@parthagrawal5516 4 months ago
After repartition(3), the DAG will still show the 200 default partitions, sir.
@tanushreenagar3116
@tanushreenagar3116 6 months ago
VERY VERY HELPFUL
@maruthil5179
@maruthil5179 1 year ago
Very well explained, bhai.
@lalitgarg8965
@lalitgarg8965 3 months ago
Count is also an action, so shouldn't there be 3 jobs?
@aditya9c
@aditya9c 5 months ago
Wow!
@sairamguptha9988
@sairamguptha9988 1 year ago
Glad to see you Manish... bhai, any update on the project details?
@GauravKumar-im5zx
@GauravKumar-im5zx 1 year ago
Glad to see you, brother.
@user-rh1hr5cc1r
@user-rh1hr5cc1r 4 months ago
7:40 print is an action, so there should be 4 jobs in the given code, right? Correct me if I am wrong.
@jatinyadav6158
@jatinyadav6158 7 months ago
Hi @manish_kumar_1, I have one question: for the wide transformation you said that in the groupBy stage 3 there will be 200 tasks, corresponding to the 200 partitions. But can you tell me why these 200 partitions were created in the first place?
@deeksha6514
@deeksha6514 5 months ago
Can an executor hold 2 partitions, or does having 2 partitions mean they will be on two different machines?
@venugopal-nc3nz
@venugopal-nc3nz 1 year ago
Hi Manish, in Databricks too, when groupBy() is invoked, does it create 200 tasks by default? How can the 200 tasks be reduced when using groupBy(), to optimize the Spark job?
@manish_kumar_1
@manish_kumar_1 1 year ago
There is a configuration which can be set. Just google how to set a fixed number of partitions after a join.
@jilsonjoe6259
@jilsonjoe6259 1 year ago
Great Explanation ❤
@lucky_raiser
@lucky_raiser 1 year ago
Bro, how many more days will the Spark series take, and will you make a complete DE project with Spark at the end? BTW, I have watched and implemented all your theory and practical videos. Great sharing ❤
@saumyasingh9620
@saumyasingh9620 1 year ago
When I ran it in a notebook, it gave 5 jobs like below for this snippet of code, not only 2. Can you explain?
Job 80 (Stages: 1/1) - Stage 95: 1/1
Job 81 (Stages: 1/1) - Stage 96: 1/1
Job 82 (Stages: 1/1) - Stage 97: 1/1
Job 83 (Stages: 1/1, 1 skipped) - Stage 98: 0/1 skipped; Stage 99: 2/2
Job 84 (Stages: 1/1, 2 skipped) - Stage 100: 0/1 skipped; Stage 101: 0/2 skipped; Stage 102: 1/1
@vinitsunita
@vinitsunita 1 year ago
Very good explanation, bro.
@gujjuesports3798
@gujjuesports3798 8 months ago
.count on a grouped dataset is a transformation, not an action, correct? If it were like employee_df.count(), then it would be an action.
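That distinction in a minimal sketch (employee_df is a hypothetical DataFrame): count on grouped data is a transformation that returns a DataFrame, while count on a DataFrame is an action that returns a number.

counts_df = employee_df.groupBy("age").count()  # transformation: returns a DataFrame, nothing runs yet
counts_df.collect()                             # the grouped count executes only when an action is called

total = employee_df.count()                     # action: triggers a job immediately, returns an int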
@salmansayyad4522
@salmansayyad4522 4 months ago
Excellent!!
@amazhobner
@amazhobner 7 months ago
Around 6:35 it is wrong: read can avoid being an action if you pass a manually created schema containing a list of all the columns. Refer to this for a practical: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-VLi9WS8SJFY.html
@rajkumardubey5486
@rajkumardubey5486 1 month ago
Count should also be an action, right? That means 3 jobs were created?
@mayankkandpal1565
@mayankkandpal1565 9 months ago
Nice explanation.
@garvitgarg6749
@garvitgarg6749 9 months ago
Bhai, I'm using Spark 3.4.1, and when I group data using groupBy (I have 15 records in a dummy dataset), it creates 4 jobs to process 200 partitions. Why? Is this a recent enhancement? I observed the same thing not only in the latest version but also in Spark 3.2.1. Could you please explain this?
@villagers_01
@villagers_01 2 months ago
Bro, a job is created on an action, but the number of actions != the number of jobs, because a job is created whenever a new RDD is needed. In groupBy we need to shuffle the data, and since RDDs are immutable, a new RDD has to be created after the shuffle. That's why a job is created whenever a new RDD is needed. In your case: 1 job for the read, 1 job for the schema, 1 job for the shuffle, and 1 job for the display.
@jay_rana
@jay_rana 5 months ago
What if spark.sql.shuffle.partitions is set to some value? In that case, what will be the number of tasks in the groupBy stage?
@codingwithanonymous890
@codingwithanonymous890 10 months ago
Sir, please make playlists for other data engineering tools also.
@ramyabhogaraju2416
@ramyabhogaraju2416 1 year ago
I have been waiting for your videos. How many more days will the Spark series take to complete?
@ranvijaymehta
@ranvijaymehta 1 year ago
Thanks Sir
@prashantsrivastava9026
@prashantsrivastava9026 1 year ago
Nicely explained
@gauravkothe9558
@gauravkothe9558 9 months ago
Hi sir, I have one doubt: will collect create a task and a stage or not? Because you mentioned 203 tasks.
@user-hb6vq9rg1h
@user-hb6vq9rg1h 1 year ago
Can I know more about executors? How many executors will there be in a worker node? And does the number of executors depend on the number of cores in the worker node?
@divyabhansali2182
@divyabhansali2182 1 year ago
Are there 200 default tasks in groupBy even if there are only 3 distinct ages? If so, what will be in the remaining 197 tasks (which age groups will they hold)?
@kumarabhijeet8968
@kumarabhijeet8968 1 year ago
Manish bhai, how many videos in total will there be in the theory and practical series?
@riteshrajsingh7437
@riteshrajsingh7437 3 months ago
Hi Manish, I have a doubt: in groupBy, count is also an action, so why is it not counted as an action here?
@manish_kumar_1
@manish_kumar_1 3 months ago
It will become clear a few videos from now.
@user-lh7bw6vw4e
@user-lh7bw6vw4e 8 months ago
One query @Manish: spark.read is a transformation and not an action, right?
@dpkbit08
@dpkbit08 6 months ago
I tried the same example, and in my case 4 jobs were created. Is there some other config needed?
@moyeenshaikh4378
@moyeenshaikh4378 10 months ago
Bhai, job 2 will start from collect, right? So from after the read up to collect it will all be job 1, right?
@mohitgupta7341
@mohitgupta7341 1 year ago
Bro, amazing explanation.
@sachindramishra2813
@sachindramishra2813 6 months ago
df = spark.read.parquet()
print(df.rdd.getNumPartitions())
df = df.repartition(2)
print(df.rdd.getNumPartitions())
df = df.groupby("AccountId").count()
df.collect()
Why does this code create 5 Spark jobs in Databricks? I've only used 2 actions.
@sravankumar1767
@sravankumar1767 5 months ago
Nice explanation 👌👍👏, but can you please explain in English? Then everyone all over the world can watch 🌎✨
@ankitachauhan6084
@ankitachauhan6084 3 months ago
Why was count() not counted as an action when counting the jobs?
@manish_kumar_1
@manish_kumar_1 3 months ago
You will understand in the upcoming lectures.
@prabhatsingh7391
@prabhatsingh7391 1 year ago
Hi Manish, count is also an action, and you have written count just after groupBy in the code snippet. Why is count not considered a job here?
@manish_kumar_1
@manish_kumar_1 1 year ago
You will get the explanation in upcoming videos. Count is both an action and a transformation; which one it acts as in a given situation is explained in detail in later videos.
@MiliChoksi-gc8if
@MiliChoksi-gc8if 4 months ago
So wherever there is a groupBy, should we assume 200 tasks?
@manish_kumar_1
@manish_kumar_1 4 months ago
Yes, if AQE is disabled. If it is enabled, then the count depends on the data volume and the default parallelism.
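For reference, a sketch of toggling that behaviour with standard Spark config keys (values illustrative; assumes an active SparkSession named spark):

# With AQE on, shuffle partitions are coalesced at runtime, so the
# post-groupBy task count follows data volume instead of a fixed 200
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# With AQE off, the shuffle always produces spark.sql.shuffle.partitions tasks
spark.conf.set("spark.sql.adaptive.enabled", "false")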
@sharmadtadkodkar3731
@sharmadtadkodkar3731 4 months ago
What command did you use to run the job?
@praveenkumarrai101
@praveenkumarrai101 1 year ago
@AnkitaSakseria
@AnkitaSakseria 4 months ago
Count is also an action; why was no job created for it?
@manish_kumar_1
@manish_kumar_1 4 months ago
Count is both an action and a transformation. You will find out in the upcoming lectures.
@mr.random2001
@mr.random2001 8 months ago
@manish_kumar_1 In the previous videos you described count() as an action, but in this video you are not treating it as an action. Why?
@serenitytime1959
@serenitytime1959 3 months ago
Could you please upload the required files? I just want to run it and see for myself.
@kunalk3830
@kunalk3830 10 months ago
Bhai, it shows me 3 jobs, 3 stages, and 4 tasks: job 0 for load with 1 task --> job 1 for collect with 2 tasks --> job 3 for collect with 1 task, but shown as skipped. I didn't get what's wrong; I used the same code but with different data, about 7 MB in size.
@manish_kumar_1
@manish_kumar_1 10 months ago
No need to get too rigid here. Spark does a lot of optimization, and some jobs often get skipped. I ran this in a controlled environment to show how it works. During project development you are not going to count how many jobs, stages, or tasks there are, so even if you don't get the same numbers, just chill.
@wgood6397
@wgood6397 5 months ago
Please enable subtitles, bro.
@RahulAgarwal-uh3pf
@RahulAgarwal-uh3pf 3 months ago
In the code snippet shown at the start, count is also a job, right?
@manish_kumar_1
@manish_kumar_1 3 months ago
No. You will find out why in the upcoming lectures.
@mahnoorkhalid6496
@mahnoorkhalid6496 11 months ago
I executed the same job, but it created 4 jobs, each with 1 stage and 1 task. I think it created a new job for every wide transformation. Please confirm and guide.
@villagers_01
@villagers_01 2 months ago
Bro, a job is created on an action, but the number of actions != the number of jobs, because a job is created whenever a new RDD is needed. In groupBy we need to shuffle the data, and since RDDs are immutable, a new RDD has to be created after the shuffle. That's why a job is created whenever a new RDD is needed.
@Watson22j
@Watson22j 1 year ago
No problem bhaiya, just please finish this series completely, because nobody on YouTube has explained it in this much detail.
@codetechguru1
@codetechguru1 6 months ago
Why not stage 4? You say each job has a minimum of one stage and one task, so why doesn't job 3 get a stage and a task?
@dilipkuncha5728
@dilipkuncha5728 9 months ago
Hi Manish, will count() not be considered an action?
@manish_kumar_1
@manish_kumar_1 9 months ago
Count is both an action and a transformation. It will become clear in the upcoming lectures.
@rajukundu1308
@rajukundu1308 1 year ago
The number of actions is supposed to equal the number of jobs. In the code snippet shown there were three actions (read, count, collect), so as per the theory three job IDs should be created, but in the Spark UI only two jobs are created. Can you help me with this?
@rajukundu1308
@rajukundu1308 1 year ago
Why were three job IDs not created?
@manish_kumar_1
@manish_kumar_1 1 year ago
Count is both a transformation and an action. In the given example it works as a transformation, not as an action. I will be uploading the aggregation video soon; there you will get to know more about count's behaviour.
@rajukundu1308
@rajukundu1308 1 year ago
@manish_kumar_1 Thanks for the prompt response. Sure, eagerly waiting for your new video.
@narag9802
@narag9802 8 months ago
Do you have an English version of these videos?
@sivamani7711
@sivamani7711 4 months ago
Hi Manish, can you make the same content in English?
@manish_kumar_1
@manish_kumar_1 4 months ago
I am planning to shoot the whole Spark series in English too, but I haven't finalized when I will start.
@sivamani7711
@sivamani7711 4 months ago
@manish_kumar_1 Thanks!
@venumyneni6696
@venumyneni6696 1 year ago
Hi Manish, why doesn't the collect() method create a new stage (stage 4) in job 2, since it needs to send the data from 200 partitions to the driver node?
@manish_kumar_1
@manish_kumar_1 1 year ago
Collect is an action, not a transformation.
@venumyneni6696
@venumyneni6696 1 year ago
@manish_kumar_1 Thanks for the reply, Manish. What happens after the groupBy in this case? Spark transfers the data in the 200 partitions to the driver, right? Don't we need any tasks for that process? Thanks in advance.
@manish_kumar_1
@manish_kumar_1 1 year ago
@venumyneni6696 I think you are missing some key points about the driver and executors. Please clear up your basics: read multiple blogs, or watch my videos in sequence.
@AAMIRKHAN-qy2cl
@AAMIRKHAN-qy2cl 8 months ago
@manish_kumar_1 Correct, but you said that one stage is created for every action, so the total number of stages should be 5.
@wgood6397
@wgood6397 5 months ago
Please enable subtitles in all videos, bro.