Afaque Ahmad
Hey, I’m Afaque Ahmad, a Senior Data Engineer at QuantumBlack, McKinsey & Company, and previously at Urban Company (formerly UrbanClap). This channel grew out of my endeavour to teach and simplify complex data engineering concepts and, in doing so, perhaps offer a fresh perspective on how we approach them.

Check out my channel for best-in-class videos on Apache Spark, complex SQL, advanced Python, and emerging topics in data engineering.

LinkedIn: www.linkedin.com/in/afaque-ahmad-5a5847129/
GitHub: github.com/afaqueahmad7117
Contact Email: dataengineer7117@gmail.com
Apache Spark Memory Management
23:09
5 months ago
Speed Up Your Spark Jobs Using Caching
20:32
11 months ago
How Salting Can Reduce Data Skew By 99%
28:55
11 months ago
Why Data Skew Will Ruin Your Spark Performance
12:36
11 months ago
Master Reading Spark DAGs
34:14
11 months ago
Master Reading Spark Query Plans
39:19
11 months ago
Comments
@arghyakundu8558
@arghyakundu8558 2 hours ago
Excellent content!! Loved it. Such a detailed explanation of the salting technique, with graphical representation.
@the_gamer2416
@the_gamer2416 23 hours ago
Hi Sir, please make a detailed course on Apache Spark that covers every aspect of Spark for the Data Engineer role. There are already a lot of beginner courses in the market, so please keep the course from intermediate to advanced level. Please try to make the videos in Hindi, it would be very helpful.
@vishalpathak8266
@vishalpathak8266 1 day ago
Thank you for this video!!
@bhargaviakkineni
@bhargaviakkineni 1 day ago
Sir, please do a video on executor out-of-memory and driver out-of-memory errors in Spark.
@snehitvaddi
@snehitvaddi 3 days ago
This is helpful, but I still have a few doubts. 1. If a broadcast join is immune to skewness, why is there a salting technique? 2. In the broadcast join example, the customer dataset appeared to be outside of any executor. Where is it actually stored? How can we specify its storage location?
@shaifalipal9415
@shaifalipal9415 3 days ago
Broadcast is only possible if the other table is small enough to be replicated.
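For intuition: a broadcast join ships the small table to every executor, where each partition of the big table does a local hash join against it, so the join key never has to be shuffled (which is why skew on that key stops mattering). A toy pure-Python sketch of the idea — dataset contents are made up for illustration; in PySpark you would hint this with `broadcast(small_df)` from `pyspark.sql.functions`:

```python
# Conceptual sketch of a broadcast (map-side hash) join in plain Python.
# In Spark, broadcasting copies the small table to every executor so the
# large table never has to be shuffled on the join key.
# Dataset contents below are made up for illustration.

small_customers = [(1, "Alice"), (2, "Bob")]                      # small: replicated everywhere
big_orders = [(101, 1, 250.0), (102, 2, 75.0), (103, 1, 30.0)]    # (order_id, customer_id, amount)

# "Broadcast" step: build a hash map of the small side once.
lookup = {customer_id: name for customer_id, name in small_customers}

# Each partition of the big side then joins locally, with no shuffle.
joined = [
    (order_id, lookup[customer_id], amount)
    for order_id, customer_id, amount in big_orders
    if customer_id in lookup
]
```

This also answers where the broadcast copy lives: a full replica of the small table sits in each executor's memory, which is exactly why the small side must stay small.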
@narutomaverick
@narutomaverick 3 days ago
Want to understand better? Read this summary of the video:
1. Why use caching? Caching can significantly improve performance by reusing persisted data instead of recomputing it; it avoids redundant computation on the same dataset across multiple actions.
2. Lazy evaluation and caching: Spark uses lazy evaluation, where transformations are not executed until an action is triggered. Caching helps by materializing the result of a long sequence of transformations, avoiding recomputation.
3. Spark's lineage graph: Spark tracks transformations in a lineage graph. Caching breaks the lineage, reducing the size of the graph and improving performance.
4. Caching vs. no caching: the demo shows a significant performance improvement when caching is used, as seen in the Spark UI.
5. Persist and storage levels: the `persist()` method is used for caching, with different storage levels available. Levels like `MEMORY_ONLY`, `DISK_ONLY`, and their combinations control memory/disk usage and replication; choose one based on your requirements and cluster resources.
6. When to cache? Cache datasets that are reused multiple times, especially after a long sequence of transformations; cache intermediate datasets that are expensive to recompute; and be mindful of cluster resources — cache judiciously.
7. Unpersist: use `unpersist()` to remove cached data and free up resources when no longer needed. Spark may automatically evict cached data if memory is needed.
If you liked it, upvote it.
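The lazy-evaluation point above is the crux: every action replays the whole lineage unless an intermediate result is materialized. A toy pure-Python analogue of that behaviour (not Spark API; in PySpark the equivalent calls are `df.cache()` / `df.persist(StorageLevel.MEMORY_ONLY)` and `df.unpersist()`):

```python
# Toy analogue of Spark's lazy evaluation + caching, in plain Python.
# A "plan" replays the whole lineage on every action unless the
# intermediate result has been materialized ("cached") first.

compute_count = 0  # how many times the expensive lineage is replayed

def expensive_transform(rows):
    global compute_count
    compute_count += 1
    return [r * 2 for r in rows]

source = [1, 2, 3]

# Without caching: two actions -> the transform runs twice.
plan = lambda: expensive_transform(source)
total = sum(plan())            # "action" 1
count = len(plan())            # "action" 2
runs_without_cache = compute_count

# With "caching": materialize once, both actions reuse the result.
compute_count = 0
cached = expensive_transform(source)   # like df.persist() + a first action
total = sum(cached)
count = len(cached)
runs_with_cache = compute_count
```

The same trade-off as in Spark applies: the cached copy costs memory, so it only pays off when the lineage is replayed more than once.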
@choubeysumit246
@choubeysumit246 4 days ago
Great tutorials 🙏, please create more videos on Spark from a beginner's point of view.
@narutomaverick
@narutomaverick 7 days ago
Your channel is so underrated. Please don't stop!
@user-pq9tx6ui2t
@user-pq9tx6ui2t 7 days ago
I like your videos very much, they're insightful. Can you please make a series/videos on Spark interview-oriented questions? Thanks in advance.
@mohitupadhayay1439
@mohitupadhayay1439 10 days ago
Hi Afaque, a suggestion: you could start from the beginning to connect the dots! For example, if in your scenario we have X node machines with Y workers and Z executors, and you do REPARTITION and fit the data like this, then this could happen; otherwise the machine would sit idle, and so on.
@tumbler8324
@tumbler8324 10 days ago
Perfect explanation & perfect examples throughout the playlist. Brother, please also explain how Change Data Capture and Slowly Changing Dimensions are applied in projects.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Thanks for the kind words, brother @tumbler8324. It's all coming in a while, brother, it's in the pipeline :)
@user-pq9tx6ui2t
@user-pq9tx6ui2t 11 days ago
a lot of knowledge in just one video
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Appreciate it @user-pq9tx6ui2t :)
@skybluelearner4198
@skybluelearner4198 11 days ago
I spent INR 42000 on a Big Data course but could not understand this concept clearly because the trainer himself lacked clarity. Here I understood completely.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Appreciate the kind words @skybluelearner4198 :)
@Dhawal-ld2mc
@Dhawal-ld2mc 14 days ago
Great explanation of such a complex topic, thanks and keep up the good work.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Thanks man @Dhawal-ld2mc :)
@mahendranarayana1744
@mahendranarayana1744 14 days ago
Great explanation, thank you. But how would we know how to configure the exact (or at least best) "spark.sql.shuffle.partitions" at run time? Each run/day the volume of the data is going to change, so how do we determine the data volume at run time to set the shuffle partitions number?
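On determining the number at run time: one approach (following the roughly 200 MB-per-partition target discussed on this channel; both the target size and how you estimate the shuffle bytes are assumptions to tune for your cluster) is to derive the partition count from the shuffle data volume:

```python
import math

# Rough run-time sizing of spark.sql.shuffle.partitions from the shuffle
# data volume, assuming a ~200 MB target per partition (a heuristic, not
# a hard rule; tune the target for your cluster).

def shuffle_partitions(shuffle_bytes, target_bytes=200 * 1024**2):
    """Partition count so each shuffle partition holds ~target_bytes."""
    return max(1, math.ceil(shuffle_bytes / target_bytes))

parts = shuffle_partitions(100 * 1024**3)   # a 100 GB shuffle -> 512 partitions
```

Before the wide transformation you could then call `spark.conf.set("spark.sql.shuffle.partitions", str(parts))`; note that from Spark 3 onward, AQE can also coalesce shuffle partitions at run time.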
@SurendraKumar-qj9tv
@SurendraKumar-qj9tv 15 days ago
Awesome explanations! Please share more relevant videos with us.
@mohitupadhayay1439
@mohitupadhayay1439 15 days ago
Drop-dead gorgeous stuff.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Appreciate it man :)
@mohitupadhayay1439
@mohitupadhayay1439 15 days ago
Hey Afaque, great tutorials. You should consider doing a full end-to-end Spark project with a big volume of data so we can understand the challenges faced and how to tackle them. Would be really helpful!
@afaqueahmad7117
@afaqueahmad7117 9 days ago
A full-fledged in-depth project using Spark and the modern data stack coming soon, stay tuned @mohitupadhayay1439 :)
@sonlh81
@sonlh81 16 days ago
Not easy to understand, but it's great.
@Akshaykumar-pu4vi
@Akshaykumar-pu4vi 17 days ago
Useful information
@leonardopetraglia6040
@leonardopetraglia6040 17 days ago
Thanks for the video! I also have a question: when I execute a complex query, there are multiple stages with different shuffle write sizes. Which one do I have to take into consideration when computing the optimal number of shuffle partitions?
@deepikas7462
@deepikas7462 19 days ago
All the concepts are clearly explained. Please do more videos.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Appreciate the kind words @deepikas7462, more coming soon :)
@abusayed.mondal
@abusayed.mondal 19 days ago
Your teaching skill is very good. Please make a full series on PySpark, it'll be helpful for so many aspiring data engineers.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Appreciate the kind words @abusayed.mondal, more coming soon, stay tuned :)
@muhammadzakiahmad8069
@muhammadzakiahmad8069 20 days ago
Please make one on AWE as well
@afaqueahmad7117
@afaqueahmad7117 9 days ago
You mean AWS?
@muhammadzakiahmad8069
@muhammadzakiahmad8069 9 days ago
@@afaqueahmad7117 Sorry, it was supposed to be AQE (Adaptive Query Execution).
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Complete details on AQE are here :) ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-bRjVa7MgsBM.html
@muhammadzakiahmad8069
@muhammadzakiahmad8069 9 days ago
@@afaqueahmad7117 Thanks🌟
@Ravi_Teja_Padala_tAlKs
@Ravi_Teja_Padala_tAlKs 21 days ago
Excellent 🎉 👍 appreciate your effort
@leonardopetraglia6040
@leonardopetraglia6040 21 days ago
Correct me if I'm wrong, but these calculations consider the execution of only one job at a time. How do the calculations change when there are multiple jobs running in a cluster, as often happens?
@snehitvaddi
@snehitvaddi 22 days ago
Buddy! You got a new sub here. Loved your detailed explanation. I see no one else explaining the query plan in this much detail, and I believe this is the right way of learning. But I would love to see an entire Spark series.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Thank you @snehitvaddi for the kind appreciation. A full-fledged, in-depth course on Spark coming soon :)
@snehitvaddi
@snehitvaddi 9 days ago
@@afaqueahmad7117 Most awaited. Keep up the 🚀
@piyushkumawat8042
@piyushkumawat8042 22 days ago
Why give such a large fraction (0.4) to user memory? In the end, when the transformations are performed in a particular stage, whether with a user-defined function or any other function, only execution memory will be used. So what exactly is the role of user memory?
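For context on the 0.4 fraction: in Spark's on-heap model, roughly 300 MB is reserved, and the rest of the heap is split by `spark.memory.fraction` (default 0.6) into unified (execution + storage) memory, leaving the remaining 0.4 as user memory, which holds user data structures, UDF objects, and internal metadata rather than shuffle or cache data. A quick sketch of the arithmetic (the executor heap size below is illustrative):

```python
# Sketch of Spark's on-heap memory split (per executor), following the
# standard formula: ~300 MB is reserved, and the remaining heap is split
# by spark.memory.fraction (default 0.6) into unified (execution + storage)
# memory, leaving 0.4 as user memory for user data structures, UDF
# objects, and internal metadata. Heap size below is illustrative.

RESERVED_MB = 300

def memory_split(heap_mb, memory_fraction=0.6):
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction        # execution + storage
    user = usable * (1 - memory_fraction)     # "user memory"
    return unified, user

unified_mb, user_mb = memory_split(8192)      # 8 GB executor heap
```

So for an 8 GB heap, roughly 4.7 GB goes to unified memory and 3.1 GB to user memory; heavy UDFs or large collected objects live in that 0.4 slice, which is why it is not negligible.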
@fitness_thakur
@fitness_thakur 24 days ago
Could you please make a video on stack overflow errors, like in what scenarios they can occur and how to fix them?
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Are you referring to OOM (out of memory errors) - Driver & Executor?
@fitness_thakur
@fitness_thakur 8 days ago
@@afaqueahmad7117 No. Basically, when we have multiple layers under a single session, the stack memory gets full, so we have to make sure we use one session per layer. E.g., suppose we have 3 layers (internal, external, combined); if you run these in a single session, it will throw a StackOverflow error wherever the stack overflows. We tried increasing the stack size as well, but that didn't work. Hence, in the end, we came up with the approach of running one layer, then closing the session, and so on.
@dasaratimadanagopalan-rf9ow
@dasaratimadanagopalan-rf9ow 26 days ago
Thanks for the content, really appreciate it. My understanding is that AQE takes care of shuffle partition optimization and we don't need to intervene manually (starting with Spark 3) to optimize shuffle partitions. Could you clarify this, please?
@ashutoshpatkar4891
@ashutoshpatkar4891 28 days ago
Hey man, learnt a lot from the video. Please help me with this doubt about example 2: you said total executors = 44/4 = 11. But shouldn't we think machine by machine? Each machine can have 15/4 ≈ 3 executors with 4 cores each, giving a total of 3 executors * 3 nodes = 9. In your workout, it seems there would be an executor using some cores from one node and some from another. Am I wrong in my thought process somewhere?
@ajaydhanwani4571
@ajaydhanwani4571 28 days ago
Sorry if I am asking a very basic question: can we set executors per Spark job or per Spark cluster? Also, how do we set this up, with coding examples and all?
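For reference on the question above: executor resources are requested per application (each job submission), not baked into the cluster; the cluster manager then allocates them. A hedged sketch — the flag values and script name are illustrative, not a recommendation:

```shell
# Executor resources are requested per Spark application at submit time;
# the cluster manager then allocates them. All values are illustrative,
# and my_job.py is a placeholder script name.
spark-submit \
  --num-executors 11 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_job.py
```

The same can be done in code via `SparkSession.builder.config("spark.executor.instances", "11")`, and with dynamic allocation (`spark.dynamicAllocation.enabled=true`) Spark scales the executor count itself.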
@dudechany
@dudechany 29 days ago
Every time I come here before attending an interview, I try to give this video a like, but end up realising that I already did it earlier. Best video on this topic on the whole internet.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
This means a lot to me @dudechany, I really thank you for the generous and kind appreciation :)
@PratikPande-k5h
@PratikPande-k5h 29 days ago
Really appreciate your efforts. This was very easy to understand and comprehensive as well.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
@PratikPande-k5h Glad you're finding it easy to understand :)
@venkatyelava8043
@venkatyelava8043 1 month ago
One of the cleanest explanations of the internals of Spark I have ever come across. Really appreciate all the effort you are putting into making these videos. If you don't mind, may I know which text editor you are using when pasting the physical plan?
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Many thanks for the kind words @venkatyelava8043, means a lot. On the text editor - I'm using Notion :)
@senthilkumarpalanisamy365
@senthilkumarpalanisamy365 1 month ago
Excellent and clear-cut explanation, thanks so much for taking the time and preparing the content. Please do more.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Appreciate it @senthilkumarpalanisamy365. More coming soon, stay tuned :)
@ridewithsuraj-zz9cc
@ridewithsuraj-zz9cc 1 month ago
This is the most detailed explanation I have ever seen.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Appreciate it man @ridewithsuraj-zz9cc :)
@satyajitmohanty5039
@satyajitmohanty5039 1 month ago
Explanation is so good
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Thank you @satyajitmohanty5039 :)
@rgv5966
@rgv5966 1 month ago
Hey @Afaque, great content as usual, but I thought this video could have been a little more concise. Great work anyway!
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Thank you @rgv5966 for the appreciation. Tried my best to keep it concise, but will take your feedback :)
@nikhillingam4630
@nikhillingam4630 1 month ago
Consider a scenario where my first shuffle's data size is 100 GB, so giving more shuffle partitions makes sense; but in the last shuffle, the data size is drastically reduced to 10 GB. According to the calculations, giving 1500 shuffle partitions would benefit the first shuffle but not the last. How does one approach this scenario?
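One practical answer to the varying-shuffle-size scenario is to size `spark.sql.shuffle.partitions` for the largest shuffle and let Adaptive Query Execution (the topic of the AQE video linked earlier in this thread) coalesce the smaller shuffles' partitions at run time. A minimal config sketch — these are standard Spark 3.x keys, but `spark` is assumed to be an existing SparkSession and the 1500 value is illustrative:

```python
# With AQE enabled (Spark 3.x), shuffle partitions sized for the largest
# shuffle get coalesced automatically for the smaller shuffles, so one
# static number no longer over-partitions them. `spark` is an existing
# SparkSession; the 1500 value is illustrative.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "1500")  # sized for the first, 100 GB shuffle
```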
@nikhillingam4630
@nikhillingam4630 1 month ago
It's very useful ❤
@tanushreenagar3116
@tanushreenagar3116 1 month ago
perfect video
@RishabhTakkar-o6l
@RishabhTakkar-o6l 1 month ago
How do you access this Spark UI?
@tejasnareshsuvarna7948
@tejasnareshsuvarna7948 1 month ago
Thank you very much for the explanation. But I want to know what is your source of knowledge. Where do you learn these things?
@AadidevSooknananNXS
@AadidevSooknananNXS 1 month ago
How do you determine that the optimal bucket size is 200MB?
@AadidevSooknananNXS
@AadidevSooknananNXS 1 month ago
Your channel is amazing man, you explain things better than the official docs do!
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Appreciate the kind words @AadidevSooknananNXS, means a lot :)
@mohitupadhayay1439
@mohitupadhayay1439 1 month ago
Really waiting to see if you can add some real world use cases to your videos to strengthen our understanding. It will be appreciated a lot man!
@bijjigirisupraja8021
@bijjigirisupraja8021 1 month ago
Bro, do videos regularly on Spark, it will be very helpful. Thank you.
@rgv5966
@rgv5966 1 month ago
Great explanation!
@HarbeerKadian-m3u
@HarbeerKadian-m3u 1 month ago
Amazing. This is just too good. Will share with my team also.
@afaqueahmad7117
@afaqueahmad7117 9 days ago
Really appreciate it @HarbeerKadian-m3u :)