How Salting Can Reduce Data Skew By 99%

Afaque Ahmad

Published: 4 Oct 2024

Comments: 30

@dhavaldalasaniya 3 months ago

This is excellent Spark video content. It is a perfect explanation of Spark performance concepts.

@afaqueahmad7117 2 months ago

Many thanks @dhavaldalasaniya, this means a lot, appreciate it :)

@sasadsasadsad 4 months ago

Precious 30 minutes, quality content

@afaqueahmad7117 3 months ago

Thank you @sasadsasadsad, appreciate it :)

@Wonderscope1 9 months ago

Thanks for the great content. You should have used the Salt Bae gesture when you said salting :) Is salting still a good approach if the join is between two large datasets with hundreds of millions of rows? Explode will increase the number of rows in one dataset. Let's say 100,000,000 rows * a salt number of 200 = 20,000,000,000 rows.
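
The row-count concern in this comment can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark code and not the method shown in the video: `salted_join` is a hypothetical helper, the tiny datasets are made up, and the join is done with an in-memory lookup. It shows that the skewed side gets one random salt per row, while the other side is replicated once per salt value, so its row count is multiplied by the salt number.

```python
import random


def salted_join(left, right, salt_number, seed=0):
    """Join two lists of (key, value) pairs on a salted key.

    Hypothetical helper for illustration only; not a Spark API.
    Returns the joined rows and the row count of the exploded side.
    """
    rng = random.Random(seed)
    # Skewed side: attach one random salt in [0, salt_number) to every row.
    salted_left = [((key, rng.randrange(salt_number)), value) for key, value in left]
    # Other side: replicate every row once per salt value (the "explode" step).
    exploded_right = [
        ((key, salt), value) for key, value in right for salt in range(salt_number)
    ]
    # Build a lookup on the salted key, then probe it with the skewed side.
    lookup = {}
    for salted_key, value in exploded_right:
        lookup.setdefault(salted_key, []).append(value)
    joined = [
        (salted_key[0], left_value, right_value)
        for salted_key, left_value in salted_left
        for right_value in lookup.get(salted_key, [])
    ]
    return joined, len(exploded_right)


left = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]  # key 1 is the hot key
right = [(1, "X"), (2, "Y")]
joined, exploded_rows = salted_join(left, right, salt_number=3)
print(joined)         # the same rows an unsalted join would produce
print(exploded_rows)  # len(right) * salt_number

# The arithmetic from the comment: 100 million rows with a salt number of 200.
print(100_000_000 * 200)
```

Each row on the skewed side still matches exactly one replica on the other side, so the join result is unchanged; the cost is the extra rows on the exploded side, which is what the comment is asking about.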

@gabriells9074 10 months ago

Hi Afaque, thank you for another great explanation. I have a question: since AQE splits skewed partitions into smaller ones, is salting still useful when AQE is enabled?
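
For readers who want to experiment with this question, here is a hedged sketch of the Spark SQL settings that control AQE's skew-join handling. The helper name `apply_aqe_skew_settings` is hypothetical, and the factor and threshold values are illustrative choices, not recommendations from the video. The function only assumes an object with a `conf.set(key, value)` method, which is how a `SparkSession` exposes runtime configuration.

```python
# Settings related to AQE skew-join handling (available from Spark 3.0 onwards).
AQE_SKEW_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Illustrative values only; tune them for your own data.
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}


def apply_aqe_skew_settings(spark, settings=AQE_SKEW_SETTINGS):
    """Apply each setting through spark.conf.set (hypothetical helper)."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return spark
```

With a real session this would be called as `apply_aqe_skew_settings(spark)` before running the join, which makes it possible to compare the same job with and without salting.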

@HamsAnsari 11 months ago

I have read and watched many things related to salting but this visual explanation just makes it really easy to comprehend it, plus really well articulated. Waiting for more videos to learn from :) Also could you recommend some books or other resources that have enabled you to attain this level of knowledge, Thanks!

@afaqueahmad7117 11 months ago

Hey @user-nz7uh1qo5o, many thanks for the kind words, it means a lot to me, and I'm glad to know that the video was helpful. Most of the content is based on my work experience + good ad-hoc content on Medium to which I could relate. My only humble suggestion is to be ruthless, get your hands dirty, question everything that's happening and search the internet if anything doesn't make sense :)

@arghyakundu8558 1 month ago

Excellent Content..!! Loved It. Such detailed explanation on Salting Technique with Graphical Representation.

@afaqueahmad7117 26 days ago

Appreciate it @arghyakundu8558 :)

@janb4637 1 month ago

I have never seen such a detailed explanation. Thank you very much @afaque Ahmad. Is there any way we can get the document?

@afaqueahmad7117 26 days ago

Appreciate it @janb4637, let me try and put it on GitHub :)

@rgv5966 2 months ago

Hey @Afaque, great content as usual, but I thought this video could be a little more concise. Great work anyway!

@afaqueahmad7117 1 month ago

Thank you @rgv5966 for the appreciation. Tried my best to keep it concise, but will take your feedback :)

@Sandeep-bl9ji 7 months ago

Nice explanation

@sonlh81 2 months ago

Not easy to understand, but it's great

@MuhammadAhmad-do1sk 4 months ago

Thanks for this. Love from 🇵🇰

@afaqueahmad7117 4 months ago

Appreciate it @MuhammadAhmad-do1sk, Love from India :)

@anubhavrastogi7463 6 months ago

Hi, can you please help me understand why we are considering a salt number of 3 or 4? Should this be equal to the number of shuffle partitions we have, or to the number of distinct values in our dataset? Please explain.
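
This sketch does not settle how to pick the salt number, but it shows what the number controls. It is a plain-Python simulation under stated assumptions: `largest_group` is a hypothetical helper, the key distribution is made up, and salts are drawn uniformly at random. With a salt number of N, the rows of a hot key are spread over N salted sub-keys, so the largest group shrinks to roughly 1/N of its original size.

```python
import random
from collections import Counter


def largest_group(keys, salt_number, seed=0):
    """Size of the largest (key, salt) group after random salting.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    groups = Counter((key, rng.randrange(salt_number)) for key in keys)
    return max(groups.values())


# Made-up skewed data: key 1 is the hot key.
keys = [1] * 10_000 + [2] * 100 + [3] * 100

for salt_number in (1, 3, 4, 16):
    # A salt number of 1 means no salting: the hot key stays in one group.
    print(salt_number, largest_group(keys, salt_number))
```

A larger salt number gives smaller groups but also more replicated rows on the other side of a join, so it is a trade-off rather than a value tied directly to the partition count or the number of distinct keys.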

@9figurelifestyle790 1 year ago

@afaqueahmad7117 - Great topic and amazing explanation - Looking forward to learning more from you. One suggestion is to create more videos related to designing idempotent data pipelines, backfilling missed window data, simulating different production failures and how to approach them, coz I see more people are doing interview focused videos. These topics will mentor both entry level and mid level Data engineers to gain confidence in Data Engineering field

@afaqueahmad7117 1 year ago

Glad you liked the video and the explanation! Really appreciate your feedback. Yes, all of that is in the roadmap, but for the upcoming year. The initial plan is to cover all aspects related to Performance Tuning + Foundations.

@SHUBHAM_707 4 months ago

What if the values are unique in the join, i.e. a 1-to-1 join? Will it create skew?

@akshaybaura 1 year ago

Can you show us whether salting in aggregations is really worth it? I'm skeptical; I think the additional shuffles that salting introduces will deteriorate performance.

@afaqueahmad7117 1 year ago

Hey @akshaybaura, there will indeed be a performance dip due to shuffles when using Salting, but, without Salting you're at the risk of either: a. Getting OOM (out of memory) errors. b. Your jobs running 5-10x slower because fewer resources (cores and memory) are being used while the others remain underutilised. However, even when using Salting, the performance largely depends on factors like the size of dataset and the correct use of Salt Number.
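
The extra shuffle discussed in this exchange comes from aggregating in two stages. Below is a minimal plain-Python sketch of that idea, assuming a simple sum per key; `salted_sum` is a hypothetical helper and the data is made up. Stage 1 groups by the salted key so the hot key's work is split into smaller pieces, and stage 2 drops the salt and combines the partial results.

```python
import random
from collections import defaultdict


def salted_sum(rows, salt_number, seed=0):
    """Two-stage sum per key using a random salt (illustrative helper)."""
    rng = random.Random(seed)
    # Stage 1: partial sums per (key, salt). In Spark this is the first shuffle.
    partial = defaultdict(int)
    for key, amount in rows:
        partial[(key, rng.randrange(salt_number))] += amount
    # Stage 2: drop the salt and combine the partial sums per key.
    # In Spark this is the second shuffle, but over far fewer rows.
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final), len(partial)


# Made-up skewed data: key 1 has 1,000 rows, key 2 has 10.
rows = [(1, 10)] * 1_000 + [(2, 5)] * 10
totals, partial_groups = salted_sum(rows, salt_number=4)
print(totals)          # same totals as a direct group-by sum
print(partial_groups)  # at most distinct_keys * salt_number partial groups
```

The second stage only sees one row per (key, salt) pair, which is why the added shuffle is usually much cheaper than the first one.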

@alokranjan7323 9 months ago

How do you calculate hash(1, 0) % 3?

@vinothvk2711 8 months ago

0%3

@afaqueahmad7117 8 months ago

@vinothvk2711 is right. As outlined in the video, we're assuming h(1, 0) = 0, so it's equal to 0 % 3 = 0
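
The calculation in this thread can be written out as a small sketch. `toy_hash` is a hypothetical stand-in chosen only so that h(1, 0) = 0, matching the assumption stated above; it is not the hash function Spark actually uses. The partition is then the hash value modulo the number of partitions.

```python
def toy_hash(key: int, salt: int) -> int:
    """Stand-in hash chosen so that toy_hash(1, 0) == 0 (an assumption)."""
    return (key - 1) + salt


def partition_for(key: int, salt: int, num_partitions: int) -> int:
    """Partition index for a salted key: hash(key, salt) % num_partitions."""
    return toy_hash(key, salt) % num_partitions


# The hot key 1 with salts 0, 1 and 2 spread over 3 partitions.
for salt in range(3):
    print((1, salt), "->", partition_for(1, salt, 3))
```

With this toy hash, the three salted versions of key 1 land in three different partitions, which is the effect salting is meant to have.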

@gudiatoka 4 months ago

After Spark 3.0, salting is not useful

@afaqueahmad7117 4 months ago

Hey @gudiatoka, I wish it was so, but just in case you're referring to AQE as the solution, it isn't always very helpful, so you still need to resort to salting.

@gudiatoka 4 months ago

@afaqueahmad7117 Yes, AQE and partitioning are useful. And in the case of a larger DataFrame, when the salting key is applied to the smaller DataFrame, it duplicates records, making it more skewed, so the concept of salting is not valid, at least for me... maybe it serves a different purpose.