
Upsert data to any Distributed Storage using Apache Spark | Data Engineering Interview question 

The Big Data Show
108K subscribers
3.3K views

In this video, I have discussed a way of upserting data to any distributed storage using Spark.
So what does upsert mean in Data Engineering? Let's find out.
Find the source code here
github.com/ankur334/sparkBoot...
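
Below is a minimal sketch of the approach discussed in the video, assuming a seller_id key and a creation_time column (paths and column names are illustrative; the repository above has the actual code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object UpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UpsertToDistributedStorage")
      .getOrCreate()

    // Snapshot already sitting in distributed storage (HDFS / S3 / GCS).
    val existing = spark.read.parquet("/data/sellers/current")
    // New batch that may add new keys or update existing ones.
    val incoming = spark.read.parquet("/data/sellers/incoming")

    // Keep only the latest record per key: newest creation_time wins.
    val latestFirst = Window.partitionBy("seller_id").orderBy(col("creation_time").desc)

    val upserted = existing.union(incoming)
      .withColumn("rn", row_number().over(latestFirst))
      .filter(col("rn") === 1)
      .drop("rn")

    // Rewrite the merged snapshot.
    upserted.write.mode("overwrite").parquet("/data/sellers/merged")

    spark.stop()
  }
}
```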
𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮:
🔅 Topmate - (Book 1:1 or other sessions)
topmate.io/ankur_ranjan
🔅 LinkedIn - / thebigdatashow
🔅 Instagram - / the_big_data_show
#DataEngineering #apachespark #bigdata #interview #dataengineerjob #careerswitch #job #scala

Published: 10 Aug 2022

Comments: 27
@toshmishra7695 · 1 year ago
Best person to look up to.
@ashishambre1008 · 1 year ago
That's a real-world problem and you have explained it quite nicely. Keep up the good work, buddy 👍🏻
@sayyadsalman9132 · 1 year ago
Thanks bro, you have really given a solution for a real problem.
@TheBigDataShow · 2 months ago
Thank you for your kind words. We are also posting multiple Data Engineering interview questions for practice in the community section of our YouTube channel. Visit our channel and go to the community tab to find all the practice questions. Do watch our other Data Engineering mock interviews too, by following the Mock Interview playlist; we have more than 25 of them.
@sahil0094 · 1 year ago
If seller ID is the primary key, then the union operation will fail.
@mohammedasif4349 · 1 year ago
Nice explanation, but in the video you say to use union, while in the code I think you applied unionAll, which is why you're getting duplicates. Please correct me if I'm wrong.
@TheBigDataShow · 1 year ago
I used union itself. I don't think we have duplicates. Please check it once.
@mohammedasif4349 · 1 year ago
@TheBigDataShow Please see the video at around 16:25. I took a screenshot but can't paste it here.
@TheBigDataShow · 1 year ago
Yup, I can see that. Thanks for correcting me. The Spark 2.x DataFrame documentation says that union will keep duplicates; I just read the documentation. We should apply dropDuplicates() or distinct() after it. But this does not change our concept, because in the end we apply row_number() and filter for the rows where row_number is 1. Thanks again 👏💯
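A minimal spark-shell sketch of the behaviour described above, with toy data and illustrative column names:

```scala
// In spark-shell, where `spark` is already available:
import spark.implicits._

val first  = Seq((1, "a"), (2, "b")).toDF("seller_id", "value")
val second = Seq((2, "b"), (3, "c")).toDF("seller_id", "value")

first.union(second).count()                              // 4: the duplicate (2, "b") survives
first.union(second).distinct().count()                   // 3: exact-duplicate rows removed
first.union(second).dropDuplicates("seller_id").count()  // 3: one row per key
```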
@mohammedasif4349 · 1 year ago
Yeah, the solution remains correct because of the window-rank function. Thank you for explaining this Spark documentation detail too 😅
@AA-gp2vv · 1 year ago
Sorry, but with this approach we have to read the full existing data before performing the operation, which increases cost and time (suppose the existing data is 500 TB and the new data is 600 TB). Not so good. Any other approaches?
@TheBigDataShow · 1 year ago
No need to read the full data. When you write data for the first time, maintain an audit bucket where you record the latest creation_time. On the second read, apply predicate pushdown filters on the source itself using that audit information.
@TheBigDataShow · 1 year ago
I read the full data the second time just to keep the concept easy to follow; a sketch of the incremental version is below.
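A rough sketch of the audit-bucket idea described above, assuming the watermark lives in a tiny Parquet dataset and the source carries a creation_time column (paths and names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

object AuditWatermarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AuditWatermark").getOrCreate()

    // 1. Load the watermark left behind by the previous run
    //    (a tiny one-row dataset, so this read is essentially free).
    val watermark = spark.read
      .parquet("/audit/sellers/last_run")
      .agg(max("creation_time"))
      .first()
      .getTimestamp(0)

    // 2. Read only rows newer than the watermark. With Parquet/ORC the
    //    filter can be pushed down to the file scan, so old data is skipped.
    val incoming = spark.read
      .parquet("/data/sellers/source")
      .filter(col("creation_time") > watermark)

    // ... merge `incoming` with the current snapshot as in the video ...

    // 3. Persist the new high-water mark for the next run.
    incoming
      .agg(max("creation_time").as("creation_time"))
      .write.mode("overwrite").parquet("/audit/sellers/last_run")

    spark.stop()
  }
}
```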
@AA-gp2vv · 1 year ago
Thanks man, I get it now. If possible, please make a video on it as well.
@TheBigDataShow · 1 year ago
Will do. This will all be covered in the full data pipeline. Stay tuned.
@rohankhoja7599 · 1 year ago
@TheBigDataShow I still didn't understand the audit concept. Any reference, please?
@rohansrivastwa827 · 1 year ago
Please don't use dark mode; it's not easy to see.
@TheBigDataShow · 1 year ago
Ohh ok, I will consider it. Hope you are still able to learn from this 🫣. Take notes, and watch the video despite the little difficulty.
@ankurranjan3218 · 1 year ago
Or please watch it on your laptop.
@mmohammedsadiq2483 · 7 months ago
Too much repetition in the explanation; edit and shorten the video.
@sahil0094 · 1 year ago
Good company? You're at Walmart? You should change your company first 😂
@TheBigDataShow · 1 year ago
Sorry, I didn't get you. I am working with a wonderful team and awesome managers, and I am very happy with my work. For me, a good team matters a lot more than any good company. I am good, buddy 😀
@ishansharma2099 · 1 year ago
@sahil pahuja, why are you jealous of someone else's success, brother? First get out of your service-based company, then comment on someone else.
@jhonsen9842 · 1 month ago
Looks like you are working at Gamazon or Gagul, which fired 10K employees.