
Upsert data to any Distributed Storage using Apache Spark | Data Engineering Interview question 

The Big Data Show
108K subscribers
3.3K views

In this video, I have discussed a way of upserting data to any distributed storage using Spark.
So what does upsert mean in Data Engineering? Let's find out.
Find the source code here
github.com/ankur334/sparkBoot...
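
Below is a minimal sketch of the approach discussed in the video, assuming a seller_id key and a creation_time column (paths and column names are illustrative; the repository above has the actual code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object UpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UpsertToDistributedStorage")
      .getOrCreate()

    // Snapshot already sitting in distributed storage (HDFS / S3 / GCS).
    val existing = spark.read.parquet("/data/sellers/current")
    // New batch that may add new keys or update existing ones.
    val incoming = spark.read.parquet("/data/sellers/incoming")

    // Keep only the latest record per key: newest creation_time wins.
    val latestFirst = Window.partitionBy("seller_id").orderBy(col("creation_time").desc)

    val upserted = existing.union(incoming)
      .withColumn("rn", row_number().over(latestFirst))
      .filter(col("rn") === 1)
      .drop("rn")

    // Rewrite the merged snapshot.
    upserted.write.mode("overwrite").parquet("/data/sellers/merged")

    spark.stop()
  }
}
```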
𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮:
🔅 Topmate - (Book 1:1 or other sessions)
topmate.io/ankur_ranjan
🔅 LinkedIn - / thebigdatashow
🔅 Instagram - / the_big_data_show
#DataEngineering #apachespark #bigdata #interview #dataengineerjob #careerswitch #job #scala

Published: 10 Aug 2022

Comments: 27
@toshmishra7695 · 1 year ago
Best person to look up to.
@ashishambre1008 · 1 year ago
That's a real-world problem and you have explained it quite nicely. Keep up the good work, buddy 👍🏻
@sayyadsalman9132 · 1 year ago
Thanks bro, you have really given a solution for a real problem.
@TheBigDataShow · 2 months ago
Thank you for your kind words. We are also posting multiple Data Engineering interview questions for practice in the community section of our YouTube channel. Visit our channel and go to the community tab to find all the practice questions. Do watch our other Data Engineering mock interviews too, by following the Mock Interview playlist; we have more than 25 of them.
@sahil0094 · 1 year ago
If seller ID is the primary key, then the union operation will fail.
@mohammedasif4349 · 1 year ago
Nice explanation, but in the video you say to use union, while in the code I think you applied unionAll, which is why you're getting duplicates. Please correct me if I'm wrong.
@TheBigDataShow · 1 year ago
I used union itself. I don't think we have duplicates. Please check it once.
@mohammedasif4349 · 1 year ago
@TheBigDataShow Please see the video at around 16:25. I took a screenshot but can't paste it here.
@TheBigDataShow · 1 year ago
Yup, I can see that. Thanks for correcting me. The Spark 2.x DataFrame documentation says that union will keep duplicates; I just read the documentation. We should apply dropDuplicates() or distinct() after it. But this does not change our concept, because in the end we apply row_number() and filter for the rows where row_number is 1. Thanks again 👏💯
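A minimal spark-shell sketch of the behaviour described above, with toy data and illustrative column names:

```scala
// In spark-shell, where `spark` is already available:
import spark.implicits._

val first  = Seq((1, "a"), (2, "b")).toDF("seller_id", "value")
val second = Seq((2, "b"), (3, "c")).toDF("seller_id", "value")

first.union(second).count()                              // 4: the duplicate (2, "b") survives
first.union(second).distinct().count()                   // 3: exact-duplicate rows removed
first.union(second).dropDuplicates("seller_id").count()  // 3: one row per key
```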
@mohammedasif4349 · 1 year ago
Yeah, the solution remains correct because of the window-rank function. Thank you for explaining this Spark documentation detail too 😅
@AA-gp2vv · 1 year ago
Sorry, but with this approach we have to read the full existing data before performing the operation, which increases cost and time (suppose the existing data is 500 TB and the new data is 600 TB). Not so good. Any other approaches?
@TheBigDataShow · 1 year ago
No need to read the full data. When you write data for the first time, maintain an audit bucket where you record the latest creation_time. On the second read, apply predicate pushdown filters on the source itself using that audit information.
@TheBigDataShow · 1 year ago
I read the full data the second time just to keep the concept easy to follow; a sketch of the incremental version is below.
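A rough sketch of the audit-bucket idea described above, assuming the watermark lives in a tiny Parquet dataset and the source carries a creation_time column (paths and names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

object AuditWatermarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AuditWatermark").getOrCreate()

    // 1. Load the watermark left behind by the previous run
    //    (a tiny one-row dataset, so this read is essentially free).
    val watermark = spark.read
      .parquet("/audit/sellers/last_run")
      .agg(max("creation_time"))
      .first()
      .getTimestamp(0)

    // 2. Read only rows newer than the watermark. With Parquet/ORC the
    //    filter can be pushed down to the file scan, so old data is skipped.
    val incoming = spark.read
      .parquet("/data/sellers/source")
      .filter(col("creation_time") > watermark)

    // ... merge `incoming` with the current snapshot as in the video ...

    // 3. Persist the new high-water mark for the next run.
    incoming
      .agg(max("creation_time").as("creation_time"))
      .write.mode("overwrite").parquet("/audit/sellers/last_run")

    spark.stop()
  }
}
```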
@AA-gp2vv · 1 year ago
Thanks man, I get it now. If possible, please make a video on it as well.
@TheBigDataShow · 1 year ago
Will do. This will all be covered in the full data pipeline. Stay tuned.
@rohankhoja7599 · 1 year ago
@TheBigDataShow I still didn't understand the audit concept. Any reference, please?
@rohansrivastwa827 · 1 year ago
Please don't use dark mode; it's not easy to see.
@TheBigDataShow · 1 year ago
Ohh ok, I will consider it. Hope you are still able to learn from this 🫣. Take notes, and watch the video despite the little difficulty.
@ankurranjan3218 · 1 year ago
Or please watch it on your laptop.
@mmohammedsadiq2483 · 7 months ago
Too much repetition in the explanation; edit and shorten the video.
@sahil0094 · 1 year ago
Good company? You're at Walmart? You should change your company first 😂
@TheBigDataShow · 1 year ago
Sorry, I didn't get you. I am working with a wonderful team and awesome managers, and I am very happy with my work. For me, a good team matters a lot more than any good company. I am good, buddy 😀
@ishansharma2099 · 1 year ago
@sahil pahuja, why are you jealous of someone else's success, brother? First get out of your service-based company, then comment on someone else.
@jhonsen9842 · 1 month ago
Looks like you are working at Gamazon or Gagul, which fired 10K employees.