
Incremental Data Load in Hive | Big data interview questions 

GK Codelabs
14K subscribers
48K views

Hello guys,
In this video I have explained one of the most important big data interview questions: how to handle incremental data loads in Apache Hive. I have explained this Hive use case from the very basics and in detail.
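For reference, here is a minimal HiveQL sketch of the pattern the video walks through, reconstructed from the discussion in the comments below (the table name inc_table and the columns empid, name, salary, and modified_date are assumptions for illustration): daily extracts are appended as files under an external table's HDFS location, and a view returns only the latest version of each record by joining on the per-key max(modified_date).

    -- External table over the HDFS folder where daily CSV extracts land.
    CREATE EXTERNAL TABLE inc_table (
      empid INT,
      name STRING,
      salary INT,
      modified_date DATE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hive/warehouse/inc_load';

    -- View that keeps only the most recent row per empid.
    CREATE VIEW emp_latest AS
    SELECT t1.*
    FROM inc_table t1
    JOIN (SELECT empid, MAX(modified_date) AS max_modified
          FROM inc_table
          GROUP BY empid) s
      ON t1.empid = s.empid
     AND t1.modified_date = s.max_modified;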
Below are the other relevant videos from my channel:
Spark Streaming with kafka and HBase
---------------------------------------------------
• Apache Kafka with Spar...
Spark Streaming with kafka
---------------------------------------------------
• Spark streaming with K...
Installation of kafka on Cloudera quickstart VM
----------------------------------------------------------------------------
• Installing Apache Kafk...

Published: 21 Aug 2024

Comments: 117
@ArtAlive
@ArtAlive 4 years ago
Thanks a lot, that was an awesome explanation. I was searching for the answer to this. Thank you so much.
@venkatramana7980
@venkatramana7980 4 years ago
Bro, really you are a hero. Helping others without expecting anything is a really big thing. Thanks a lot, bro.
@kumarrk6343
@kumarrk6343 5 years ago
9:40 to 12:40. No words. What a simple explanation, really mind-blowing. I have been rejected in more than 25 interviews so far even though I have 2 years of genuine big data experience. Now I know where I was lacking. I can definitely crack my next interview with the help of your videos.
@anshulbisht4130
@anshulbisht4130 4 years ago
What did you do for those 2 years?
@sivak9750
@sivak9750 4 years ago
Best and simplest explanation. I didn't find this solution anywhere. Thanks a lot!
@christiandave100
@christiandave100 3 years ago
Why the extra subquery t2? We can remove the second subquery, e.g. select t1.* from (select * from inc_table) t1 join (select empid, max(modified_date) max_modified from inc_table group by empid) s on t1.empid = s.empid and t1.modified_date = s.max_modified
@DeepakSharma_youtube
@DeepakSharma_youtube 4 years ago
Very good explanation, but I have a few questions, because I've used a slightly different approach in our prod environment and this approach also won't solve our issue (Q3 below). Q1: At 14:42 you didn't update the date to 2019-04-23, but it shows in your view. How? Q2: How would you handle DELETEs on the source system? Q3: As we approach day 30, or day 365, etc., the main external table becomes huge. Is there a way to 'reset' that base table at some point so it doesn't grow every time?
@nareshj6370
@nareshj6370 2 years ago
I have the same question :)
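One common answer to Q3, offered here as a hedged sketch rather than anything from the video: periodically materialize the deduplicated view into a snapshot and rewrite the external table's location with it, so the accumulated daily files can be retired (names and paths follow the sketch above and are assumptions).

    -- 1. Freeze the current deduplicated state.
    CREATE TABLE emp_snapshot AS SELECT * FROM emp_latest;

    -- 2. Replace the raw files with the compacted snapshot so the
    --    base location stops growing; older daily files can be archived.
    INSERT OVERWRITE DIRECTORY '/user/hive/warehouse/inc_load'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT * FROM emp_snapshot;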
@aa-kj9zm
@aa-kj9zm 4 years ago
This seems like real-world work. I am learning Hadoop but lost my way because I am not taking any training. This is very helpful; I will check out all your videos. Thanks for this awesome video.
@subramanianchenniappan4059
@subramanianchenniappan4059 4 years ago
I am a Java developer with Hadoop hands-on experience. I will watch all your videos, thanks for your help.
@hemanthreddykolli
@hemanthreddykolli 3 years ago
This video is very helpful for understanding the CDC concept. Thanks for sharing your knowledge.
@pravinmahindrakar6144
@pravinmahindrakar6144 3 months ago
Thanks a lot. I think we can use the row_number window function to get the updated records, partitioning by emp_id and ordering by date desc, then filtering for row_number = 1.
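For comparison, the window-function variant described above would look roughly like this (names reused from the sketches earlier on this page, so treat them as assumptions):

    -- Latest row per empid via row_number(); note the DESC so that
    -- row number 1 is the newest version of each record.
    SELECT empid, name, salary, modified_date
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY empid
                                ORDER BY modified_date DESC) AS rn
      FROM inc_table
    ) t
    WHERE rn = 1;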
@ramkumarananthapalli7151
@ramkumarananthapalli7151 1 year ago
Quite useful!! Thank you for making it 💐💐
@RaviKumar-uu4ro
@RaviKumar-uu4ro 5 years ago
Tons of thanks for your valuable videos. Really marvelous and incomparable to any others.
@gauravpathak7017
@gauravpathak7017 4 years ago
Just wow! This is the best material anyone can find on incremental load in Hive. Cheers :)
@narasimharao3665
@narasimharao3665 3 years ago
If we do it like this, duplicate records will pile up in the underlying files every time and the file size will grow enormously; and whenever we run the view, its subqueries will also hurt performance. Instead we can use sqoop's incremental option to import the incremental data into an HDFS directory or cloud storage (like AWS S3).
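For context, the sqoop route suggested here would look something like the following; the connection string, credentials, and column names are placeholders, not details from the video:

    # Pull only rows whose modified_date is newer than the last run and
    # merge them into the existing HDFS data on the key column.
    sqoop import \
      --connect jdbc:mysql://dbhost/hr \
      --username etl_user -P \
      --table employees \
      --target-dir /user/hive/warehouse/inc_load \
      --incremental lastmodified \
      --check-column modified_date \
      --last-value '2019-04-22 00:00:00' \
      --merge-key empid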
@sunshinemoon922
@sunshinemoon922 2 years ago
Awesome video sir. Very useful for interviews. Thank you very much.
@svdfxd
@svdfxd 5 years ago
I am preparing for big data interviews and such an interview series would be really helpful. Please add Spark interview questions as well. The way you explained patiently with an example is really good.
@GKCodelabs
@GKCodelabs 5 years ago
Sure, my interview series will cover a wide range of interview questions across all big data technologies. Hive, Spark, HBase, and data warehousing concepts will be a major part, as these are the most in-demand skills in most interviews. #KeepWatching :)
@udaynayak4788
@udaynayak4788 2 years ago
Thank you so much for the detailed explanation.
@ajaythedaredevil7220
@ajaythedaredevil7220 4 years ago
1. Can we use a MERGE statement for simplification? 2. What if an employee id has been deleted in the new data set and we don't want it in our final table? I can see the join will keep the removed employee id as well. Many thanks!
@abhiganta
@abhiganta 4 years ago
I have the same doubt. What if some records are deleted from the source DB and we need to remove those records in Hive?
@arindampatra6283
@arindampatra6283 4 years ago
That means the new data has all the employees' info, and you can simply filter on the latest date 😊
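On both points: Hive does support a MERGE statement on ACID (transactional) tables since Hive 2.2, and it can handle deletes in the same pass. A hedged sketch, assuming the incoming batch carries an op_type flag marking source-side deletes (that flag is an assumption, not something shown in the video):

    -- emp_target must be a transactional table; emp_daily_batch holds
    -- the day's changes with op_type = 'D' for deleted rows.
    MERGE INTO emp_target t
    USING emp_daily_batch b
      ON t.empid = b.empid
    WHEN MATCHED AND b.op_type = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET
      name = b.name,
      salary = b.salary,
      modified_date = b.modified_date
    WHEN NOT MATCHED THEN INSERT
      VALUES (b.empid, b.name, b.salary, b.modified_date);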
@ririraman7
@ririraman7 2 years ago
Brilliant video! Much needed, and to the point!
@adityapratapsingh7649
@adityapratapsingh7649 3 years ago
Thanks for the detailed video. I have one question: we can do the same with window functions, right, using row_number()? So which approach is the more optimized one? select * from (select *, row_number() over (partition by id order by modifiedDate desc) as rk from v1) a where rk=1
@dhivakarsathya3918
@dhivakarsathya3918 2 years ago
I would prefer the group by and inner join that GK used, which runs much faster than window functions in Hive. Better to use sqoop import if possible; otherwise the HDFS storage size will become massive and your view will take a lot of time to process.
@rajnimehta9189
@rajnimehta9189 3 years ago
Awesome explanation
@sourav7413
@sourav7413 3 years ago
Thanks for your informative video.
@astropanda1623
@astropanda1623 10 months ago
Very good explanation
@ajinkyahatolkar6518
@ajinkyahatolkar6518 1 year ago
Quite an informative video. Thanks!
@electricalsir
@electricalsir 5 months ago
Thanks man, you are amazing 😍❤❤❤
@bsrameshonline
@bsrameshonline 4 years ago
Very good explanation of incremental loading.
@bobbyvenkatesan3657
@bobbyvenkatesan3657 4 years ago
Thanks for sharing these kinds of videos. Very helpful.
@sumitkumarsahoo
@sumitkumarsahoo 3 years ago
Thanks a lot! I was actually looking for something like this for loading incremental data.
@bigdatabites6551
@bigdatabites6551 2 years ago
Great job, bro!
@naveenvinayak1088
@naveenvinayak1088 4 years ago
Well explained. Helping others without expectations.
@puneetbhatia
@puneetbhatia 2 years ago
Explained amazingly. Thank you so much!
@debatrii
@debatrii 3 years ago
Very good, well explained, thanks.
@tallaravikumar4560
@tallaravikumar4560 2 years ago
Good explanation, but your text editor is hard to read; full black with a white font would have been clearer. Also, what if the modified date is not updated?
@sagarsinghrajpoot6788
@sagarsinghrajpoot6788 4 years ago
I got this real-time case. Thanks :) Now we know how to handle incremental data, but do you have any video on a different use case, a data transformation use case in Hive (applying business transformations)? If yes, please tell me. I became a fan of you, man. From now on I will also practise like this on my own.
@kilarivenkatesh9844
@kilarivenkatesh9844 3 years ago
Nice explanation, bro.
@sumitkhandwekar6021
@sumitkhandwekar6021 2 years ago
Amazing brooooo
@ArunKumar-gw2ux
@ArunKumar-gw2ux 4 years ago
Good explanation! But this won't work on enterprise-scale data; it is not a scalable solution. For instance, if the incremental data is retained for 12 months and updates arrive every day, this deduping will take a long time to complete.
@kiranmudradi26
@kiranmudradi26 4 years ago
Can you please let us know a better solution for such scenarios? Thanks.
@ANUKARTHIM
@ANUKARTHIM 4 years ago
Thanks for the video, good work. Looking for more videos on HBase: how regions work in HBase, how to define regions while creating an HBase table, and more, bro.
@M-Fash0070
@M-Fash0070 1 year ago
Please make one more video on RDBMS to Hive while maintaining history, updated data, and new data.
@Sumit261990
@Sumit261990 3 years ago
Nice video, really useful. Thanks a lot.
@the_high_flyer
@the_high_flyer 4 years ago
Super, thanks a ton for your video.
@anilpatil6783
@anilpatil6783 5 years ago
Thank you GK. This incremental data load is the basis of millions of ETL jobs. Thank you for such a pitch-perfect explanation. I have a question: how is the logic after 9:40 put into production? I mean, how is it actually made to run every day? Here I can see the view only. Is this view used to load data from staging to some other layer each day?
@rathnakarlanka2624
@rathnakarlanka2624 3 years ago
Thank you GK. If we miss the incremental data extract a couple of times and we use the max date to join, then there is a chance of missing records, right? So how do we overcome this problem?
@rakshithbs882
@rakshithbs882 3 years ago
Hi, very nice explanation. I have one doubt: what if we use only the S alias subquery? Will it return the same output?
@manikandanl4909
@manikandanl4909 3 years ago
We are not selecting all columns in the S alias subquery, so we join with the t1 alias to get all the columns.
@junaidansari675
@junaidansari675 4 years ago
Very helpful. Please make videos about other components and theory, and Hadoop admin job related videos.
@rajeshkumardash611
@rajeshkumardash611 3 years ago
@GK Codelabs This may not work if the data has deleted records.
@deepikakumari5369
@deepikakumari5369 4 years ago
Nice explanation. Please upload a video on how to handle the many small files Hive generates as output. Thank you :)
@Kutub2005
@Kutub2005 2 years ago
Hello GK Codelabs, thanks for this awesome video. Would you please make a video on adding a column where the modified date will be reflected? The scenario is that I don't have a modified_date column in my existing Hive table, so if I want to use the strategy you have shown in this video, how do I add a modified_date column to the existing Hive table and the HDFS data?
@gandlapentasabjan9115
@gandlapentasabjan9115 3 years ago
Hi bro, very helpful videos, thank you so much for sharing this with us. I have a small doubt: suppose we don't have a date column, then how do we do this?
@vru5696
@vru5696 3 years ago
Awesome video. Any videos related to SCD and SCD revert in Hive? Please share a link.
@sagarsinghrajpoot6788
@sagarsinghrajpoot6788 4 years ago
You are awesome, man ;) I liked your videos; I feel like I am watching Netflix, they're so easy to understand :)
@arunkumar-th8vy
@arunkumar-th8vy 3 years ago
Can you please help me? If we don't have any date column, after loading day 2 into my history table (day 1), how do I make sure it doesn't contain any duplicates?
@prashantahire143
@prashantahire143 5 years ago
Good explanation! How do you perform an incremental Hive load from HDFS for a partitioned table? The table does not have a date/timestamp column.
@GKCodelabs
@GKCodelabs 5 years ago
Hi Prashant, thanks for your comment ☺️ We can use many other internal checkpoints in such cases. Thanks for sharing the scenario; I will surely explain this in one of my coming videos. #KeepWatching
@arindampatra6283
@arindampatra6283 4 years ago
I think you would have got your answer by now? If not, let's discuss. What is the partition column? How are you loading new data into that table?
@mahesh.h1b339
@mahesh.h1b339 11 months ago
@@GKCodelabs What type of join is this, bro?
@MrManish389
@MrManish389 5 years ago
Sir, explained easily and simply.
@mahammadshoyab9717
@mahammadshoyab9717 4 years ago
I hope I'm not asking too much, but if you could explain one end-to-end scenario, from pulling from Kafka to HDFS landing and Hive loading, it would be very helpful for people who are struggling to clear interviews.
@naveengupta7268
@naveengupta7268 4 years ago
Great Explanation!
@abhiganta
@abhiganta 4 years ago
Hi sir, what is the need to create t2? We can directly query (select empid, max(modDate) from inc_table group by empid) s and then join t1 and s. Please correct me if I'm wrong.
@vermad6233
@vermad6233 1 year ago
Same question from my side!
@ravikumark6746
@ravikumark6746 4 years ago
Great, sir. Thank you.
@akshaychoudhari5641
@akshaychoudhari5641 1 year ago
How is incremental load in Hive different from the incremental load we do with sqoop? Can you explain?
@sudhakarsubramani1528
@sudhakarsubramani1528 1 year ago
Thank you.
@rajeshreddy906
@rajeshreddy906 4 years ago
Hi, your videos are great. If you don't mind, could you please post a video on the sort-merge-bucket (SMB) join?
@kumarraja4759
@kumarraja4759 4 years ago
Nice explanation, but I have a question here: is the final join query required to pick the latest records? Selecting all the columns with max(modified_date) would give the desired output, I believe. Correct me if I'm wrong.
@manikandanl4909
@manikandanl4909 3 years ago
Bro, when we aggregate on some column (here mod_date) we need to specify all the other columns in the GROUP BY. If we have hundreds of columns, we would need to specify every one of them. That's why we join back to the original table.
@ririraman7
@ririraman7 2 years ago
@@manikandanl4909 Thank you for clearing the doubt brother!
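To make the explanation above concrete, a small sketch (column names assumed from earlier comments): aggregating directly forces every selected column into the GROUP BY, so two versions of the same employee that differ in, say, salary would both survive, which is why the result is joined back to the full table instead.

    -- Not what we want: name and salary must appear in the GROUP BY,
    -- so rows that differ in those columns are never collapsed.
    SELECT empid, name, salary, MAX(modified_date)
    FROM inc_table
    GROUP BY empid, name, salary;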
@Sagar-gi5zq
@Sagar-gi5zq 1 year ago
What if we don't have a modified_date column?
@ravikirantuduru1061
@ravikirantuduru1061 4 years ago
Can you create videos on Spark joins when the data is skewed, on joining small data with large data, and explain how Spark does sort-merge and shuffle joins?
@PramodKhandalkar5
@PramodKhandalkar5 3 years ago
Thanks man.
@nlaxman5091
@nlaxman5091 4 years ago
Hi bro, please do one video on how to choose memory, cores, and executors in a Spark cluster.
@ambikaprasadbarik6400
@ambikaprasadbarik6400 3 years ago
Thanks a lot!
@arunsakkumar8463
@arunsakkumar8463 3 years ago
Please post the top interview questions for Hive.
@tarunreddy5917
@tarunreddy5917 1 year ago
@GK Codelabs Is there any difference between incremental data and delta data?
@routhmahesh9525
@routhmahesh9525 3 years ago
Can you please make a video on handling small files in Apache Spark?
@snagendra5415
@snagendra5415 1 year ago
Could you do a video on the small-file problem?
@vikky7480
@vikky7480 4 years ago
Please make a video on accumulators and broadcast variables, as well as aggregateByKey() with a good example.
@GKCodelabs
@GKCodelabs 4 years ago
Awesome Vicky, somehow you guessed what the next video is going to be about ☺️☺️ It's the very next video, which you requested. Coming soon (in a couple of days) ☺️☺️☺️💐
@swapnilpatil1422
@swapnilpatil1422 1 year ago
I was asked this question twice...
@bhushanmayank
@bhushanmayank 5 years ago
I have a doubt: how long are we going to store the daily files in HDFS? Don't you think the performance of the view is going to suffer as more CSV files accumulate in the HDFS location it runs on top of? Is there any way to keep only the relevant records in a fresh file for Hive to process and move the rest to cold storage?
@souravsardar
@souravsardar 4 years ago
@gkcodelabs Can you please make some similar videos on PySpark, with use cases asked in interviews?
@haranadhsanka9699
@haranadhsanka9699 5 years ago
Really appreciated. Can you also explain the Spark way of doing an incremental load? Thanks in advance.
@GKCodelabs
@GKCodelabs 5 years ago
Sure Haranadh, I will explain it in one of my coming videos. #KeepWatching ☺️
@rakshithbs882
@rakshithbs882 3 years ago
@@GKCodelabs Hi, very nice explanation. I have one doubt: what if we use only the S alias subquery? Will it return the same output?
@deepikakumari5369
@deepikakumari5369 4 years ago
Sir, will you please give me the answer to this: what approach should we take to load thousands of small 1 KB files using Hive? Do we load them one by one, or should we merge them together and load them at once, and how do we do this?
@ririraman7
@ririraman7 2 years ago
I believe Hive is not meant for small files!
@ravishankarrallabhandi531
@ravishankarrallabhandi531 2 months ago
How can we handle the case where source records are closed/deleted?
@saurav0777
@saurav0777 4 years ago
Is this the implementation for SCD type 2 as well?
@rohitaute9928
@rohitaute9928 1 year ago
What if we need to maintain versions in HBase?
@richalikhyani7204
@richalikhyani7204 3 years ago
Do you have any course playlist?
@gsp4420
@gsp4420 3 years ago
Hi, if we don't have a sequence id and the CSV/table data contains duplicates, but the total combination of row values is unique, how do we do the incremental load in this situation? Thank you.
@gsp4420
@gsp4420 3 years ago
And we will not get a load date, and there is no unique column in the source or target table.
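With no key column and no load date, about the only in-Hive option is exact-duplicate removal; a hedged sketch (table and view names assumed):

    -- Collapses rows that are identical across every column. Without a
    -- key or timestamp there is no way to tell which of two differing
    -- versions of a record is the newer one.
    CREATE VIEW dedup_v AS
    SELECT DISTINCT * FROM inc_table;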
@sathishanumaiah6907
@sathishanumaiah6907 4 years ago
Could you please explain the same process with RDBMS data instead of files?
@arupanandaprasad2202
@arupanandaprasad2202 3 years ago
How do you do an incremental load in Spark?
@naveenvinayak1088
@naveenvinayak1088 4 years ago
Can you do a video about Kafka?
@zeeshan42007
@zeeshan42007 4 years ago
Can you please share this Cloudera image? The one I downloaded from CDH is very heavy and I'm not able to work with it.
@BigDataWithSky
@BigDataWithSky 1 year ago
Why are you not uploading videos regularly?
@mahammadshoyab9717
@mahammadshoyab9717 4 years ago
Hi bro, how do you perform an incremental load when there are no primary key and datestamp columns in the table? Thanks in advance.
@arindampatra6283
@arindampatra6283 4 years ago
If there's no primary key, the data is gibberish 😊
@rakshithbs882
@rakshithbs882 3 years ago
@@arindampatra6283 Hi, very nice explanation. I have one doubt: what if we use only the S alias subquery? Will it return the same output?
@seetharamireddybeereddy222
@seetharamireddybeereddy222 4 years ago
Can I know how to work with a staging table in Hive?
@arunkumarreddy9736
@arunkumarreddy9736 3 years ago
What if we don't have a date column? Can you please help?
@rohitsotra2010
@rohitsotra2010 5 years ago
What if we don't have any date column like modified_date?
@suhaskolaskar552
@suhaskolaskar552 4 years ago
You need to first learn about slowly changing dimensions; then you won't ask this question.
@kalyanis6886
@kalyanis6886 3 years ago
Hello, can you share the VM image?
@Shiva-kz6tn
@Shiva-kz6tn 4 years ago
16:10 You don't need to ask that :)
@dineshughade6570
@dineshughade6570 4 years ago
The screen needs to be clearer; I am barely managing to see your screen.