Small file problem in Hadoop? In my view, if we have lots of small files in the cluster, that increases the burden on the NameNode, because the NameNode stores the metadata for every file. With lots of small files the NameNode has to keep track of all of their locations, and if the master goes down, the whole cluster goes down too.
That is right... In addition to this, Spark will also need to create more executor tasks. This creates unnecessary overhead and slows down your data processing.
When there are a lot of small files in Hadoop, NameNode performance can be impacted because it cannot process the metadata for so many files quickly. Hadoop is built for handling big data, so creating too many small files can end up degrading NameNode performance. I came across this problem in my project.
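If it helps, here is a minimal Scala sketch of one common mitigation (the input path, output path and partition count are all made up): compact the data into fewer, larger files before writing, so the NameNode has less metadata to track.

```scala
import org.apache.spark.sql.SparkSession

object SmallFilesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("small-files-demo").getOrCreate()

    // Hypothetical input: this would produce many tiny part files if we
    // wrote it back out with its default (high) partition count.
    val df = spark.read.parquet("hdfs:///data/events")

    // Coalescing to a small number of partitions before the write keeps
    // the output file count (and hence the NameNode metadata) low.
    df.coalesce(8)
      .write
      .mode("overwrite")
      .parquet("hdfs:///data/events_compacted")

    spark.stop()
  }
}
```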
This is a nice explanation, but you are considering physical (on-disk) partitions for Hive and in-memory partitions for Spark to show the difference in the number of files generated.
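For anyone mixing the two up, a small sketch of the difference (the paths and the country column are assumptions): repartition() only changes the DataFrame's in-memory partition count, while write.partitionBy() lays the data out physically as one directory per value.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-kinds-demo").getOrCreate()
val df = spark.read.parquet("hdfs:///data/users")   // hypothetical input path

// In-memory: 10 shuffle partitions, no directories created yet.
val inMemory = df.repartition(10)

// On-disk: one sub-directory per distinct country value, Hive-style layout.
df.write
  .mode("overwrite")
  .partitionBy("country")
  .parquet("hdfs:///data/users_by_country")
```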
Really appreciate @Data Savvy for the effort. I have a question: for the data searching/retrieval process in a partitioned table, can we (as an analogy) understand it the way element retrieval is done in a binary tree, and for a partitioned, bucketed table, the way a search is done in a nested binary tree? I am referring to the binary tree from data structures. Recently I followed a mock Big Data interview video on your channel and liked it a lot. If possible, please upload a few more such videos. Thanks :)
@@DataSavvy The way data is retrieved/searched in a partitioned Hive table, can we think of it as / correlate it with the way an element is retrieved in a binary tree (binary tree from data structures)? Not sure if this is a better version :)
Thanks for a very helpful video. My question here is: how can we perform optimisation using bucketing? Since in bucketing the data is hashed into different buckets, it will not be sorted, so if I am using a WHERE condition over a bucketed table, how do I avoid scanning irrelevant buckets the way I do with partitioning? In short, does a WHERE condition optimise reads on a bucketed table, and if not, what other optimisations does bucketing offer?
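Not an official answer, but here is a sketch of what I would try (table and column names are made up): bucket and sort on the column you filter on, then check explain() to see whether the scan prunes buckets. Recent Spark versions can skip buckets for an equality filter on the bucketing column.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-filter-demo").enableHiveSupport().getOrCreate()

// Hypothetical source data with an "age" column.
val people = spark.read.parquet("hdfs:///data/people")

// Write a bucketed (and sorted) table; bucketBy requires saveAsTable.
people.write
  .mode("overwrite")
  .bucketBy(8, "age")
  .sortBy("age")
  .saveAsTable("people_bucketed")

// Equality filter on the bucketing column; newer Spark versions can use
// this to skip buckets that cannot contain age = 20. Verify via explain().
val result = spark.table("people_bucketed").where("age = 20")
result.explain(true)
```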
Sir, could you please give one syntactic example comparing Hive partitioning/bucketing with Spark partitioning/bucketing? Also, I couldn't understand the last point of your summary; could you please give some more clarity on it?
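Not the author's example, just a rough side-by-side sketch (table names, columns and paths are made up): Hive declares partitioning and bucketing in the DDL, while Spark declares them on the DataFrame writer.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hive-vs-spark-syntax").enableHiveSupport().getOrCreate()

// Hive-style DDL: partitioning and bucketing declared in the table definition.
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales_hive (id INT, amount DOUBLE)
  PARTITIONED BY (country STRING)
  CLUSTERED BY (id) INTO 8 BUCKETS
  STORED AS PARQUET
""")

// Spark DataFrame API: the same idea expressed at write time.
val sales = spark.read.parquet("hdfs:///data/sales")   // hypothetical source
sales.write
  .mode("overwrite")
  .partitionBy("country")   // physical partition directories
  .bucketBy(8, "id")        // 8 hash buckets per partition
  .saveAsTable("sales_spark")
```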
Is the number of buckets in Spark = size of data / 128 MB? Am I correct? In that case, as above, we can't specify the number of buckets in Spark? In which case should we go for bucketing and in which case for partitioning? Can you give some examples?
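For what it's worth, in Spark the bucket count is passed explicitly to bucketBy; the "size / 128 MB" rule of thumb relates to input splits / read partitions, not to buckets. A tiny sketch (source path, table and column names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-count-demo").enableHiveSupport().getOrCreate()
val orders = spark.read.parquet("hdfs:///data/orders")   // hypothetical source

// The bucket count is not derived from the data size; it is chosen
// explicitly by the developer as the first argument to bucketBy.
orders.write
  .mode("overwrite")
  .bucketBy(16, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("orders_bucketed")
```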
I'll tell you one thing here. Partitioning is done based on a column and bucketing is done based on the rows (i.e., both concepts split data into multiple pieces, but partitioning splits based on a column and bucketing based on rows/records). Suppose we have data with values 1-100: we can bucket it like 1-25 in the first bucket, 26-50 in the second, 51-75 and 76-100 in the next ones, based on rows. But partitioning is based on a column. For example, if you have a column with population by year from 2010-2020, we split the data year-wise: 2010, 2011, 2012 ... 2020 into 10 partitions. If this is 100% correct, please someone comment. Don't feel bad; if I'm wrong, I'll correct it. Thank you.
Partitioning and bucketing are both done on a column... the only difference is how the records are grouped. I think your statement is right, but you are viewing these concepts in a more complex way.
How can I find out if my bucketing was really utilized by the query? Is it visible in the physical plan? Also, I believe that in the case of partitioning + bucketing, both the partition filter and the bucket filter should be in my query?
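One way to check, sketched here with made-up names (and assuming a table partitioned by country and bucketed by id already exists, like the sales_spark sketch above): run explain() and look at the FileScan node.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-plan-check").enableHiveSupport().getOrCreate()

// Assumes a hypothetical table "sales_spark" partitioned by country
// and bucketed by id.
val filtered = spark.table("sales_spark")
  .where("country = 'india' AND id = 42")   // partition filter + bucket filter

// The physical plan shows whether Spark pushed the partition filter
// (PartitionFilters) and pruned buckets (e.g. a SelectedBucketsCount entry
// in the FileScan node; the exact wording varies by Spark version).
filtered.explain(true)
```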
Hi Harjeet... Thanks for such an informative video. One quick question here: you chose the country column for partitioning, that's ok, and you chose the age column for buckets. Why did you choose the age column for bucketing? Why not the name column? Can we choose either of name and age, or is there some technicality behind choosing the bucketing column? If yes, please do comment.
It depends on the filter you want to apply. If you want to filter on age but you are bucketing by name, the problem will remain as it is and the bucketing won't make any sense.
What kind of problems will we face when there are a lot of small files in Hadoop? My answer: Hadoop is meant for handling a small number of large files, i.e., it handles big files with a low file count well. Hadoop won't give efficient results with lots of small files, because each read incurs seek time on the hard disk to fetch a record; with lots of small files this adds up and slows the system down. Moreover, the metadata also grows.
@data savvy, I observed on my local system with multiple cores that partitionBy and bucketBy both don't perform any shuffle; there is no Exchange in the plan. Is that why it is producing small files in both cases? Is that right? Will it perform a shuffle on a large cluster? I am just reading from a file and writing with partitionBy or bucketBy, with no transformations; in this case will there also be no shuffle at the cluster level?
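A sketch of the usual workaround I've seen (paths and the country column are assumptions): add an explicit repartition on the partition column before the write, so the shuffle happens once and each partition value ends up in fewer output files.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("write-shuffle-demo").getOrCreate()
val users = spark.read.parquet("hdfs:///data/users")   // hypothetical source

// write.partitionBy does not shuffle: every upstream task can emit a file
// for every country it happens to hold, which multiplies small files.
// Repartitioning by the partition column first forces one shuffle, so each
// country's rows land in a single task and far fewer output files.
users.repartition(col("country"))
  .write
  .mode("overwrite")
  .partitionBy("country")
  .parquet("hdfs:///data/users_by_country")
```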
When partitioning on a column would create small files, use bucketing without partitioning. Before doing a sort-merge join, you can also create bucketed tables and improve the performance of the join.
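Roughly like this, with made-up table names and columns: bucket both join sides on the join key with the same bucket count, and the later sort-merge join can skip the shuffle.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketed-join-demo").enableHiveSupport().getOrCreate()

val orders    = spark.read.parquet("hdfs:///data/orders")      // hypothetical sources
val customers = spark.read.parquet("hdfs:///data/customers")

// Bucket both sides on the join key with the same bucket count,
// so a later sort-merge join can avoid the shuffle.
orders.write.mode("overwrite").bucketBy(16, "customer_id").sortBy("customer_id").saveAsTable("orders_b")
customers.write.mode("overwrite").bucketBy(16, "customer_id").sortBy("customer_id").saveAsTable("customers_b")

val joined = spark.table("orders_b").join(spark.table("customers_b"), "customer_id")
joined.explain(true)   // check that the join sides have no Exchange step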
@@DataSavvy Thank you, sir, for the answer. If I use 4 buckets, when I run a SELECT query will it go to only one specific bucket, or will it search all buckets? Because in partitioning we have folders named by the value; in the case of bucketing, how will the query know which bucket to search in?
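Just to illustrate the idea (this is not the exact hash Spark or Hive uses internally): the engine computes hash(value) mod numBuckets for the bucketing column, so an equality filter maps to exactly one bucket file and the others can be skipped.

```scala
// Conceptual sketch only: a plain hashCode stands in for the engine's own
// hash function. For age = 20 the engine can compute which single bucket
// file could possibly contain matching rows.
val numBuckets = 4
def bucketFor(age: Int): Int = (age.hashCode % numBuckets + numBuckets) % numBuckets

println(bucketFor(20))   // the only bucket that needs scanning for age = 20
```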
But what I heard is that in Spark, 1 partition = 1 block size... partitions are not created like in Hive using a specific column name. Again, in Spark, when it comes to bucketing, as you said 1 bucket should be at minimum a block size... so does that mean 1 bucket = 1 partition? Then what is the need for bucketing in Spark? I'm confused.
Is bucketing not usable with the save() method? It works fine with saveAsTable(). I'm getting this error: AnalysisException: 'save' does not support bucketBy and sortBy right now.
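Yes, that matches my experience; a small sketch with made-up names showing the pattern that works (bucketBy + saveAsTable) next to the one that throws:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketby-save-demo").enableHiveSupport().getOrCreate()
val people = spark.read.parquet("hdfs:///data/people")   // hypothetical source

// This combination throws AnalysisException in the Spark versions I've used,
// because a path-based save() has no table metadata to record the buckets in:
// people.write.bucketBy(8, "age").sortBy("age").parquet("hdfs:///out/people")

// Writing through the catalog works, since the bucket spec is stored
// as table metadata:
people.write
  .mode("overwrite")
  .bucketBy(8, "age")
  .sortBy("age")
  .saveAsTable("people_bucketed")
```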
Thanks, Harjeet. It was a great explanation. Quick question for you - What will happen if we remove a partition key after loading the data (in managed and external tables)?
How can you remove the partition key once the table is created? If you drop and recreate the table without the partition, the data present in the table's physical location cannot be read by the table; it will give a parsing exception.
If we want to query the table for country = 'india' and age = 20, now that we have created the new bucketed table, do we have to query the bucketed table or the initial table? A little lost here.