What a wonderful job you have done helping us understand not just the theory but the practical look and feel as well. I highly appreciate your efforts to create such valuable content for us.
[2024-08-15 11:27 BST] Databricks Community Edition is very slow. Pages are taking too long to load, and my internet speed is fine. Does anyone know why this is?
I have a couple of questions. For me, the .parquet file is not showing in _delta_log after inserting more than 10 records; as you said, maybe an admin setting is configured differently. 1) How do I check whether the checkpoint .parquet file is hidden or not? 2) If it is hidden, how can I view the file? 3) If I use this command: c = spark.conf.set("spark.databricks.delta.checkpointInterval", "10") followed by print(f"interval: {c}"), it shows "None".
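For anyone hitting the same thing with question 3: `spark.conf.set` returns None by design, so printing its result will always show None. A minimal sketch of setting the interval and then reading it back (the value is just an example):

```python
# setting the checkpoint interval; spark.conf.set returns None by design,
# so assigning its result to a variable and printing it always shows None
spark.conf.set("spark.databricks.delta.checkpointInterval", "10")

# read the value back with spark.conf.get instead
interval = spark.conf.get("spark.databricks.delta.checkpointInterval")
print(f"interval: {interval}")  # -> 10
```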
I have a doubt regarding the update operation. You mentioned that the Delta engine scans for the particular files that have records needing updates and then updates them. But if that is the case, how is time travel possible? Updating the existing files would result in loss of historical data.
Parquet files are immutable in nature. So during an update, the relevant files are scanned and, based on the updated values, new Parquet files are created. It won't overwrite the existing Parquet files.
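To make the time-travel side concrete: because updates write new Parquet files and the old ones are only marked removed in the transaction log, earlier versions stay readable. A minimal sketch (the table path is a placeholder):

```python
# read the current state of the Delta table (path is illustrative)
current = spark.read.format("delta").load("/mnt/delta/events")

# time travel: read the same table as of an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")

# or as of a timestamp
old = (spark.read.format("delta")
            .option("timestampAsOf", "2024-08-01")
            .load("/mnt/delta/events"))
```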
Dear Raja, a small request: in this interview series I cannot see the videos in sequential order. For example, I cannot find video numbers 5, 6, 7, 8, 49, 50, 51, and so on. If possible, could you please arrange them?
Hi Ranjan, those video numbers follow the ordering of the entire video list, so some numbers are missing from the interview series because those topics are not part of the interview questions.
Very nice and helpful tutorials. The lectures are so good and to the point that I went through the entire series in a day. Learnt so much, thank you for posting these videos. I have become your follower and fan.
Hi Raja, Delta tables do not support bucketing. How can we achieve bucketing with a Delta table? Also, could you please make one detailed video on bucketing explaining the internals? When we create buckets in Hive, the total number of files equals the number of buckets, but in Spark it is different. Could you explain how data is distributed to each node from two files? It would be great for us. Thank you.
Hi Venkata, yes, Delta tables do not support bucketing, but there are 2 workarounds, sketched below: 1. We can use the Delta table optimization Z-order, which co-locates related data together, much like bucketing, and improves performance. 2. We can write bucketed data as Parquet files in a location and convert those Parquet files to a Delta table. Yes, I can post a video on bucketing soon.
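A rough sketch of both workarounds (table and column names are made up for illustration; treat this as the suggestion above, not a guaranteed recipe):

```python
# Workaround 1: Z-order the Delta table on the would-be bucket key,
# co-locating related rows in the same files
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")

# Workaround 2: write bucketed Parquet first, then convert it to Delta;
# note: whether CONVERT TO DELTA accepts a bucketed table can depend on
# the Databricks runtime version, so verify on your cluster
(df.write
   .bucketBy(8, "customer_id")
   .sortBy("customer_id")
   .format("parquet")
   .saveAsTable("sales_bucketed"))
spark.sql("CONVERT TO DELTA sales_bucketed")
```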
@@rajasdataengineering7585 Thank you for the info. In my requirement, I have 1 fact table and 20 base tables. In this scenario, which will be more efficient: bucketing or broadcasting? Since AQE is enabled, it will prefer a broadcast join. Also, for this requirement, which cluster instance will be more efficient: compute optimized or storage optimized?
@@rajasdataengineering7585 Also a small doubt: let us assume I have df1 with 5 bucketed files and df2 with 5 bucketed files, so 10 files in total, and I have 4 worker nodes. How does the data distribution happen here? How does it eliminate shuffling?
Hi Venkat, broadcast is suitable if your dim tables are tiny (around 10 MB in size). Regarding cluster type, it depends on what kind of operations you perform on these tables and the size of these tables.
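For reference, that 10 MB figure is the default auto-broadcast threshold, and it can be tuned. A minimal sketch (fact_df and dim_df are placeholder DataFrames):

```python
from pyspark.sql.functions import broadcast

# the default spark.sql.autoBroadcastJoinThreshold is 10 MB; raise it if
# the dimension tables are known to fit comfortably in executor memory
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# or force a broadcast explicitly for a specific join
result = fact_df.join(broadcast(dim_df), "customer_id")
```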
In bucketing, the tables on both sides are bucketed on a specific key, which means the data has already been shuffled once based on that key, and the sorted data is written to disk. So currently there are 5 bucketed files for each table, and each file on one side matches exactly one file on the other side. This means pre-sorted data is loaded into cluster memory, and all relevant keys sit in the same executor memory for both tables. So no further shuffling is needed, which boosts performance. A small sketch follows.
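Here is what that looks like in practice (table and column names are illustrative). With matching bucket counts and join keys, the physical plan should show a sort-merge join with no shuffle:

```python
# bucket both tables on the join key with the same bucket count
df1.write.bucketBy(5, "id").sortBy("id").format("parquet").saveAsTable("t1")
df2.write.bucketBy(5, "id").sortBy("id").format("parquet").saveAsTable("t2")

# with matching bucket counts and keys, the plan should show a
# SortMergeJoin with no Exchange (shuffle) step
joined = spark.table("t1").join(spark.table("t2"), "id")
joined.explain()
```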
In Delta tables, how should we know Day-0 full load versus Day-1 incremental loading? In our project we need to create separate Day 0 and Day 1 pipelines. There is no merge statement in our Databricks notebook. How should we find out whether it is Day 0 or Day 1? Could you please clarify my doubts?
Hi Sravan, I couldn't understand the requirement exactly, but I can guide you based on what I understood. You have 2 different pipelines to populate data into the delta table, and later you want to know which pipeline was executed. For this scenario, let's go with log messages: in the log output file, we can provide detailed information about the pipeline.
@@rajasdataengineering7585 In the user story they mentioned Day 0 and Day 1 for the ingestion and consumption pipelines. Here we didn't write any merge statement for Day 1. How should we know in Delta tables whether it is a Day 0 (full) load or an incremental load? Is there any specific field available for full load versus incremental loading? 🤔
If I get it right, you want a separate pipeline for loading data for the first time (day 0), which contains data up to the current date from the system, also called a historical or full load. So you can have a separate pipeline to load the full data. Then, once you are done loading day 0, you would want to read incrementally: any data that came after your day 0 load you call day 1, the BAU data. So you can build a separate pipeline for it. If you don't want a merge statement, you basically want to keep all the data you read. Say for day 0 you have ids 1, 2, 3; now if on day 1 the ids come in as 1, 5, 6, you want to store them all, without checking whether there is already an existing record for a previously seen id, like id 1 here. Is that what you are saying? There is a concept of slowly changing dimensions (SCD1, SCD2, etc.) you should give a read.
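Since merge came up: if you do eventually need upserts for the day-1 load, a minimal Delta MERGE sketch looks like this (table and column names are placeholders, and incremental_df is a hypothetical DataFrame holding the day-1 batch):

```python
from delta.tables import DeltaTable

# upsert the day-1 (incremental) batch into the day-0 table
target = DeltaTable.forName(spark, "customers")

(target.alias("t")
       .merge(incremental_df.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()      # SCD1-style: overwrite changed rows
       .whenNotMatchedInsertAll()   # new ids (e.g. 5, 6) get inserted
       .execute())
```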
Thanks for all the videos, sir. I have read a few data engineering books and blogs in recent times; your sessions are way more detailed, with practical knowledge. Thanks for taking the time to do this.
Hi sir, the video is really helpful and clear, but when I tried the same steps I got 00000000000000000001.00000000000000000006.compacted.json on the 10th execution instead of a checkpoint Parquet file. Can you please help with this?
Thanks for your tutorial. I have a question about how to create another folder in DBFS like in the video. I tried right-clicking and creating a folder, but it didn't work.
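In a Databricks notebook, folders on DBFS are usually created with dbutils rather than through the UI. A minimal sketch (the path is a placeholder):

```python
# create a folder on DBFS from a notebook
dbutils.fs.mkdirs("/FileStore/my_data")

# list the parent folder to confirm it was created
display(dbutils.fs.ls("/FileStore/"))
```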
Very, very nice explanation. I am already a fan of Raja's Data Engineering channel. Just wondering whether I can get a copy of this notebook for practice, please?
@@rajasdataengineering7585 It would be really great if you made a video on frameworks. There is a channel on YouTube which explains development through a framework, but his videos are very boring. Ref: Dirtylab