
52. Databricks | PySpark | Delta Lake Architecture: Internal Working Mechanism

Raja's Data Engineering
26K subscribers
43K views

Published: 1 Oct 2024

Comments: 89
@farzicoderz
@farzicoderz 2 years ago
What wonderful work you have done to help us understand not just the theory but the practical look and feel as well. I highly appreciate your efforts to create such valuable content for us.
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you for your kind words, Ayushi!
@bhavanabh-o1h
@bhavanabh-o1h 10 months ago
Can you please share this notebook?
@SidharthanPV
@SidharthanPV 2 years ago
I was wondering how Delta Lake handles ACID features, and then this video came along. Thank you for making this!
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you
@UmerPKgrw
@UmerPKgrw 1 month ago
[2024-08-15 11:27 BST] Databricks Community Edition is very slow. Pages are taking too much time to load. My internet speed is fine. Does anyone know why this is?
@vaidhyanathan07
@vaidhyanathan07 4 months ago
I have a couple of questions. For me the checkpoint .parquet file is not showing up in _delta_log after inserting more than 10 records; probably, as you said, the admin setting may have been configured differently. 1) How do I check whether the checkpoint .parquet file is hidden or not? 2) If it is hidden, how do I view the file? 3) If I use this command: c = spark.conf.set("spark.databricks.delta.checkpointInterval", "10"); print(f"interval: {c}") — for me it shows "None".
@PinaakGoel
@PinaakGoel 12 days ago
I have a doubt regarding the update operation. You mentioned that the Delta engine scans for those particular files which have records that need to be updated and then updates them, but if this is the case, how could time travel be possible? Updating existing files would result in the loss of historical data.
@rajasdataengineering7585
@rajasdataengineering7585 12 days ago
Parquet files are immutable in nature. So during an update, the relevant files are scanned and, based on the updated values, new parquet files are created. It won't overwrite the existing parquet files.
@PinaakGoel
@PinaakGoel 11 days ago
@@rajasdataengineering7585 Understood, thanks for your reply, and kudos to your effort in compiling this Databricks playlist!
@rajasdataengineering7585
@rajasdataengineering7585 11 days ago
You are welcome!
@shakthimaan007
@shakthimaan007 2 months ago
Awesome work, bro. Have you put these notebooks somewhere on your GitHub? Can you share that with us if possible?
@ranjansrivastava9256
@ranjansrivastava9256 9 months ago
Dear Raja, a small request: for this interview series I cannot see the videos in sequential order. For example, I cannot see video numbers 5, 6, 7, 8, 49, 50, 51, etc. If possible, can you please help arrange them?
@rajasdataengineering7585
@rajasdataengineering7585 9 months ago
Hi Ranjan, those video numbers follow the ordering of the entire video list, so some numbers are missing from the interview series because those topics are not part of the interview questions.
@purnimasharma9734
@purnimasharma9734 2 years ago
Very nice and helpful tutorials. The lectures are so good and to the point that I went through the entire series in a day. Learnt so much, thank you for posting these videos. I have become your follower and fan.
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you Purnima
@venkatasai4293
@venkatasai4293 2 years ago
Hi Raja, delta tables do not support bucketing. How can we achieve bucketing with a delta table? Also, could you please make one detailed video on bucketing, explaining the internals? When we create bucketing in Hive, the total number of files equals the number of buckets, but in Spark it is different. Could you explain how data is distributed to each node from two files? It would be great for us. Thank you.
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Hi Venkata, yes, delta tables do not support bucketing, but there are 2 workarounds. 1. We can use the delta table optimization Z-ORDER, which co-locates related data together, much like bucketing, and improves performance. 2. We can write bucketed data as parquet files in a location and convert these parquet files to a delta table. Yes, I can post a video on bucketing soon.
@venkatasai4293
@venkatasai4293 2 years ago
@@rajasdataengineering7585 Thank you for the info. In my requirement I have 1 fact table and 20 base tables. In this scenario, which will be more efficient, bucketing or broadcasting? Since AQE is enabled, it will prefer a broadcast join. Also, for this requirement, which cluster instance will be more efficient, compute optimized or storage optimized?
@venkatasai4293
@venkatasai4293 2 years ago
@@rajasdataengineering7585 Also a small doubt: let us assume I have df1 with 5 bucketed files and df2 with 5 bucketed files, so 10 files in total, and I have 4 worker nodes. How does the data distribution happen here? How does it eliminate shuffling?
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Hi Venkat, broadcast is suitable if your dim tables are tiny (around 10 MB in size). Regarding cluster type, it depends on what kind of operations you perform on these tables and on the size of these tables.
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
In bucketing, the tables on both sides are bucketed on a specific key, which means the data is already shuffled once based on that key and the sorted data is written to disk. So currently there are 5 bucketed files for each table, and each file will match with exactly one file on the other side. This means pre-sorted data is loaded into cluster memory and all relevant keys are in the same executor memory for both tables. So no further shuffling is needed, which boosts performance.
@patriotbharath
@patriotbharath 2 years ago
Please provide the code snippet and file which you are using for practice, bro. Great content.
@sravankumar1767
@sravankumar1767 2 years ago
In Delta tables, how should we know whether it is the Day 0 full load or the Day 1 incremental load? In our project we need to create separate Day 0 and Day 1 pipelines. There is no MERGE statement in our Databricks notebook. How should we find out whether it is Day 0 or Day 1? Could you please clarify my doubts.
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Hi Sravan, I couldn't understand the requirement exactly, but I can guide you based on what I understood. You have 2 different pipelines to populate data into the delta table, and later you want to know which pipeline was executed. For this scenario, it's better to go with log messages. In the log output file, we can provide detailed information about the pipeline.
@sravankumar1767
@sravankumar1767 2 years ago
@@rajasdataengineering7585 In the user story they mentioned Day 0 and Day 1 for the ingestion and consumption pipelines. Here we didn't write any MERGE statement for Day 1. How should we know in Delta tables whether it is a full load or an incremental load? Is there any specific field available for full load and incremental loading 🤔
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
No Sravan, there is no specific functionality to handle this scenario
@joyo2122
@joyo2122 2 years ago
@@sravankumar1767 The first-time run should always be a full load, then incremental.
@farzicoderz
@farzicoderz 2 years ago
If I get it right, you want a separate pipeline for loading data for the first time (Day 0), which contains data up to the current date from the system, also called a historical or full load. So you can have a separate pipeline to load the full data. Then, once you are done loading Day 0, you want to read incrementally: any data that came after your Day 0 load you call Day 1, the BAU data. So you can build a separate pipeline for that. If you don't want a MERGE statement, you basically want to keep all the data you read. Say for Day 0 you have ids 1, 2, 3; now on Day 1, if ids 1, 5, 6 come in, you want to store them all and don't want to check whether there is already an existing record for a previously existing id, like id 1 here. Is that what you are saying? There is a concept of slowly changing dimensions (SCD1, SCD2, etc.) you should give a read.
@kamatchiprabu
@kamatchiprabu 1 month ago
Clearly understood, sir. Thanks.
@rajasdataengineering7585
@rajasdataengineering7585 1 month ago
Glad to hear that! You are welcome
@ravipaul1657
@ravipaul1657 2 years ago
Do we need Delta tables if we have Synapse Analytics and we are performing our ETL tasks using Azure Databricks?
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
It is not mandatory. But it can be decided based on the complete requirements of the project and the recommended architecture.
@sowmyakanduri-t8t
@sowmyakanduri-t8t 4 months ago
The lectures are very good but they are not organized properly. They cover more of PySpark in Databricks, not much about Databricks itself.
@vinodhmani7773
@vinodhmani7773 1 year ago
Thanks for all the videos, sir. I have read a few data engineering books/blogs in recent times; your sessions are way more detailed, with practical knowledge. Thanks for taking the time to do this.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Thanks and welcome. Glad it helps data engineers in the community
@Vishu-ru4iw
@Vishu-ru4iw 9 months ago
Hi sir, the video is really helpful and clear, but when I tried the same, I got 00000000000000000001.00000000000000000006.compacted.json on the 10th execution instead of a checkpoint parquet file. Can you please help with this?
@pridename2858
@pridename2858 1 year ago
This is a really eye-opening video. Kindly keep doing it; it is so full of knowledge. Great work.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Thanks for your comment! Hope it helps you gain knowledge of delta internals. Sure, I will keep creating more videos.
@totnguyen3308
@totnguyen3308 1 year ago
Thanks for your tutorial. I have a question about how to create another folder in DBFS like in the video. I tried right-clicking and creating a folder, but it didn't work.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
You can create a folder using the file system magic command %fs mkdirs or the dbutils command dbutils.fs.mkdirs.
@totnguyen3308
@totnguyen3308 11 months ago
@@rajasdataengineering7585 Thank you, I did it.
@rajasdataengineering7585
@rajasdataengineering7585 11 months ago
Great
@satijena5790
@satijena5790 1 year ago
Very, very nice explanation. I am already a fan of Raja's data engineering channel. Just wondering whether I can get a copy of this notebook for practice, please?
@andre__luiz__
@andre__luiz__ 11 months ago
Thank you for this video and its amazing content!
@rajasdataengineering7585
@rajasdataengineering7585 11 months ago
Glad you enjoyed it!
@battulasuresh9306
@battulasuresh9306 2 years ago
It would be so helpful if the lectures were arranged in order.
@ashishbarwad9471
@ashishbarwad9471 2 months ago
Best means best video ever if you are interested in learning.
@rajasdataengineering7585
@rajasdataengineering7585 2 months ago
Thank you
@midhunrajaramanatha5311
@midhunrajaramanatha5311 2 years ago
Can you provide a link to access the notebooks in the video description? That would be very useful.
@sangeetharamakrishnan6288
@sangeetharamakrishnan6288 2 years ago
Really helpful video... Thank you very much indeed.
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thanks!
@demudunaidugompa
@demudunaidugompa 11 months ago
Great content and very helpful. Thank you so much for sharing valuable content.
@rajasdataengineering7585
@rajasdataengineering7585 11 months ago
Glad it was helpful!
@joyo2122
@joyo2122 2 years ago
The time travel functionality of Delta tables is awesome.
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Very true
@padmavathyk1538
@padmavathyk1538 1 year ago
Could you please post the queries which you used in the video?
@adityaarbindam
@adityaarbindam 3 months ago
Excellent explanation, Raja. Very insightful.
@rajasdataengineering7585
@rajasdataengineering7585 3 months ago
Glad you liked it! Keep watching
@adityaarbindam
@adityaarbindam 3 months ago
Is it you, Kartik? I am guessing because of the way you use Notepad++ 🙂
@rajasdataengineering7585
@rajasdataengineering7585 3 months ago
No, this is Raja
@kelvink6470
@kelvink6470 1 year ago
Explained very well. Thank you.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Glad you liked it!
@niteshsoni2282
@niteshsoni2282 1 year ago
Great, sir! Loved it.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Thank you
@tejashrikadam7704
@tejashrikadam7704 1 year ago
You are doing great work😊
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Thanks! Hope it helps data engineering community
@sravankumar1767
@sravankumar1767 2 years ago
Nice explanation Raja 👌 👍 👏
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you Sravan!
@sapkyoshi
@sapkyoshi 1 year ago
What are all the slashes for? Can anyone tell?
@Umerkhange
@Umerkhange 1 year ago
superb
@ravikumar-sz1je
@ravikumar-sz1je 2 years ago
Very good explanation
@rajasdataengineering7585
@rajasdataengineering7585 2 years ago
Thank you
@sureshrecinp
@sureshrecinp 1 year ago
Thank you for the info.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Any time!
@Umerkhange
@Umerkhange 1 year ago
When working on big projects, do you create a framework or just use the Spark core APIs?
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
There is no standard framework; the framework is designed depending on the use case.
@UmerPKgrw
@UmerPKgrw 1 year ago
@@rajasdataengineering7585 It would be really great if you made a video on frameworks. There is a channel on YouTube which explains development through a framework, but his videos are very boring. Ref: Dirtylab
@UmerPKgrw
@UmerPKgrw 1 year ago
@@rajasdataengineering7585 Datyrlab
@tanushreenagar3116
@tanushreenagar3116 1 year ago
Best tutorial 👌
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Glad it helped
@aravind5310
@aravind5310 1 year ago
Nice content.
@rajasdataengineering7585
@rajasdataengineering7585 1 year ago
Thanks!
@harshadeep7506
@harshadeep7506 8 months ago
Nice one
@rajasdataengineering7585
@rajasdataengineering7585 8 months ago
Thanks for watching