
Delta Change Feed and Delta Merge pipeline (extended demo) 

Dustin Vannoy
2.9K subscribers
2K views

Published: 20 Oct 2024

Comments: 6
@capoeiracordoba, 1 year ago
Hi Dustin, great job, do you have any notebook example in a GitHub repository? Thanks!!
@DustinVannoy, 1 year ago
github.com/datakickstart/azure-data-engineer-databricks/blob/main/best_of_class_recruiting/nb_refined_table_load.py
@ThisIsFrederic, 11 months ago
Hi Dustin, Thank you so much for sharing this demo with us. While trying to adapt it to my environment (I am using Synapse), I am facing an issue that I hope you could help me resolve: when the target delta table does not exist, I noticed that after I create it, CDF shows as enabled only from version 1, not version 0. The initial version 0 is for the initial WRITE only, with no CDF enabled. Consequently, I cannot use your trick to load everything from version 0 if the table does not exist. I tried to use "SET spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;" but Synapse seems to ignore it completely. I also tried to include the option of enabling CDF while saving the delta table, as shown below, but again, CDF only gets enabled at version 1: df_records.write.format('delta').option("delta.enableChangeDataFeed", "true").save(target_path) Any clue? Thanks!
@ThisIsFrederic, 11 months ago
Well, I just discovered that when you create a delta table, adding option("delta.enableChangeDataFeed", "true") is not enough. When creating the temp view to switch to SQL, you also need to add the delta.enableChangeDataFeed = true option to the TBLPROPERTIES when issuing the CREATE OR REPLACE TABLE statement, and this works. Still, the question about enabling CDF by default in Synapse remains, if you ever have a clue. Thanks!
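A minimal sketch of the fix described in the comment above, for a Synapse Spark pool. The table name, view name, storage path, and df_records are placeholders; the point being illustrated is the TBLPROPERTIES clause on the CREATE OR REPLACE TABLE statement.

    # Register the table with CDF enabled from its first commit (version 0).
    # Names and the ADLS path below are placeholders.
    df_records.createOrReplaceTempView("v_records")

    spark.sql("""
        CREATE OR REPLACE TABLE silver_customers
        USING DELTA
        LOCATION 'abfss://lake@mystorageaccount.dfs.core.windows.net/silver/customers'
        TBLPROPERTIES (delta.enableChangeDataFeed = true)
        AS SELECT * FROM v_records
    """)

With the property declared at CREATE time, the commenter reports that change data feed is active from the table's first version, so the "start from version 0" approach from the video also works on a freshly created table.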
@gardnmi, 1 year ago
If you use Spark Structured Streaming for batch processing, you can just use the Delta tables themselves as sinks and you don't have to keep track of the table's state yourself. There is some good documentation on Databricks if you search for "Delta table as a sink". My current go-to pattern is append-only ingest for bronze, then a streaming merge into silver with the change data feed turned on for that table. In the gold layer you can then read the change data feed, which is append-only as well, and apply the CDC updates to the gold aggregates.
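A rough sketch of the bronze-to-silver part of the pattern described in the comment above, assuming Delta Lake with PySpark; the table names, checkpoint path, and merge key "id" are placeholders, not anything from the video.

    from delta.tables import DeltaTable

    # Merge each micro-batch from the append-only bronze table into silver.
    def upsert_to_silver(microbatch_df, batch_id):
        silver = DeltaTable.forName(spark, "silver_customers")
        (silver.alias("t")
            .merge(microbatch_df.alias("s"), "t.id = s.id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())

    (spark.readStream
        .table("bronze_customers")                  # append-only ingest table
        .writeStream
        .foreachBatch(upsert_to_silver)
        .option("checkpointLocation", "/checkpoints/silver_customers")
        .trigger(availableNow=True)                 # incremental, batch-style run
        .start())

Run this way, the streaming checkpoint (not your own bookkeeping) records how far into bronze the merge has progressed; silver itself needs delta.enableChangeDataFeed = true for the gold step.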
@felipecastro3710, 1 year ago
Hi! I am doing the same process as you for bronze and silver ingestion. About using CDF for the gold layer, won't I need to keep checkpoints of the versions I have already loaded? Getting the MAX(last_modified) seems like a heavy operation on big tables. Imagining a daily run, how do you usually filter the data when querying the CDF so that only data that should actually be merged gets merged?
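One way to avoid tracking versions or filtering on MAX(last_modified) by hand is to read the change data feed as a stream, so the streaming checkpoint records which table versions have already been consumed. A sketch under those assumptions (table name, checkpoint path, and the gold merge logic are placeholders):

    # Incrementally consume silver's change feed; the checkpoint remembers the
    # last commit version read, so no scan for MAX(last_modified) is needed.
    def apply_changes_to_gold(changes_df, batch_id):
        # _change_type marks each row as insert, update_preimage,
        # update_postimage, or delete; keep the latest change per key and
        # MERGE it into the gold aggregate table here (placeholder logic).
        pass

    (spark.readStream
        .option("readChangeFeed", "true")
        .table("silver_customers")
        .writeStream
        .foreachBatch(apply_changes_to_gold)
        .option("checkpointLocation", "/checkpoints/gold_from_silver_cdf")
        .trigger(availableNow=True)
        .start())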