This channel is a resource for learning about modern data technologies and practices from a Data Engineer's perspective, from kickstart tutorials to tips for taking your skills to the next level.
Dustin Vannoy is a consultant in data analytics and engineering. His specialties are modern data pipelines, data lakes, and data warehouses. He loves to share knowledge with the data engineering and science community.
Great video! Is there a way to override variables defined in databricks.yml in each of the job yml definitions so that the variable has a different value for that job only?
Thanks a lot, @DustinVannoy for this great presentation! I have a question: which is the better approach for project structure: one bundle yml config file for all my sub-projects, or should each sub-project have its own databricks.yml and bundle yml file? Thanks again :)
Once the code is deployed it gets uploaded to the shared folder. Can't we store that somewhere else, like an artifact repository or a storage account? There is a chance that someone may delete that bundle from the shared folder. It has always been like this with Databricks deployments, both before and after asset bundles.
You can set permissions on the workspace folder and I recommend also having it all checked into version control such as GitHub in case you ever need to recover an older version.
Loving bundles so far. The only issue I've had is that the Databricks VS Code extension seems to be modifying my bundle yml file behind the scenes. For example, when I attach to a cluster in the extension, it will override my job cluster to use that attached cluster when I deploy to the dev target in development mode.
Run the driver program using multiple threads with this as well:

from threading import Thread  # threading for parallel table copies
from time import sleep  # time module added for demonstration

workerCount = 3  # number of threads to start before waiting

def display(tablename):  # function to read & load tables from X schema to Y schema
    try:
        # spark.table(f'{tablename}').write.format('delta').mode('overwrite').saveAsTable(f'{tablename}_target')
        print(f'Data copy from {tablename} -----To----- {tablename}_target is completed.')
    except Exception:
        print('Data copy failed.')
    sleep(3)

tables = ['Table1', 'Table2', 'Table3', 'Table4', 'Table5', 'Table3', 'Table7', 'Table8']  # list of tables to process

counter = 0
for value in tables:
    worker = Thread(target=display, args=(value,), name=value)  # create a thread named after the table
    worker.start()  # start the thread
    counter += 1
    if counter % workerCount == 0:
        worker.join()  # hold until every 3rd thread completes before starting more
        counter = 0
Hey Dustin, Thanks for the tutorial! I've successfully integrated the init script and have been receiving logs. However, I'm finding it challenging to identify the most useful logs and create meaningful dashboards. Could you create a video tutorial focusing on identifying the most valuable logs and demonstrating how to build dashboards from them? I think this would be incredibly helpful for myself and others navigating through the data. Looking forward to your insights!
This is what I have, plus the related blog posts. ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-92oJ20XeQso.htmlsi=OS-WZ_QrL-_kkwWu We mostly used our custom logs for driving dashboards, but also evaluated some of the heap memory metrics regularly as well.
Hello Dustin, Thank you for posting this video. This was very helpful!!! Pardon my ignorance, but I have a question about initializing the Databricks bundle. When you initialize the Databricks bundle through the CLI as the first step, does it create the required files in the Databricks workspace folder? Additionally, do we push the files from the Databricks workspace to our Git feature branch so that we can clone them locally, make the configuration changes, and push them back to Git for deployment?
Hello, sir, Thank you for this tutorial. I successfully integrated with log analytics. Could you please show me what we can do with these logs and how to create dashboards? I am eagerly awaiting your response. Please guide me.
Hi Dustin, Thank you for sharing this approach; I am going to use it for training Spark ML models. I had a question on using the daemon option. My understanding is that these threads will never terminate until a script ends. When do they terminate in this example? At the end of the cell, or after .join(), i.e. when all items in the queue have completed? I really appreciate any explanation you provide.
Hi, thanks for the informative video! I have a question: instead of sending a list to the notebook, I send a single table to the notebook using a ForEach activity (Synapse can do a maximum of 50 concurrent iterations). What would the difference be? Which would be more efficient? And what is best practice in this case? Thanks in advance!
Is it possible to add approvers in asset bundle based code promotion? Say one does not want the same dev to promote to prod, as prod could be maintained by other teams; or if the dev has to do code promotion, it should go through an approval process. Also, is it possible to add code scanning using something like SonarQube?
Like @gardnmi, I also used the map method that ThreadPoolExecutor has. Didn't need a queue. I created a new cluster (tagged for the appropriate billing category) and set the max workers on both the cluster and the threadpool:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=137) as threadpool:
    s3_bucket_path = 's3://mybucket/'
    threadpool.map(lambda table_name: create_bronze_tables(s3_bucket_path, table_name), tables_list)
How long does it take to deploy the Python wheel for you? For me it takes about 15 minutes, which makes me consider making the wheel project separate from the rest of the solution.
I was having so many issues using the other ThreadPool library in a notebook. It cut my notebook runtime down by 70%, but I couldn't get it to run in a Databricks job. Your solution worked perfectly! Thank you so much!
Yes. If the transformations are different per source table, you may want to pass the correct transformation function as an argument as well, or have something like a dictionary that maps each source table to its transformation logic.
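Roughly, that pattern could look like the sketch below. The table names and transform functions are placeholders, and it assumes the notebook-provided spark session:

from concurrent.futures import ThreadPoolExecutor

# Placeholder transformations -- swap in your real logic.
def clean_orders(df):
    return df.dropDuplicates(["order_id"])

def clean_customers(df):
    return df.dropna(subset=["customer_id"])

# Map each source table to the transformation it needs.
transform_by_table = {
    "orders": clean_orders,
    "customers": clean_customers,
}

def process_table(table_name):
    transform = transform_by_table[table_name]
    df = spark.table(table_name)  # assumes the notebook spark session
    transform(df).write.mode("overwrite").saveAsTable(f"{table_name}_target")

with ThreadPoolExecutor(max_workers=3) as pool:
    # list() forces iteration so any exception raised in a worker surfaces here
    list(pool.map(process_table, transform_by_table.keys()))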
I'm not sure of a way to do this, but I haven't put too much time into it. I do not believe the library used in this video can do that, but if you figure out how to get it to write to log4j also then it will go to Azure Monitor / Log Analytics with the approach shown.
Hey Dustin, Thank you so much for the video. I still have one doubt: I've been running a streaming query in a notebook for over 10 hours, and the streaming query statistics only show specific time intervals. How can I view input rate, processing rate, and other stats for different time ranges, or for the entire 10 hours, to help with debugging?
Check out how to use Query Listener from this video and see if that covers what you are after. ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-iqIdmCvSwwU.html
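For reference, a minimal sketch of a PySpark streaming query listener (assumes Spark 3.4+ or a recent Databricks runtime and the notebook-provided spark session; the print calls are placeholders for whatever log sink you actually use):

from pyspark.sql.streaming import StreamingQueryListener

class RateLogger(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"Query started: {event.id}")

    def onQueryProgress(self, event):
        # Each micro-batch reports its own rates; persist these to keep the full history.
        p = event.progress
        print(f"{p.name} batch {p.batchId}: input {p.inputRowsPerSecond} rows/s, "
              f"processed {p.processedRowsPerSecond} rows/s")

    def onQueryTerminated(self, event):
        print(f"Query terminated: {event.id}")

spark.streams.addListener(RateLogger())  # register before starting the streaming query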
I tried this. However, I noticed an issue when I have a single notebook that creates multiple threads, where each thread calls a function that creates Spark local temp views: the views get overwritten by the second thread since it is essentially the same Spark session. How do I get around this?
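(Not something covered in the video, but one common workaround is to give each call its own view name so concurrent threads don't collide; a sketch with placeholder names, assuming the shared notebook spark session:)

import uuid

def summarize(table_name):
    df = spark.table(table_name)
    # A unique view name per call keeps threads from overwriting each other's views.
    view_name = f"tmp_{table_name}_{uuid.uuid4().hex}"
    df.createOrReplaceTempView(view_name)
    return spark.sql(f"SELECT COUNT(*) AS row_count FROM {view_name}")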
@DustinVannoy Yeah, I had that in mind. Unfortunately I can't, as the existing jobs are stable in production. However, this is definitely useful for new implementations.
Check out this video and the related blog for the latest tested versions. It may work with 14 as well, but it has only been tested with LTS runtimes. ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-CVzGWWSGWGg.html
I am close to finalizing a video on how to do this for newer runtimes, and I built it on Windows this time using WSL. For Databricks Runtimes 11.3 and above there is a branch named l4jv2 that works.
Hi Dustin, I want to take a DataFrame with streaming logs that I'm reading from an Event Hub and send them to Log Analytics, but I'm not receiving any data in the Log Analytics workspace or Azure Monitor. What might the problem be? Do I need to create a custom table beforehand? DCR or MMA? I don't know why I'm not getting any data or what I'm doing wrong...
Is this still an issue? If so, is it related to using the spark-monitoring library? I have a quick mention of how to troubleshoot that towards the end of this new video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-CVzGWWSGWGg.html
Only 2 min in and he already lost me. At 1:53 I can't see the referenced screen 😆! For future videos, it would be greatly appreciated if the necessary prerequisites could at least be listed in the description box. This -> ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-M7C-MyVHyrU.html
Hi Dustin, Thank you so much for sharing this demo with us. While trying to adapt it to my environment (I am using Synapse), I am facing an issue that I hope you could help me resolve: when the target delta table does not exist, I noticed that after I create it, CDF shows as enabled only at version 1 and not version 0. The initial version 0 is for the initial WRITE only, with no CDF enabled. Consequently, I cannot use your trick to load everything from version 0 if the table does not exist. I tried to use "SET spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;" but Synapse seems to ignore it completely. I also tried to include the option to enable CDF while saving the delta table as shown below, but again, CDF only gets enabled at version 1: df_records.write.format('delta').option("delta.enableChangeDataFeed", "true").save(target_path) Any clue? Thanks!
Well, I just discovered that when you create a delta table, adding option("delta.enableChangeDataFeed", "true") is not enough. When creating the temp view to switch to SQL, you also need to add delta.enableChangeDataFeed = true to the TBLPROPERTIES when issuing the CREATE OR REPLACE TABLE statement, and this works. Still, the question about enabling CDF by default in Synapse remains, if you ever have a clue. Thanks!
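For anyone following along, a sketch of what that looks like (the view, schema, and table names here are placeholders):

df_records.createOrReplaceTempView("records_vw")  # temp view used to switch to SQL

spark.sql("""
    CREATE OR REPLACE TABLE target_schema.target_table
    USING DELTA
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
    AS SELECT * FROM records_vw
""")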
How does this work within a team with multiple projects? How do I deploy multiple projects in GitHub Actions? Am I creating a bundle folder per project? Or do I have a mono folder with everything Databricks in it?
You can have different subfolders in your repo, each with their own bundle yaml, or you could have one at the root level and import different resource yaml files. It should only deploy the assets that have changed, so I tend to suggest one bundle if everything can be deployed at the same time.
Great video! What should be the best approach to switch between dev and prod inside the code? Example: df_test.write.format('delta').mode('overwrite').saveAsTable("dev_catalog.schema.table") How can I parametrize this so it automatically changes to: df_test.write.format('delta').mode('overwrite').saveAsTable("prod_catalog.schema.table")
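One common pattern (not necessarily the approach from the video) is to pass the catalog name in as a parameter, for example a notebook widget that the job or bundle target sets; the widget name here is just a placeholder:

# The job or bundle target would pass "dev_catalog" or "prod_catalog" for this parameter.
dbutils.widgets.text("target_catalog", "dev_catalog")
catalog = dbutils.widgets.get("target_catalog")

df_test.write.format("delta").mode("overwrite").saveAsTable(f"{catalog}.schema.table")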