
Advancing Spark - Developing Python Libraries with Databricks Repos 

Advancing Analytics
32K subscribers · 20K views

The addition of Databricks Repos changed a lot of our working processes around maintaining notebooks, but the process for building out our own Python libraries hasn't changed much over the years. With "Files for Databricks Repos", we suddenly see a massive shift in how we can structure our library development, with some huge productivity boosts in there.
In this video, Simon talks through the process from the ground up - taking a simple dataframe transformation, turning it into a function, building that function into a wheel, then replacing it with a direct reference inside Databricks Repos!
For more info on the new additions to Databricks Repos, check out docs.databricks.com/repos.htm...
As always, if you need help with your Data Lakehouse journey, stop by www.advancinganalytics.co.uk to see if we can help.
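To make the pattern concrete, here is a minimal sketch of the kind of workflow the video walks through (the module, function and column names below are illustrative, not the ones from the demo): a transformation lives in a plain .py file committed to the repo, and a notebook in the same repo imports it directly thanks to Files in Repos, with no wheel build in the inner development loop.

    # my_lib/transforms.py - a module file sitting alongside the notebooks in the repo
    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def add_ingest_date(df: DataFrame) -> DataFrame:
        """Append a load-date column to an incoming dataframe."""
        return df.withColumn("ingest_date", F.current_date())

    # In a notebook inside the same repo (recent runtimes put the repo root on sys.path
    # automatically; otherwise sys.path.append the repo path first):
    from my_lib.transforms import add_ingest_date

    df_out = add_ingest_date(spark.range(5).toDF("id"))
    display(df_out)

The same module can still be packaged into a wheel for deployment, which is exactly the comparison the video draws.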

Published: 11 Nov 2021

Comments: 34
@dmitryanoshin8004 · 2 years ago
You are the god of Databricks!! Enjoying watching and learning!
@lackshubalasubramaniam7311 · 1 year ago
I like the idea of keeping the code in a wheel structure so that we can build the wheels for unit testing and possibly integration testing. It's the best of both worlds. Nice! 😃
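As a rough sketch of the unit-testing benefit described above (the module and function names carry over from the illustrative example earlier and are not from the video), the packaged code can be exercised with pytest on a build agent using a local SparkSession:

    # tests/test_transforms.py - run with `pytest`; no Databricks cluster required
    import pytest
    from pyspark.sql import SparkSession

    from my_lib.transforms import add_ingest_date  # illustrative module from the sketch above


    @pytest.fixture(scope="session")
    def spark():
        # A small local SparkSession so the test can run in CI
        return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


    def test_add_ingest_date_appends_column(spark):
        df = spark.createDataFrame([(1,), (2,)], ["id"])
        result = add_ingest_date(df)
        assert "ingest_date" in result.columns
        assert result.count() == 2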
@toddflanders8155 · 2 years ago
I'm brand new to Databricks. I want to get a team of data engineers to do proper testing with CI/CD. Using wheels seemed to create as many problems as it solved. Local repo Python modules that are test-driven seem like a good baby step to ensuring quality while allowing rapid, sustainable development.
@niallferguson8019 · 2 years ago
This is great and I think it has solved an issue we have been battling with: how multiple people can develop at the same time. I guess for external dependencies you will need to make sure they are already installed on the cluster, so you lose the nice aspect of using pip install on your whl to automatically download them.
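One hedged workaround for the dependency point above: keep a pinned requirements.txt in the repo and install it notebook-scoped, so the dependencies that would normally be pulled in by the wheel's metadata still get installed. The workspace path below is purely illustrative, and installing from workspace files assumes a reasonably recent runtime.

    # Databricks notebook cell - install the repo's pinned dependencies for this notebook session
    %pip install -r /Workspace/Repos/some.user@company.com/my-repo/requirements.txt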
@briancuster7355 · 2 years ago
This is really cool. Thanks for showing this.
@mrakifcakir · 2 years ago
Hi Simon, thanks for the video. Regarding your concern at minute 22 (preferring a concrete version of the whl file in prod), I think it can be solved this way: the application repo builds a whl file and uploads it to S3 with a specific version, and that versioned package is loaded while the cluster spins up. The same can apply to the dev and test environments. To enable parallel development, the approach you show in this video can be used, but with different versions of the whl files (set in setup.py) so that developers don't overwrite the master package in dev/test. So while many developers can make changes and test them in workspace notebooks using the Repos import-module functionality you explain here, the prod environment runs a fixed version as a regular whl file, keeping the prod version of the whl under control. I did a similar process and it works fine. Best, Akif
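A small sketch of the versioning idea described above, assuming a classic setup.py build (the package name and version scheme are made up): each developer or CI run stamps its own version before building, so a dev wheel never overwrites the released one.

    # setup.py - the version is bumped (or suffixed per developer/branch) before each build
    from setuptools import setup, find_packages

    setup(
        name="my_lib",
        version="1.4.0.dev3",          # e.g. releases use 1.4.0, dev builds append .devN
        packages=find_packages(),
        install_requires=["pyspark"],  # declared dependencies travel with the wheel
    )

    # Built with: python setup.py bdist_wheel  (or: python -m build)
    # The resulting dist/my_lib-1.4.0.dev3-py3-none-any.whl can then be uploaded to S3
    # and referenced by version when the cluster spins up.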
@goat4real262 · 5 months ago
Thanks a lot for putting this video up; it is going to save my life!
@julsgranados6861 · 2 years ago
Thanks Simon! Just a great video, great content.
@AdvancingAnalytics · 2 years ago
Thanks
@almarey5533 · 2 years ago
This was awesome, thanks! Definitely thinking I want to change to use this feature rather than the wheels
@AdvancingAnalytics · 2 years ago
Good stuff. Let me know how you get on
@marcocaviezel2672 · 2 years ago
Coool! Thanks for this great video!
@AdvancingAnalytics · 2 years ago
Thanks
@sumukhghodke7566 · 2 years ago
Thanks for this perspective on using packages in notebooks. The dbx init command creates a package structure, but there is no clear documentation on how the package can be built/used within notebooks. It makes more sense now.
@taglud · 2 years ago
I really love this; it answers our questions around simplifying code.
@gass8 · 2 years ago
Hi, nice video. I would like to know if there is any way to do the same in R? I want to version my developed library and reference it in scripts from the repository. The material for R is very sparse.
@penchalaiahnarakatla9396 · 2 years ago
Good video. How do you read a Delta table inside a UDF? Please suggest.
@MicrosoftFabric · 2 years ago
Thanks Simon for sharing your knowledge. A question around managing code for multiple entities: would you create multiple git repos, one per Databricks entity, or one repo with all the entities in a databricks cicd-labs sort of folder structure?
@AdvancingAnalytics · 2 years ago
Great question. This is a big debate: mono vs multi repo. We always prefer multi-repo to make CI/CD less complicated.
@deepanjandatta4622 · 1 year ago
Hi, in case I create a FunctionTest notebook which sits outside of the Databricks Repos folder and comes from a different repo altogether, can I still import using "from Library.hydr8.audit.lineage import *" and call addLineage()? (Using the 2nd approach you showed?)
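For reference, one hedged way to do what's being asked (the workspace path below is purely illustrative): append the other repo's checkout path to sys.path before importing, since a notebook normally only gets its own repo root on the path automatically.

    import sys

    # Path to the repo that actually contains the Library package (illustrative)
    sys.path.append("/Workspace/Repos/some.user@company.com/hydr8-repo")

    # An explicit import is safer than `import *`
    from Library.hydr8.audit.lineage import addLineage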
@penter1992 · 2 years ago
If you want to use Poetry as a dependency manager, how do you solve this? How do you install the dependencies with this repo-based approach?
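There is no answer in the thread, but one hedged pattern for Poetry users (assuming the poetry export command/plugin is available) is to export the locked dependencies to a plain requirements file in CI and install that notebook-scoped or on the cluster, while the repo code itself is still imported directly.

    # In CI (shell), export Poetry's locked dependencies to a plain requirements file:
    #   poetry export -f requirements.txt --output requirements.txt --without-hashes
    #
    # Then in a Databricks notebook (path illustrative):
    %pip install -r /Workspace/Repos/some.user@company.com/my-repo/requirements.txt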
@chobblegobbler6671 · 1 year ago
Hey Simon! How does this work for a job cluster when launching Databricks operators from Airflow?
@dankepovoa · 1 year ago
Nice! But what if I need to use a Databricks-specific command like "dbutils" in a function of this library? A plain .py file wouldn't run it, so what would be another option?
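A couple of hedged options for the dbutils question (the function names here are made up): either pass the dbutils handle in from the calling notebook, or resolve it inside the module via pyspark.dbutils, which is available on Databricks runtimes but not in plain local Python.

    # Option 1: keep the library agnostic and let the notebook pass dbutils in
    def list_landing_files(dbutils, path: str):
        """List files in a landing folder using the dbutils handle supplied by the caller."""
        return dbutils.fs.ls(path)

    # Option 2: resolve dbutils inside the module when running on a Databricks cluster
    def get_dbutils(spark):
        from pyspark.dbutils import DBUtils  # only importable on Databricks runtimes
        return DBUtils(spark)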
@chandandey572 · 10 months ago
@AdvancingAnalytics Hi Simon, I finally found a tutorial related to my use case, thanks a lot for this. I have a query though, if you can please help me with it. I have an application running on a Linux server which has shell scripts, a conda env setup, a pip requirements.txt, and Python files for an ETL process. We also have SQLite for metadata management. If I have to move to Databricks with minimum code changes, how should I design this? In Databricks, what would be the alternative for the shell script calls, the SQLite DB, and the conda env setup, or will they work as they are?
@aradhanachaturvedi3352 · 1 month ago
Hi, I have one question: how can I expose my Databricks notebook as an endpoint for front-end applications? I think creating workflows and running a job will take some time to produce the output, and we want something in real time. Can you suggest an approach?
@AdvancingAnalytics · 1 month ago
The only way to expose "a notebook" is by having your app call the Jobs API and trigger that notebook - it can take input parameters, provide outputs etc., so it would mimic a web service, but you won't get over the latency problem of calling a Spark job. If you are trying to return the results of a SQL query, use the SQL endpoint & Serverless instead; if you're trying to do inference with machine learning, then use model serving endpoints. Those are pretty much your options!
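For illustration, a minimal sketch of the Jobs API route mentioned above (host, token and job_id are placeholders; the latency is still that of a full job run):

    import requests

    host = "https://<your-workspace-url>"
    token = "<personal-access-token>"

    # Trigger an existing job that wraps the notebook, passing parameters in
    resp = requests.post(
        f"{host}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": 12345, "notebook_params": {"customer_id": "42"}},
    )
    run_id = resp.json()["run_id"]
    # Poll /api/2.1/jobs/runs/get?run_id=<run_id> (and runs/get-output) for the result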
@darryll127 · 2 years ago
Simon, would this change your thinking / architecture for Hydro to put Config metadata in JSON files and not bother with a database at all?
@AdvancingAnalytics · 2 years ago
Nope, not on its own. It's certainly convenient to reference config directly, but you still need something to manage searching, queueing etc. Definitely a nice pattern for local config, but I'm not convinced I want production systems relying on having the right branch synced to a repo!
@darryll127 · 2 years ago
@@AdvancingAnalytics I understand what you're saying, but how is that any different from making sure production is built and deployed from the right branch? Furthermore, it seems to me it would eliminate complexity and artifacts that could be rationalized as driven by legacy limitations.
@AdvancingAnalytics · 2 years ago
@@darryll127 Very true, much of it is "it feels wrong" not "it is wrong". As long as we can still factor in the relevant code quality checks - linting/formatting, syntax checks, testing etc. - then there's no real reason that it's "bad". From a wider architecture view, keeping metadata purely in the repo means it's not accessible to other tools, so orchestration planes wouldn't see it, for example. If you're building an entirely Databricks-based architecture, it might be suitable?
@darryll127 · 2 years ago
@@AdvancingAnalytics You raise a good point; however, I could see a process whereby you have the option in your framework to read the JSON and store it in Delta tables (taking advantage of things like Auto Loader, schema evolution and complex types), and then take advantage of the Delta readers, which are not dependent on Databricks per se, as a means of exposing the config data to external environments.
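A rough sketch of that hybrid, assuming config JSON files shipped in the repo and an illustrative path and table name: load the JSON into a Delta table so anything with a Delta reader can see the config, while the files themselves stay versioned in git.

    # Read config JSON files checked into the repo (path illustrative; Auto Loader could
    # replace the batch read if configs land continuously)
    config_df = (
        spark.read
             .option("multiLine", True)
             .json("file:/Workspace/Repos/some.user@company.com/my-repo/config/*.json")
    )

    # Persist to Delta so external tools can read the config without Databricks
    (config_df.write
              .format("delta")
              .mode("overwrite")
              .saveAsTable("metadata.pipeline_config"))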
@allieubisse316 · 2 years ago
Hello
@paulnilandbbq · 2 years ago
Firstly, love the channel, Simon and the team! I have been using %run "../shared/xxxxx" as a technique to consume functions from a notebook stored within a git repo. Are there any downsides to this option? Many thanks!
@AdvancingAnalytics · 2 years ago
Great question. Wheels are transferable and testable; notebooks are not as easy to test. Wheels give you a more robust deployment option.
Up next
Advancing Spark - Understanding the Spark UI (30:19, 50K views)
Advancing Spark - Building Delta Live Table Frameworks (24:14)
15 Python Libraries You Should Know About (14:54, 377K views)
Azure Databricks using Python with PySpark (52:29, 77K views)
Advancing Spark - Databricks Delta Streaming (20:07, 28K views)
Wheel Files: Packaging Python Applications & Modules (21:10)