
Databricks Unity Catalog: Setup and Demo on AWS

Make With Data
1.6K subscribers
6K views

Learn as we walk through, step by step, how to start your Lakehouse journey with Databricks Unity Catalog on Amazon Web Services (AWS). In this video we'll go through the entire process, from creating the S3 bucket and writing the IAM policy to creating the Unity Catalog Metastore and demonstrating it in action with Databricks SQL!
Unity Catalog is a product on Databricks that unifies data governance on the Databricks platform, enabling your organization to develop strong access control over its data, analytics, and AI. Beyond access control lists, Unity Catalog also provides a number of other useful features, such as Data Lineage to track where your data assets are being used, both upstream and downstream.
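For readers following along, here is a minimal sketch of the kind of access-control commands the demo runs in Databricks SQL, written as PySpark calls from a Databricks notebook (where the `spark` session is provided by the runtime); the catalog, schema, table, and group names are illustrative assumptions, not values from the video:

```python
# Minimal sketch, assuming a Databricks notebook where `spark` (a SparkSession
# attached to a Unity Catalog-enabled workspace) is provided by the runtime.
# The names `dev`, `analytics`, `events`, and `data_analysts` are placeholders.

# Create a catalog and a schema; both are registered in the workspace's metastore.
spark.sql("CREATE CATALOG IF NOT EXISTS dev")
spark.sql("CREATE SCHEMA IF NOT EXISTS dev.analytics")

# Create a managed Delta table inside the schema.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dev.analytics.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    )
""")

# Grant fine-grained access to an account-level group.
spark.sql("GRANT USE CATALOG ON CATALOG dev TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA dev.analytics TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE dev.analytics.events TO `data_analysts`")
```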
Link to Unity Catalog overview:
docs.databrick...
Documentation to get started with Unity Catalog (for the IAM policy snippets from the video, please see the following link):
docs.databrick...
If you prefer deploying your cloud infrastructure as code, check out the following guide on setting up everything you need for Unity Catalog using Terraform!
registry.terra...
Thanks again for watching!

Published: 21 Aug 2024

Comments: 15
@ft_angel91 7 months ago
By far the best tutorial I've seen. Thank you for putting this out.
@kunnunhs1 3 months ago
It's the worst; unclear.
@chaitanyamuvva 2 months ago
Thanks for posting!! Much needed stuff.
@aaronwong8533 11 months ago
This is so helpful! Thank you for posting.
@hassanumair6967 1 year ago
Another suggestion: if you could make that kind of tutorial video, it would be great. It could cover backup and restoration of Databricks, like what we save in our S3 and what the parallel methods are, and restoration policies, specifically if we use a geo-redundant structure with a large number of users.
@SaurabhKumar-ic7nt 1 year ago
Awesome explanation!
@MakeWithData 1 year ago
Thank you!
@AthenaMao 3 months ago
Where can I find the JSON template of the custom trust policy?
@rajanzb 6 months ago
Wonderful demo. I have a question: where did you link up the Unity Catalog metastore to the catalog in the Data Explorer? And how is the S3 bucket attached to the table created in the schema of the dev catalog? Please clarify.
@MakeWithData 6 months ago
Thanks! Metastores are assigned to workspaces at the account level, and any catalogs you create in a workspace are automatically associated with its metastore; a workspace can only have one metastore assigned. When you create a metastore, you must configure a default S3 bucket for it, so your schemas/tables/etc. will be stored in that bucket by default; however, you can also set up additional buckets as "External Locations" in UC and then use those as the default root storage location for specific catalogs or schemas you create. Hope this helps!
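A minimal sketch of that pattern, again from a Databricks notebook where `spark` is provided by the runtime; the storage credential name, bucket path, and catalog name are illustrative assumptions, not the ones used in the video:

```python
# Sketch only: assumes a storage credential named `my_uc_credential` already
# exists (created beforehand, e.g. in the account console), and that the
# bucket path below is a placeholder for your own external bucket.

# Register an additional S3 bucket as an External Location in Unity Catalog.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS dev_landing
    URL 's3://my-dev-landing-bucket/'
    WITH (STORAGE CREDENTIAL my_uc_credential)
""")

# Use that location as the default root storage for a new catalog, so its
# managed schemas and tables land in this bucket instead of the metastore's
# default bucket.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS dev
    MANAGED LOCATION 's3://my-dev-landing-bucket/dev-catalog/'
""")
```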
@lostfrequency89 1 month ago
Is it possible to create volumes on top of this external storage container?
@hassanumair6967 1 year ago
And what if we want to create a volume? I am stuck while doing the Databricks configuration with AWS, using the demo version of Premium. The problem where I've been stuck is the default metastore, which comes up every time I try to create a volume.
@MakeWithData 1 year ago
Hi, I recommend submitting a question to Stack Overflow using the [databricks] tag. I and several others are very active in that forum and would be happy to help, given more details about your use case! Thank you for watching!
@user-px2pz3ec1x 10 months ago
Thank you for the video. I have a large (~15 GB) CSV file in S3. How can I process that data in Databricks? I don't want to mount the S3 bucket. Is there any way I can process this file in Databricks other than mounting it?
@MakeWithData 9 months ago
Yes, no need to mount your bucket; you can read it from a PySpark or Scala notebook in Databricks with spark.read.csv("s3://path/to/data"). 15 GB for a single file is quite large, though, so I would recommend splitting it into multiple smaller files if possible, so that you can get maximum parallelism from your Spark cluster. Ideally you can even convert it to Delta Lake format. If you don't split it up or convert it, you may need a cluster with more memory available.
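A minimal sketch of that approach, assuming a Databricks PySpark notebook where `spark` is provided by the runtime; the read options and the target table name are illustrative assumptions:

```python
# Read the CSV straight from S3 -- no bucket mount required.
# The path below is the placeholder from the reply above; replace it with yours.
df = (
    spark.read
    .option("header", "true")       # assumption: the file has a header row
    .option("inferSchema", "true")  # convenient, but slower on a 15 GB file
    .csv("s3://path/to/data")
)

# Write it out once as a Delta table so later reads are split across many
# Parquet files and can parallelize over the cluster.
(
    df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("dev.analytics.raw_events")  # placeholder table name
)
```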
Up next
Core Databricks: Understand the Hive Metastore
22:12
15K views
SWEET TOOTH FANS WILL UNDERSTAND 😁 @andrey.grechka
00:11
How to setup Databricks Unity Catalog with Terraform
29:37
Databricks Unity Catalog: A Technical Overview
17:29
22K views
Getting started with Databricks Terraform Modules
23:11