
Advancing Spark - External Tables with Unity Catalog 

Advancing Analytics
15K views

A very common pattern is for companies to have many different lakes, whether as part of a data mesh or simply as a reality of life in a large company. But with Unity Catalog expecting a single lake for each metastore, how do we manage access to other cloud stores?
In this video, Simon looks at the setup of storage credentials and managed identities, followed by the creation of external, unmanaged tables.
The primary steps for external tables can be found here: docs.microsoft.com/en-us/azur...
The managed identity setup steps can be found here: docs.microsoft.com/en-gb/azur...
As always, if you're setting up a Lakehouse and need a helping hand, get in touch with Advancing Analytics.
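For reference, a minimal sketch of the pattern the video walks through, run from a Databricks notebook. It assumes a storage credential (wrapping the Access Connector's managed identity) has already been registered, and all names and paths below are placeholders:

# 1. Register an external location for the other cloud store, secured by the
#    existing storage credential (placeholder names throughout).
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS other_lake
    URL 'abfss://data@otherlakeacct.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL lake_mi_credential)
""")

# 2. Create an external (unmanaged) table whose files live under that location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.sales_ext (
        id INT,
        amount DOUBLE
    )
    USING DELTA
    LOCATION 'abfss://data@otherlakeacct.dfs.core.windows.net/sales_ext'
""")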

Published: 24 Aug 2022

Comments: 34
@AyyyyyyyyyLmao 1 year ago
I came here looking for tips on how to build a table within the Unity game engine. Not what I asked for, but a great video nonetheless!
@alexischicoine2072 1 year ago
Creating tables as external has saved me many times when deleting tables by mistake. Having that extra step of deleting the data in storage isn't so bad if the data is important and hard to recover. Then if you mess up, recreating the external table is very easy. Obviously, if you're in a mature organization where you don't do anything manual in prod, it's not as much of an issue.
@paulroome3928 1 year ago
Great video! One suggestion - each external location can/should be used to create many external tables. Register an external location on a parent folder of your ADLS account, and when you create external tables in child directories, Unity will automatically figure out that you have access to do that!
@AdvancingAnalytics 1 year ago
Oh absolutely! The location I added was for the full lake container in this case, apologies if that wasn't clear in the vid!
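A minimal sketch of that parent-folder pattern, assuming a storage credential (here called lake_mi_credential) already exists and using placeholder paths and names:

# One external location registered on a parent folder...
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS lake_external_root
    URL 'abfss://lake@mystorageacct.dfs.core.windows.net/external'
    WITH (STORAGE CREDENTIAL lake_mi_credential)
""")

# ...and many external tables created in child directories beneath it.
# Unity Catalog resolves each table path against the registered location.
for tbl in ["customers", "orders"]:
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.{tbl} (id INT, name STRING)
        USING DELTA
        LOCATION 'abfss://lake@mystorageacct.dfs.core.windows.net/external/{tbl}'
    """)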
@ShravanKumar-yv4be 1 year ago
Any suggestions as to when to use managed and external tables? Would it be a good idea to use managed for the bronze/silver layers and external for the gold layer?
@RaiOkami 1 year ago
Great videos! I am relatively new to Databricks and even more so to UC, so your videos have been a really great help! I am interested in implementing it in our project for a client, just to get the permissions and governance stuff out of the way. But what exactly is the best practice for storing these tables? Is it really better to save them as external tables than managed? I was told on the forums that UC would handle the file saves in ADLS plus the actual table registration in the metastore, yet by default it is still a managed table.
@AdvancingAnalytics 1 year ago
So our /old/ advice was always to use external for full control over storage. Given UC gives far more flexibility, that's less important. Now the question is how strongly coupled you want the SQL object to be to the underlying data. If someone deletes the table, do you want the data to be deleted too? It comes down to process/ownership.
@RaiOkami 1 year ago
@AdvancingAnalytics Appreciate the response! I see. So for managed tables the real big drawback is that the data itself is tied to the table object in the metastore (be it in Hive or UC), unlike external tables, where you can drop the table "representation" of the files in ADLS but can pull them back at any time with all the original data untouched. Did I get the rationale right?
@AdvancingAnalytics 1 year ago
@RaiOkami Yep, exactly.
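A small sketch of the coupling difference described in this exchange, using placeholder names - dropping the external table only removes the metastore entry, while dropping the managed table removes the underlying data too:

# Managed table: Unity Catalog owns the storage, so DROP TABLE also deletes the data.
spark.sql("CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.events_managed (id INT)")

# External table: the metastore only points at files you own.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.events_ext (id INT)
    USING DELTA
    LOCATION 'abfss://lake@mystorageacct.dfs.core.windows.net/external/events_ext'
""")

spark.sql("DROP TABLE demo_catalog.demo_schema.events_ext")      # files stay in ADLS; re-register any time
spark.sql("DROP TABLE demo_catalog.demo_schema.events_managed")  # underlying data is deleted as well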
@KarloKotarac 1 year ago
Thanks for the great video(s)! Unfortunately, I am not able to create an external location as you suggested in the video, root@storage_account_name, which should (I guess) point to the whole storage account - I need to do it container by container. Also, creating external tables doesn't seem to be possible anymore by using storage credentials directly (previously it was possible via the WITH clause). Do you have any idea how to address this issue?
@KarthikPaladugu-kz8rt 1 year ago
The articles don't show how we need to update external locations in existing workflows after enabling Unity Catalog. We can't use DBFS as per the recommendations and need to use external locations instead, so how do we update existing code to point to external locations? We will use the upgrade option in Unity Catalog to migrate external tables, but how do we update workflows to point to the external locations?
@palanithangaraj797 1 year ago
Can you create persistent views from external tables within Unity Catalog?
@yashgoel5315 1 year ago
Is there any video on how to set up Unity Catalog?
@dofa47 1 year ago
Hi, late to the party but I have a question: can I have read-only access to the storage account with the data, so that no one can modify the prod data? In other words, where is the metadata of the external table saved?
@BritonWells 11 months ago
Is there a solution/update to the issue around 15:00? To me that seems like a deal breaker. I want to expose the same external table to many curated catalogs. Do managed tables have the same limit?
@alexanderowens4898 1 year ago
Always love your videos… a couple of questions: 1. How can I provide an external path using Scala saveAsTable() for UC? 2. Wouldn't the use of external tables limit the ability to get and use lineage tracking if you load data from and then save data to external locations (unmanaged tables)?
@AdvancingAnalytics 1 year ago
Hey! When using external tables traditionally, you can create the database with a location (CREATE DATABASE mydatabase LOCATION '/mnt/data/mydatabase/', etc.). Then if you use saveAsTable() it will inherit the external location of that database. I've not tried it with UC, so can't guarantee that it works! And with lineage, I've not looked into how well supported external tables are currently, but when lineage is GA, I'd hope that it supports both types of table!
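For anyone trying the pattern in that reply, a minimal sketch of the traditional (hive_metastore) behaviour with placeholder names - as noted above, it isn't guaranteed to carry over to UC unchanged:

# Give the database an explicit location...
spark.sql("CREATE DATABASE IF NOT EXISTS mydatabase LOCATION '/mnt/data/mydatabase/'")

# ...and saveAsTable() without an explicit path creates its tables under that location.
df = spark.range(10)
df.write.format("delta").mode("overwrite").saveAsTable("mydatabase.my_table")
# The files land under /mnt/data/mydatabase/my_table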
@andrewli7542 1 year ago
Something like df.write.option("path", "s3://some/path").saveAsTable("t") should work. Lineage will support both managed and external tables.
@alexanderowens4898 1 year ago
@andrewli7542 That was initially what I had tried, though we use an Azure mount path /mnt/account/somepath and I was receiving a "Missing Cloud file system scheme" error.
@andrewli7542 1 year ago
@alexanderowens4898 You can't use mounts with external locations for now. A full path with abfss:// is needed.
@alexanderowens4898 1 year ago
@andrewli7542 Great tip - this is what I ended up doing: .write.option("path", s"abfss://$container@$account.dfs.core.windows.net/$system/$table").saveAsTable(s"$container.$system.$table") and it worked great! Thank you very much.
@rakeshprasad4969 1 year ago
Is this managed identity for all the workspaces I have in my tenant, or for one specific one? If it's the latter, how do we know which workspace an MI belongs to?
@rakeshprasad4969 1 year ago
I should have watched a bit more - the step at 8:18 explains it. Thanks, good video. I think ADB should have this MI concept without Unity too; it's a good approach and puts it on par with ADF or any other Azure PaaS (from an identity perspective). Can anyone use the MIs that are in the catalog, or has ADB released grants for MIs too? I would like to control who/which code has access to which MI.
@AdvancingAnalytics 1 year ago
@rakeshprasad4969 Good question - I haven't looked at the permission model for who/what has access to the managed identity. I assume the access is delegated by the user who adds the credential to the workspace.
@t0nyc0llins 1 year ago
@rakeshprasad4969 @Advancing Analytics I'd really like to understand this too. If I create an MI with Contributor on the root, can all analysts use this connection to read/write anything in the lake? Hopefully I am missing something key.
@blackdeckerz0r 1 year ago
The only problem with abfss is that Python-only code like pandas and open() doesn't work with that path. We are currently migrating from mounting the storage account to abfss and ran into this limitation.
@AdvancingAnalytics 1 year ago
Unity Catalog is expecting you to work entirely through the registered SQL tables, so it's only with the initial table registration that you use the full abfss:// path... That said, it's harder to use pandas with the SQL tables ;)
@andrewli7542 1 year ago
For now, access to external location paths is limited to the Spark Dataset API or the dbutils.fs API. PySpark pandas APIs that use the Spark Dataset API under the hood should also work, e.g. ps.read_table().
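A short sketch of those two access paths inside a Databricks notebook, with placeholder table and path names:

import pyspark.pandas as ps

# pandas-on-Spark goes through the Spark Dataset API, so it works with UC-governed tables.
pdf = ps.read_table("demo_catalog.demo_schema.sales_ext")

# dbutils.fs can also list/read paths covered by a registered external location.
files = dbutils.fs.ls("abfss://lake@mystorageacct.dfs.core.windows.net/external/sales_ext")

# Plain pandas or open() against an abfss:// path won't work, as noted earlier in the thread.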
@t0nyc0llins 1 year ago
Can I ask how you now access files on your data lake with pandas etc.? We are just starting to move to Unity Catalog and I am unsure how we do this without mounting.
@mauromi68 1 year ago
Really a great video. I'm new to Databricks Unity Catalog and I tried to replicate these steps, but I still get the error "Error in SQL statement: UnauthorizedAccessException: PERMISSION_DENIED: request not authorized". It seems to me I did whatever I had to do:
- I created a Databricks access connector in Azure (which becomes a managed identity)
- I created an ADLS Gen2 storage account (data lake with hierarchical namespace) plus a container
- On my data lake container I assigned the Storage Blob Data Contributor role to the managed identity above
- I created a new Databricks Premium workspace
- I created a new metastore in Unity Catalog that "binds" the access connector to the data lake
- I bound the metastore to the Premium Databricks workspace
- I gave my Databricks user admin permission on the above Databricks workspace
- I created a new cluster in the same Premium workspace, choosing runtime 11.1 and "single user" access mode
- I ran the notebook, which correctly created a new catalog, assigned proper rights to it, created a schema, and confirmed that I am the owner of that schema
The only (but most important) SQL command of the same notebook that fails is the one that tries to create a managed Delta table and insert two records: CREATE TABLE IF NOT EXISTS quickstart_catalog_mauromi.quickstart_schema_mauromi.quickstart_table (columnA Int, columnB String) PARTITIONED BY (columnA); When I run it, it starts working and in fact it starts creating the folder structure for this Delta table in my storage account. However, it then fails with the following error: java.util.concurrent.ExecutionException: Failed to acquire a SAS token for list on /data/a3b9da69-d82a-4e0d-9015-51646a2a93fb/tables/eab1e2cc-1c0d-4ee4-9a57-18f17edcfabb/_delta_log due to java.util.concurrent.ExecutionException: com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: request not authorized
Please note that I didn't have any folder created under the "unity-catalog" container before running the table creation command. So it seems that it can successfully create the folder structure, but after it creates the "table" folder, it can't acquire the SAS token. I can't understand this, since I am an admin in this workspace, the Databricks managed identity is assigned the Storage Blob Data Contributor role on the storage container, and Databricks actually starts creating the other folders. What else should I configure?
@mauromi68 1 year ago
I found it: it is not enough to only assign the Storage Blob Data Contributor role to the Azure Databricks connector at container level. You also need to assign the same role to the same connector at STORAGE ACCOUNT level. I couldn't find this information in the documentation and I frankly can't understand why it is needed, since the Delta table path was created. However, this way it works.
@majetisaisowmya909 10 months ago
Can you please help me create a metastore, catalog, and table automatically using Python or PowerShell?
@AdvancingAnalytics 10 months ago
So you'll still need to throw that over to the SQL side of things, but you can build up the SQL string programmatically. For example, using Python inside a Databricks notebook you could use:
CatName = "myCatalog"
SQLString = f"CREATE CATALOG {CatName}"
spark.sql(SQLString)
That writes a SQL command for you, then executes it. You can use that in a loop to programmatically create schemas, tables, whatever you want!
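A slightly fuller sketch of that loop, with placeholder names - note the metastore itself is created at the account level (UI or account API) rather than through spark.sql:

# Placeholder structure: catalog -> schemas -> tables.
structure = {
    "sales_catalog": {
        "bronze": ["orders", "customers"],
        "silver": ["orders_clean"],
    },
}

for cat_name, schemas in structure.items():
    spark.sql(f"CREATE CATALOG IF NOT EXISTS {cat_name}")
    for schema_name, tables in schemas.items():
        spark.sql(f"CREATE SCHEMA IF NOT EXISTS {cat_name}.{schema_name}")
        for table_name in tables:
            # Build the SQL string in Python, then execute it - same idea as the reply above.
            spark.sql(
                f"CREATE TABLE IF NOT EXISTS {cat_name}.{schema_name}.{table_name} "
                f"(id INT, payload STRING)"
            )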