Set up a schema in Databricks with a location using a service principal


I am using Azure Databricks with a cluster (LTS 13.3) without credential passthrough, relying on service principals only.

The goal is to create a schema with a LOCATION option on ADLS (using the abfss protocol) so that I can create a Delta table that gets registered in the catalog while the underlying data is saved at the specified location.

I am able to read and write data to the ADLS storage account using a service principal as expected:

spark.conf.set("fs.azure.account.auth.type.MyStorageAcct.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.MyStorageAcct.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.MyStorageAcct.dfs.core.windows.net", "relevant_vals")
spark.conf.set("fs.azure.account.oauth2.client.secret.MyStorageAcct.dfs.core.windows.net", "relevant_vals")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.MyStorageAcct.dfs.core.windows.net", "https://login.microsoftonline.com/relevant_vals/oauth2/token")

However, when I try to create the schema:

%sql
CREATE SCHEMA my_new_schema LOCATION "abfss://<container>@MyStorageAcct.dfs.core.windows.net/my_new_schema.db"

I get this error:

AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.contracts.exceptions.KeyProviderException Failure to initialize configuration for storage account MyStorageAcct.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.key)

I do not have any issues reading or writing data to ADLS using the service principal, so it feels like the issue is with registering the schema/tables in the default metastore. If I create the schema with credential passthrough instead (my own ID has access), the schema is created successfully, but then the following write

df.write.format('delta')\
.mode("overwrite")\
.option("path","abfss://[email protected]/my_new_schema.db/table_test_copy_v20240307.delta")\
.saveAsTable('my_new_schema.table_test_copy_v20240307')

fails with:

Failure to initialize configuration for storage account MyStorageAcct.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.keyInvalid configuration value detected for fs.azure.account.key

Can someone please help me understand what the issue is and how I can register schemas and tables in the Databricks Hive metastore while specifying a location on ADLS using service principals? Thanks!


1 Answer

Answered by DileeprajnarayanThumula:

Regarding the error message:

Failure to initialize configuration for storage account MyStorageAcct.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.keyInvalid configuration value detected for fs.azure.account.key

When using the abfss protocol, you cannot access data directly with a storage account SAS token or a storage account access key. To use abfss, you must provide additional configuration options, as detailed in the documentation.

A storage access key is only appropriate with the wasbs protocol and is not advisable for ADLS Gen2. Additionally, if you have the necessary permissions, you can use a passthrough cluster to access the storage account.
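
For reference, the abfss configuration options referred to above are the OAuth/service-principal properties the asker already sets, one per property for the target account. A compact sketch with placeholder values (the storage account, client ID, secret, and tenant ID here are illustrative, not real):

# Placeholder values; substitute the real application (client) ID, secret, and tenant ID
storage_account = "<storage account name>"
oauth_configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
for key, value in oauth_configs.items():
    # Each property is scoped to the specific storage account endpoint
    spark.conf.set(f"{key}.{storage_account}.dfs.core.windows.net", value)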

To create a table in a blob folder and store it at an external location, follow these steps.

Set the Variables:

catalog_name = "test_catalog"
schema_name = "test_schema"
catalog_container = "<catalog container string>"
storage_account = "<storage account string>"
blob_path = f"abfss://{catalog_container}@{storage_account}.dfs.core.windows.net/<folder path>"

Create Catalog:

spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name} MANAGED LOCATION 'abfss://{catalog_container}@{storage_account}.dfs.core.windows.net/'")

Create Schema:

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema_name}")

References: MS Q&A

SO Link.