How can I deploy arbitrary files from an Azure git repo to a Databricks workspace?


Databricks recently added support for "files in repos", which is a neat feature. It gives our projects a lot more flexibility, since we can now add .json config files and even write custom Python modules that exist solely in our closed environment.

However, I just noticed that the standard way of deploying from an Azure git repo to a workspace does not support arbitrary files. First, all .py files are converted to notebooks, which breaks the custom modules we wrote for our project. Second, it intentionally skips files that do not end in one of the following: .scala, .py, .sql, .SQL, .r, .R, .ipynb, .html, .dbc, which means our .json config files are missing when the deployment is finished.

Is there any way to get around these issues or will we have to revert everything to use notebooks like we used to?

2 Answers

Alex Ott (accepted answer, 16 votes):

You need to stop doing deployment the old way, as it depends on the Workspace REST API, which doesn't support arbitrary files. Instead, keep a Git checkout in your destination workspace and update that checkout to a given branch or tag when doing a release. This can be done via the Repos API or the Databricks CLI. Here is an example of how to do that with the CLI from an Azure DevOps pipeline:

- script: |
    echo "Checking out the releases branch"
    databricks repos update --path $(STAGING_DIRECTORY) --branch "$(Build.SourceBranchName)"
  env:
    DATABRICKS_HOST: $(DATABRICKS_HOST)
    DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)
  displayName: 'Update Staging repository'
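The same branch switch can also be done through the Repos REST API directly (`PATCH /api/2.0/repos/{repo_id}`). Below is a minimal sketch using only the Python standard library; the `repo_id` value and the `build_repos_update_request` helper name are illustrative, and host/token are assumed to come from the same environment variables the pipeline uses:

```python
import json
import os
import urllib.request


def build_repos_update_request(host: str, token: str, repo_id: int, branch: str):
    """Build the PATCH request that points a workspace repo checkout at a branch."""
    url = f"{host}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="PATCH")
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Content-Type", "application/json")
    return req


if __name__ == "__main__":
    # repo_id 123 is a placeholder; look it up via GET /api/2.0/repos
    req = build_repos_update_request(
        os.environ["DATABRICKS_HOST"],
        os.environ["DATABRICKS_TOKEN"],
        repo_id=123,
        branch="releases",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)
```

The same call is available in the Databricks CLI as shown above; the raw request is useful when the CLI is not installed on the build agent.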
HeyWatchThis (0 votes):

I like the Repos API approach that Alex describes too, but in my case I found the Databricks SDK for Python to be another good alternative, because the project could not easily migrate to "Repos".

import configparser
from pathlib import Path
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ImportFormat

config_path = Path.home() / ".databrickscfg"
config = configparser.ConfigParser()
config.read(config_path)
available_profiles = config.sections()  # ["my_profile", ...]
my_profile = "my_profile"
assert my_profile in available_profiles

blob = "some file".encode("utf-8")
workspace_path = "/some/workspace/path/script.sh"

# Two approaches to build the client
client = WorkspaceClient(profile=my_profile)
# client = WorkspaceClient(
#     host=config[my_profile]["host"], token=config[my_profile]["token"])

client.workspace.upload(
    workspace_path, blob, format=ImportFormat.AUTO, overwrite=True)
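To deploy a whole folder of arbitrary files this way, you can walk the local tree and upload each file, mapping local paths to workspace paths. A small sketch; `to_workspace_path` is a hypothetical helper name, and the `conf` / `/Shared/conf` paths are only examples:

```python
from pathlib import Path


def to_workspace_path(local_root: Path, workspace_root: str, file: Path) -> str:
    """Map a local file path to its destination path in the workspace."""
    rel = file.relative_to(local_root).as_posix()
    return f"{workspace_root.rstrip('/')}/{rel}"


# Example loop, reusing the client from the snippet above:
# for f in Path("conf").rglob("*"):
#     if f.is_file():
#         client.workspace.upload(
#             to_workspace_path(Path("conf"), "/Shared/conf", f),
#             f.read_bytes(), format=ImportFormat.AUTO, overwrite=True)
```

Keeping the path mapping in a pure helper makes it easy to test without touching the workspace.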