Git clone to a Databricks Unity Catalog-enabled Volume


I'm migrating the existing Hive metastore tables in my Azure Databricks workspace to Unity Catalog (UC), and I ran into an issue with running git clone into a Volume.

My cluster configuration is something like:

  • DBR 13.3 LTS
  • Mode: Shared (UC enabled)

Previously, on my non-UC-enabled cluster, I would have a notebook cell like the following to git clone my repo to a tmp location:

!git clone https://[email protected]/repo_path /tmp/repo

But now, since my cluster is UC-enabled, I want to clone the repo inside a Volume, so that at the beginning of the notebook I can remove the repo directory with dbutils.fs.rm("/Volumes/catalogname/schemaname/volumename/tmp/repo", True) (which works), like the following:

!git clone https://[email protected]/repo_path /Volumes/catalogname/schemaname/volumename/tmp/repo

But the clone appears to get stuck at the "Resolving deltas" step.

Has anyone faced this issue and found a solution? I'm thinking maybe the git clone has to be done differently now, or, as a last option, I could put the git clone command in an init script and have the UC-enabled cluster run it at startup (see the sketch below).
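
For reference, that init-script idea would look roughly like the sketch below. This is untested: the repo URL and PAT are the placeholders from the question, the token should really come from a secret, git is assumed to be available on the node, and the /Volumes FUSE path may not be mounted yet while init scripts run, so this clones to local driver disk instead:

#!/bin/bash
# Hypothetical cluster-scoped init script (untested sketch).
# /Volumes may not be mounted yet at init time, so clone to local disk.
set -euo pipefail

TARGET=/tmp/repo
rm -rf "$TARGET"

# PAT is the placeholder from the question; prefer injecting it via a
# Databricks secret rather than hard-coding it.
git clone https://[email protected]/repo_path "$TARGET"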

1 Answer

Answered by Lucas Mengual (accepted):

Found a workaround which solves the issue initially posted. I modified a CI/CD Azure DevOps pipeline I already had running, which in my case runs on the same repository I need to clone, but it could also clone external repositories.

First, I included a new task in the build stage to copy the repository into a directory, so that the subsequent task publishes that directory as an artifact:

- script: | # Copy git repo to tmp repo directory
    mkdir -p $(Build.ArtifactStagingDirectory)/repo
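    # -mindepth/-maxdepth 1 makes find copy every top-level entry, dotfiles included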
    find $(Build.SourcesDirectory) -mindepth 1 -maxdepth 1 -exec cp -r {} $(Build.ArtifactStagingDirectory)/repo \;
  displayName: Copy repo         
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)'
    ArtifactName: 'my-artifact'
  displayName: Publish Artifact

Then, in the deploy stage (you need a download-artifact step too), I included an AzureFileCopy@5 task which copies that directory (i.e. my repository) into my ADLS (Azure Data Lake Storage) location. That is the same location my Databricks UC Volume points to, so I can see my repository in the UC Volume:

- task: DownloadBuildArtifacts@1
  displayName: Download Build Artifact
  inputs:
    artifactName: my-artifact
    downloadPath: '$(System.ArtifactsDirectory)'
- task: AzureFileCopy@5
  displayName: Copy repo to storage account
  inputs:
    SourcePath: $(System.ArtifactsDirectory)/my-artifact/repo
    azureSubscription: YourAzureSubscriptionName
    Destination: AzureBlob
    storage: YourADLSName
    ContainerName: YourADLSContainerName
    BlobPrefix: YourUCVolumeName/tmp
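    # with --recursive, the repo folder lands under YourUCVolumeName/tmp/repo in the container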
    AdditionalArgumentsForBlobCopy: |
      --recursive=true `
      --overwrite=true
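
After the pipeline runs, a quick sanity check from a notebook on the UC-enabled cluster (a sketch, assuming the Volume is an external volume backed by the same container path used in BlobPrefix above):

# List the copied repo in the UC Volume (path as in the question)
for f in dbutils.fs.ls("/Volumes/catalogname/schemaname/volumename/tmp/repo"):
    print(f.path)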