Clear Databricks Artifact Location


I am using the dbx CLI to deploy my workflow to Databricks. My .dbx/project.json is configured as below:

{
    "environments": {
        "default": {
            "profile": "test",
            "storage_type": "mlflow",
            "properties": {
                "workspace_directory": "/Shared/dbx/projects/test",
                "artifact_location": "dbfs:/dbx/test"
            }
        }
    },
    "inplace_jinja_support": false,
    "failsafe_cluster_reuse_with_assets": false,
    "context_based_upload_for_execute": false
}

Every time I run dbx deploy ..., it stores my task scripts in DBFS under a new hash-named folder. If I run dbx deploy ... 100 times, it creates 100 hash folders to store my artifacts.

Questions

  1. How do I clean up these folders?
  2. Is there a retention or rolling policy that keeps only the last X folders?
  3. Is there a way to reuse the same folder on every deploy?

A lot of folders are generated every time we run dbx deploy. We only need the latest one; the older folders are no longer needed.



2 Answers

jlim:

I finally found a way to remove the old DBFS files: I simply run dbfs rm -r dbfs:/dbx/test before running the deploy. This method is not ideal, because if a cluster is running (or pending start) against a previous deployment, its job will fail once the old hash folder is removed. Instead of depending on DBFS, I configured my workflow to use Git, so I can remove the DBFS data without worrying that any job is still using it. Strangely, dbx still generates a hash folder even though no artifacts are uploaded to DBFS when Git is used as the workspace.
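
For the retention question (keeping only the last X folders), a rolling cleanup can be scripted against the DBFS REST API instead of wiping the whole location. Below is a minimal, untested sketch: KEEP and ARTIFACT_ROOT are placeholders, DATABRICKS_HOST and DATABRICKS_TOKEN must be set, jq must be installed, and it assumes the DBFS list endpoint returns a modification_time field for each entry.

# Rolling cleanup sketch (untested): keep only the $KEEP newest
# hash folders under the artifact location and delete the rest.
# Note: artifact_location "dbfs:/dbx/test" maps to the REST path "/dbx/test".
KEEP=5
ARTIFACT_ROOT="/dbx/test"

curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "$DATABRICKS_HOST/api/2.0/dbfs/list?path=$ARTIFACT_ROOT" \
| jq -r --argjson keep "$KEEP" \
    '.files | sort_by(-.modification_time) | .[$keep:][].path' \
| while read -r old_folder; do
    # Recursively delete each folder older than the newest $KEEP.
    curl -s -X POST -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      "$DATABRICKS_HOST/api/2.0/dbfs/delete" \
      -d "{\"path\": \"$old_folder\", \"recursive\": true}"
  done

Since the hash folder names themselves are not time-ordered, sorting by modification time is the only reliable way to tell old deployments from new ones.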

renardeinside:

Author of dbx here.

There is a built-in command that cleans up the workspace and the artifact location:

dbx destroy ...

Please carefully read the documentation before running this command.
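
A minimal usage sketch (the --dry-run flag is taken from the dbx documentation; run dbx destroy --help to confirm which flags your dbx version supports):

# Preview what would be removed, then actually remove the
# workspace directory and artifact location for this project.
dbx destroy --dry-run
dbx destroy

Note that this removes the deployment's assets entirely; it is a cleanup for the whole project rather than a per-deploy retention policy.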