I am using the dbx CLI to deploy my workflow to Databricks. I have .dbx/project.json configured as below:
{
    "environments": {
        "default": {
            "profile": "test",
            "storage_type": "mlflow",
            "properties": {
                "workspace_directory": "/Shared/dbx/projects/test",
                "artifact_location": "dbfs:/dbx/test"
            }
        }
    },
    "inplace_jinja_support": false,
    "failsafe_cluster_reuse_with_assets": false,
    "context_based_upload_for_execute": false
}
Every time I run dbx deploy ..., it stores my task scripts in DBFS under a new hash-named folder. If I run dbx deploy ... 100 times, it creates 100 hash folders to store my artifacts.
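For context, the scripts that end up in those folders are the local file:// references from the dbx deployment file; every deploy uploads them into a fresh, uniquely named folder under artifact_location. A minimal sketch of such a deployment file, assuming a conf/deployment.yml where the workflow name, cluster settings, and paths are placeholders:

    environments:
      default:
        workflows:
          - name: "test-workflow"
            job_clusters:
              - job_cluster_key: "default"
                new_cluster:
                  spark_version: "11.3.x-scala2.12"
                  node_type_id: "i3.xlarge"
                  num_workers: 1
            tasks:
              - task_key: "main"
                job_cluster_key: "default"
                # local file reference; dbx uploads it to artifact_location on every deploy
                spark_python_task:
                  python_file: "file://tasks/main_task.py"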
Questions
- How do I clean up the folders?
- Is there any retention or rolling policy that keeps only the last X folders?
- Is there a way to reuse the same folder every time we deploy?
As you can see, a lot of folders are generated whenever we run dbx deploy. We only want to use the latest one; the older ones are not needed any more.
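To see how many of these folders have accumulated, the same databricks-cli dbfs command used for cleanup below can list them; a small sketch, assuming the artifact location from project.json above:

    # list the per-deploy hash folders under the artifact location
    dbfs ls dbfs:/dbx/test

    # count them
    dbfs ls dbfs:/dbx/test | wc -l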
I finally found a way to remove the old DBFS files. I simply run

dbfs rm -r dbfs:/dbx/test

before running dbx deploy. This method is not ideal, because if a cluster is running or pending start against a previous deployment, the job will fail once its hash folder is removed. Instead of depending on DBFS, I have configured my workflow to use Git as the source, so I can remove the DBFS data without worrying that any job is still using it. Strangely, Databricks still generates a hash folder even though no artifacts are uploaded to DBFS when Git is used as the source.
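For reference, a rough sketch of what the Git-based workflow definition can look like in the dbx deployment file. It assumes a dbx version that passes the Jobs 2.1 git_source block through to the Jobs API; the workflow name, repository URL, branch, and notebook path below are placeholders:

    environments:
      default:
        workflows:
          - name: "test-workflow-from-git"
            # the task code is checked out from the repository at run time,
            # so no task scripts need to live in DBFS
            git_source:
              git_url: "https://github.com/my-org/my-repo"
              git_provider: "gitHub"
              git_branch: "main"
            job_clusters:
              - job_cluster_key: "default"
                new_cluster:
                  spark_version: "11.3.x-scala2.12"
                  node_type_id: "i3.xlarge"
                  num_workers: 1
            tasks:
              - task_key: "main"
                job_cluster_key: "default"
                notebook_task:
                  notebook_path: "notebooks/main"
                  source: "GIT"

With this in place, dbfs rm -r dbfs:/dbx/test can be run safely, because running jobs read their code from the repository rather than from the old hash folders.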