Upload configuration files as artifacts or assets for a Python wheel task depending on the target


I could not find any answers in the Databricks documentation or in the current databricks-cli repository, and I faced a problem during my migration from the dbx setup. The migration example in the documentation is quite reduced and does not cover other aspects of the deployment, such as parameter files for the jobs.

My use case is the bundle deployment of a Python wheel job with parameters passed as a file.

# The main job for package_name
artifacts:
  package_wheel:
    build: poetry build
    path: ..
    type: whl
  config_file:
    build: echo Under build
    files:
      - source: test_conf.yaml
    path: ../conf
    type: yaml

resources:
  jobs:
    package_name_job:
      name: package_name_job

      schedule:
        quartz_cron_expression: '44 37 8 * * ?'
        timezone_id: Europe/Amsterdam

      email_notifications:
        on_failure:
          - [email protected]

      tasks:
        - task_key: main_task
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: package_name
            entry_point: main
            parameters: ["--conf-file", config_file]
          libraries:
            # By default we just include the .whl file generated for the package_name package.
            # See https://docs.databricks.com/dev-tools/bundles/library-dependencies.html
            # for more information on how to add other libraries.
            - whl: ../dist/*.whl

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: n1-standard-4
            autoscale:
                min_workers: 1
                max_workers: 4

I just want to configure the job to deploy with different config files depending on the targets described in my databricks.yaml file. However, I am not able to make the Databricks CLI automatically recognize those files as artifacts and upload them to the `.bundle/[package_name]/[target]/files` path, the way the built wheel is copied/uploaded to `.bundle/[package_name]/[target]/artifacts`.

I tried to define the config file as an artifact and use the reference, but it does not work.

# The main job for package_name
artifacts:
  ...
  config_file:
    build: echo Under build
    files:
      - source: test_conf.yaml
    path: ../conf
    type: yaml

resources:
  jobs:
    package_name_job:
      name: package_name_job

      ...

      tasks:
        - task_key: main_task
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: package_name
            entry_point: main
            parameters: ["--conf-file", ${artifacts.config_file}] # <-- Reference as in terraform?
      ...
1 Answer

Answered by nenetto:

I figured it out ✅

The trick was to use the sync configuration parameter inside the targets definition in the databricks.yml bundle definition file.

Just to clarify, my Python package is handled by Poetry, so I have this project structure:

[project]
  |
  |- pyproject.toml
  |- databricks.yml
  |- [resources]
  |   |- package_name_job.yml
  |- [conf]
  |   |- [dev] # The same as `target`
  |       |- test_conf.yaml
  |- [src]
  |   |- [python_package_name]
  |       |- __init__.py
  |       |- main.py

databricks.yml

bundle:
  name: package_name

include:
  - resources/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://myhost.databricks.com
      profile: develop
    sync:
      include: # This makes the "conf" folder part of the files synced with the bundle, uploaded by default to `${workspace.root_path}/files/conf`
      - conf/
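
The same pattern extends to additional targets: repeat the sync block under each one. A minimal sketch, assuming a hypothetical prod target (the host and profile values are placeholders) with its own conf/prod/test_conf.yaml file:

targets:
  ...
  prod:
    mode: production
    workspace:
      host: https://myhost.databricks.com # placeholder: your production workspace host
      profile: production # placeholder: a CLI profile configured for that workspace
    sync:
      include: # Same as for dev: sync the "conf" folder into the bundle files path
      - conf/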

Then, I use the reference in the job definition

package_name_job.yml

# This allows me to build the Poetry package into the dist folder as the artifact.
artifacts:
  package_wheel:
    build: poetry build
    path: ..
    type: whl

resources:
  jobs:
    package_name_job:
      name: package_name_job

      tasks:
        - task_key: main_task
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: package_name
            entry_point: main
            parameters:
            # ${bundle.target} == dev (for this example) 
              - "--conf-file"
              - ${workspace.root_path}/files/conf/${bundle.target}/test_conf.yaml
          libraries:
            - whl: ../dist/*.whl

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: n1-standard-4
            autoscale:
                min_workers: 1
                max_workers: 4

By creating a new folder with the same configuration file name for each target, I can deploy the job with a different configuration file depending on the target while keeping the same logic, thus modifying only configuration, not source code.
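
For example, assuming the hypothetical prod target sketched above, the conf folder would look like this:

[conf]
  |- [dev]
  |   |- test_conf.yaml
  |- [prod]
  |   |- test_conf.yaml

Running `databricks bundle deploy -t dev` or `databricks bundle deploy -t prod` then resolves ${bundle.target} to the corresponding folder, so each deployment picks up its own configuration file.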