How can I import a local module using Databricks asset bundles?

I want to do something pretty simple here: import a module from the local filesystem using databricks asset bundles. These are the relevant files:

databricks.yml

bundle:
  name: my_bundle

workspace:
  host: XXX

targets:
  dev:
    mode: development
    default: true

resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: my_task
          existing_cluster_id: YYY
          spark_python_task:
            python_file: src/jobs/bronze/my_script.py

my_script.py

from src.jobs.common import *

if __name__ == "__main__":
    hello_world()

common.py

def hello_world():
    print("hello_world")

And the following folder structure:

databricks.yml
src/
├── __init__.py
└── jobs
    ├── __init__.py
    ├── bronze
    │   └── my_script.py
    └── common.py

I'm deploying this to my workspace and running it with Databricks CLI v0.206.0, using the following commands:

databricks bundle validate
databricks bundle deploy
databricks bundle run my_job

I'm having trouble importing my common.py module: I'm getting the classic ModuleNotFoundError: No module named 'src' error.

I've added the __init__.py files as I typically do when doing this locally, and tried the following variations:

from src.jobs.common import *
from jobs.common import *
from common import *
from ..common import *
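
For debugging, I guess I could print a few things at the top of my_script.py to see what the interpreter is actually working with:

import sys
import os

# Where is the script running from, and what is on the import path?
print("cwd:", os.getcwd())
print("script dir:", os.path.dirname(os.path.abspath(__file__)))
print("sys.path:", sys.path)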

I guess my issue is that I don't really know what the Python path is here, since I'm deploying it on Databricks. How can I do something like this using Databricks asset bundles?

There are 2 answers

Answer by kries (best answer)

I recently ran into a similar issue, albeit with notebook tasks, and came to the following resolution, adapted to your example file structure:

In your databricks.yml file, pass an argument to your script via parameters:

databricks.yml

    resources:
      jobs:
        my_job:
          name: my_job
          tasks:
            - task_key: my_task
              existing_cluster_id: YYY
              spark_python_task:
                python_file: src/jobs/bronze/my_script.py
                parameters: ['/Workspace/${workspace.file_path}']

my_script.py

    import sys

    # The first job parameter is the path to the bundle's deployed files in the workspace.
    # Adding it to sys.path makes the src package importable.
    bundle_src_path = sys.argv[1]
    sys.path.append(bundle_src_path)

    from src.jobs.common import *

Caveats:

  1. As mentioned, I am using a notebook_task, which lets me pass in parameters that I read using dbutils (see the sketch after this list). I haven't tested passing parameters to spark_python_task as above, but it looks similar and may at least be enough to get you into a working state. Databricks API reference
  2. The sys path technique is recommended by Databricks docs, though you may need to approach it differently based on your runtime version.
  3. This works well for non-production deployment targets. The parameter passed in for production would likely need to be modified (though I haven't made it that far myself yet)!
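
For reference, here is a minimal sketch of the notebook_task variant described in caveat 1. The notebook path and parameter name below (src/jobs/bronze/my_notebook, bundle_src_path) are hypothetical, and the parameter value mirrors the one used above:

    resources:
      jobs:
        my_job:
          name: my_job
          tasks:
            - task_key: my_task
              existing_cluster_id: YYY
              notebook_task:
                notebook_path: src/jobs/bronze/my_notebook
                base_parameters:
                  bundle_src_path: /Workspace/${workspace.file_path}

Then, in the first cell of the notebook:

    import sys

    # base_parameters show up as notebook widgets; read the bundle root and add it to sys.path
    bundle_src_path = dbutils.widgets.get("bundle_src_path")
    sys.path.append(bundle_src_path)

    from src.jobs.common import hello_world
    hello_world()
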
Answer by moro clash

You can also use this snippet at the beginning of your notebook:

import os
import sys

# Get the directory containing the currently running notebook on the workspace filesystem
notebook_path = '/Workspace' + os.path.dirname(
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
)
sys.path.append(notebook_path)
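
That puts the notebook's own directory on sys.path. For the folder structure in the question (the script under src/jobs/bronze and common.py under src/jobs), you would want the bundle root on the path instead, so that the src.jobs.common import resolves; a sketch, reusing the notebook_path variable from the snippet above:

# Walk up from src/jobs/bronze to the bundle root (the parent of src)
bundle_root = os.path.abspath(os.path.join(notebook_path, "..", "..", ".."))
sys.path.append(bundle_root)

from src.jobs.common import hello_world
hello_world()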