Databricks Import/Copy Data from python lib inside repo


I am facing a small challenge while trying to implement a solution using the new Repos functionality of Databricks. I am working on an interdisciplinary project that needs to use both Python and PySpark code. The Python team has already built some libraries which the PySpark team now also wants to use (e.g. for preprocessing etc.). We thought that using the new Repos feature would be a good compromise for easy collaboration. Therefore, we have added the ## Databricks notebook source header to all library files so that they can easily be changed in Databricks (since the Python development isn't finished yet, the code will also be changed by the PySpark team). Unfortunately, we run into trouble "importing" the library modules in a Databricks workspace directly from the repo.

Let me explain our problem with a simple example:

Let this be module_a.py

## Databricks notebook source
def function_a(test):
    pass

And this module_b.py

## Databricks notebook source
import module_a as a
def function_b(test):
    a.function_a(test)
...

The issue is that the only way to import these modules directly in Databricks is to use

%run module_a

%run module_b

which will fail, since module_b is trying to import module_a, which is not on the Python path.

My idea was to copy the module_a.py and module_b.py files to DBFS or the local FileStore and then add that path to the Python path using sys.path.append(). Unfortunately, I didn't find any possibility to access the files from the repo via some magic command in Databricks in order to copy them to the FileStore.
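To make the idea more concrete, this is roughly what I was hoping to write in a notebook (the source path is exactly the part I don't know how to get, and /FileStore/libs is just an example target directory):

# copy the library files out of the repo – the source path is the unknown part
dbutils.fs.cp("file:/<path-to-repo>/module_a.py", "dbfs:/FileStore/libs/module_a.py")
dbutils.fs.cp("file:/<path-to-repo>/module_b.py", "dbfs:/FileStore/libs/module_b.py")

# make the copied files importable through the normal Python import machinery
import sys
sys.path.append("/dbfs/FileStore/libs")

import module_b as b
b.function_b("test")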

(I do not want to clone the repo, since then I would need to push my changes every time before re-executing the code.)

Is there a way to access the repo directory somehow from a notebook itself, so that I can copy the files to DBFS/FileStore?

Is there another way to import the functions correctly? (Installing the repo as a library on the cluster is not an option, since the library will be changed by the developers during the process.)

Thanks!


There is 1 answer

Answered by Alex Ott:

This functionality isn't available on Databricks yet. When you work with notebooks in the Databricks UI, you work with objects located in the so-called control plane, which is part of the Databricks cloud, while code that should be importable as a Python package needs to be in the data plane, which is part of the customer's cloud (see this answer for more details).

Usually people split the code into notebooks that are used as glue between configuration and business logic, and libraries that contain the data transformations, etc. But libraries need to be installed onto clusters, and are usually developed separately from notebooks (there are some tools that help with that, like cicd-templates).
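As a rough sketch of that split (all names here are made up for illustration), the library part contains the plain transformation code and the notebook stays a thin glue layer:

# mylib/transforms.py – lives in a separate library repo, built into a wheel and installed on the cluster
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def with_doubled_column(df: DataFrame, col: str) -> DataFrame:
    # example transformation: add a column containing twice the value of `col`
    return df.withColumn(f"{col}_doubled", F.col(col) * 2)


# notebook – only configuration and orchestration
from mylib.transforms import with_doubled_column

df = spark.read.table("some_input_table")
result = with_doubled_column(df, "value")
result.write.mode("overwrite").saveAsTable("some_output_table")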

There is also the libify package that tries to emulate Python packages on top of Databricks notebooks, but it's not supported by Databricks, and I don't have personal experience with it.

P.S. I'll pass this feedback to development team.

Update, Feb 2023: since late 2021 there is functionality called "Files in Repos" that allows you to import Python or R files (not notebooks) as packages. See the demo here.
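With Files in Repos enabled, the example from the question works roughly like this (module_a.py and module_b.py are kept as plain Python files without the notebook header; the repo path below is only an example, and on newer runtimes the repo root is already on sys.path for notebooks inside the repo):

# notebook inside the same repo
import sys
sys.path.append("/Workspace/Repos/<user>/<repo>")  # adjust to your repo location; often not needed on newer runtimes

import module_b as b
b.function_b("test")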