OS Error When Trying to Import an Installed Python Wheel Package on Azure Databricks


I built a wheel package called my_sdk.whl from code that I developed locally. The package is meant for data transformations specific to a single project. The goal is to make it easier to run unit tests on these transformer functions in a CI/CD pipeline without a dependency on Databricks (i.e. dbutils).

I uploaded it to the Databricks File System at dbfs:/libraries/my_sdk.whl and installed it on my interactive cluster using the Libraries tab on the Compute page. After restarting the cluster and seeing a successful installation, I tried to use it in a Databricks Repos notebook:

import my_sdk

Executing the above code takes 10-20 minutes, with the cell sitting in the "Running command..." status.

After that, I get the following error:

OSError: [Errno 5] Input/output error: '/Workspace/Repos/my-userxxx/path/to/notebooks'
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
File <command-3809752991307962>, line 1
----> 1 import my_sdk

File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1002, in _find_and_load_unlocked(name, import_)

File <frozen importlib._bootstrap>:945, in _find_spec(name, path, target)

File <frozen importlib._bootstrap_external>:1439, in find_spec(cls, fullname, path, target)

File <frozen importlib._bootstrap_external>:1411, in _get_spec(cls, fullname, path, target)

File <frozen importlib._bootstrap_external>:1548, in find_spec(self, fullname, target)

File <frozen importlib._bootstrap_external>:1591, in _fill_cache(self)

OSError: [Errno 5] Input/output error: '/Workspace/Repos/my-userxxx/path/to/notebooks'
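The last frame in the traceback, `_fill_cache`, is where the import system lists the contents of a directory on `sys.path`, so the error points at a directory listing rather than at the wheel itself. A minimal diagnostic sketch, assuming the slowness comes from one of the `sys.path` entries: it times `os.listdir()` on each entry and records any `OSError`. The helper name `time_path_entries` is hypothetical and not part of any Databricks API.

```python
import os
import sys
import time


def time_path_entries(paths):
    """Time os.listdir() on each entry in `paths`.

    Returns a list of (path, seconds, error) tuples, where error is None
    on success or the OSError message on failure.
    """
    results = []
    for entry in paths:
        # An empty sys.path entry means the current working directory.
        target = entry or os.getcwd()
        start = time.perf_counter()
        try:
            os.listdir(target)
            error = None
        except OSError as exc:
            # Non-directories (e.g. zip files) and I/O failures land here.
            error = str(exc)
        results.append((target, time.perf_counter() - start, error))
    return results


if __name__ == "__main__":
    # Slowest entries first, so a problematic directory shows up at the top.
    ranked = sorted(time_path_entries(sys.path), key=lambda r: r[1], reverse=True)
    for path, seconds, error in ranked:
        print(f"{seconds:8.3f}s  {path}  {error or 'ok'}")
```

Run in a notebook cell under Repos, this would show whether the `/Workspace/Repos/...` entry is the one that is slow or failing to list.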

Any idea why I am getting this?

Additional Info:

  • The import takes a very long time to execute. Sometimes it eventually succeeds, and sometimes it fails with the error above.

  • One thing I noticed: when I run a notebook outside the Repos folder with a single cell containing import my_sdk, the package imports without any issues. I believe this has to do with library precedence as described in the Microsoft documentation, where second in the order is "Libraries in the Repo (Git folder) root directory (Repos only)". The root folder of my repo contains both ADF resources and Databricks resources, which could be why Databricks takes so long to search for a matching Python package.

  • After fixing this, I will run the notebook on a Job Cluster and orchestrate it using Azure Databricks.

  • I am using Windows 10 and Python 3.10.11 to compile the wheel package.

  • The command I used to compile the wheel package is python -m build --wheel

  • My interactive cluster uses Databricks Runtime 13.3 and is set to terminate after 20 minutes.

  • The setup.py file contains the following:

"""Setup.py script for packaging project."""

from setuptools import setup, find_packages

import os


def read_pip_requirements(filename: str):
    filepath = os.path.join(os.path.dirname(__file__), filename)
    with open(filepath) as f:
        return f.readlines()


if __name__ == '__main__':
    sdk_version = os.environ.get("BUILD_NUMBER")

    if sdk_version is None:
        raise ValueError("SDK version cannot be null. Did you initialize the BUILD_NUMBER variable?")

    setup(
        name="my_sdk",
        version=sdk_version,
        package_dir={"": "src"},
        packages=find_packages(where="src", include=["my_sdk*"]),
        description="Software Development Kit for My Project",
        install_requires=["pyspark==3.4.1"]
    )

I tried the following, but the import cell still runs for a long time and then randomly ends in either the OSError or a successful import.

  • Installing the package in a virtual environment with pip install my_sdk.whl and using its modules in a local PySpark application. Everything works perfectly.
  • Running %pip freeze, which shows that the package was installed @ file:///local_disk0/tmp/addedFile375359a4e2e749fba4206df7c97999b07096403526362698460/my_sdk-10003-py3-none-any.whl
  • Opening the web terminal and running python to check whether I can import my_sdk. The import is very fast and works without any issue.
  • Restarting the interactive cluster and running the notebook again
  • Spinning up a job cluster from an ADF Databricks Notebook activity with the wheel package configured under Append libraries
  • Using an older Databricks Runtime version
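One further check that could test the library-precedence hypothesis is to import the package with the Repos directories temporarily removed from `sys.path`. The sketch below assumes those entries appear as absolute paths starting with `/Workspace/Repos`, as in the traceback. An empty `""` entry that resolves to the current directory would not be filtered. The helper names `without_path_prefix` and `import_skipping` are hypothetical.

```python
import contextlib
import importlib
import sys


@contextlib.contextmanager
def without_path_prefix(prefix):
    """Temporarily drop sys.path entries that start with `prefix`."""
    original = list(sys.path)
    sys.path[:] = [p for p in original if not p.startswith(prefix)]
    try:
        yield
    finally:
        # Restore the original search path, even if the import fails.
        sys.path[:] = original


def import_skipping(prefix, module_name):
    """Import `module_name` while ignoring sys.path entries under `prefix`."""
    with without_path_prefix(prefix):
        # Clear cached directory listings held by the path finders.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

In a Repos notebook this might be used as `my_sdk = import_skipping("/Workspace/Repos", "my_sdk")`. If the import then returns quickly, that would support the idea that scanning the Repos directory is the slow step. It is a diagnostic only, not a confirmed fix.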