Multiprocessing with papermill inside a Docker container


The problem I am facing is the following:

This is the setup:

  • A Docker container with a base image of python:3.7.8-stretch
  • A local environment on Ubuntu 20.04

My goal is to execute multiple Jupyter notebooks in parallel inside a Docker container, using Python and the papermill package.

The Dockerfile:

FROM python:3.7.8-stretch

WORKDIR /opt/app

COPY requirements.txt /opt/app/requirements.txt
RUN python3.7 -m pip install -r requirements.txt

COPY src/* /opt/app/
COPY src/notebooks /opt/app/notebooks

CMD ["python", "main.py"]
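
The image is built the usual way from the project root (the tag matches the one in the run command further down):

```shell
# Build from the directory containing the Dockerfile, requirements.txt and src/
docker build -t rental-metrics-aggs:latest .
```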

The code is right here:

import os
import time
import multiprocess as mp
import papermill as pm


def callback(result):
    print(f"{result} end successfully.")


def main(*args, **kwargs):
    print(kwargs)
    print(args)
    result = pm.execute_notebook(**kwargs)
    return result


if __name__ == "__main__":
    package_name = os.getenv("PACKAGE_FOLDER_NAME")
    folder_name = f"notebooks/{package_name}"
    pool = mp.Pool(processes=2)

    # for every file in the folder, we are going to execute the notebook.
    # one nice side effect: you can choose which notebooks inside a package
    # folder run by only specifying parameters for the ones you want.
    notebook_parameters = eval(os.getenv("NOTEBOOK_PARAMETERS"))
    start_time = time.time()
    for filename in os.listdir(folder_name):
        if filename in notebook_parameters:
            output_filename = filename.replace(".ipynb", "_output.ipynb")
            result = pool.apply_async(
                main,
                kwds={"input_path": folder_name + "/" + filename,
                      "output_path": folder_name + "/output_notebooks" + "/" + output_filename,
                      "parameters": notebook_parameters[filename],
                      "log_output": True},
                callback=callback)
            print(f"notebook {filename} has started.")
            # this prevents race conditions while creating the jupyter notebook kernels.
            time.sleep(2)
    pool.close()
    pool.join()
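
For clarity, NOTEBOOK_PARAMETERS is a Python dict literal mapping notebook filenames to their parameter dicts. A small sketch of its format (the notebook names here are placeholders, not my real ones), parsed with ast.literal_eval, which accepts the same dict literal as eval but refuses to execute arbitrary expressions:

```python
import ast

# Example value of the NOTEBOOK_PARAMETERS environment variable
# (placeholder filenames and parameters).
raw = "{'aggregate.ipynb': {'start_date': '2021-01-01', 'end_date': '2021-01-31'}}"

# ast.literal_eval is a safer drop-in for eval on literal values.
notebook_parameters = ast.literal_eval(raw)
print(notebook_parameters["aggregate.ipynb"]["start_date"])  # -> 2021-01-01
```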

And here is the docker command I use to run the container (I didn't specify all env variables because I lost my terminal history, sorry):

docker run -e ${environment_variable} rental-metrics-aggs:latest
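
A hypothetical full invocation, reconstructed from the variables the script reads (the values here are made up, not my actual command), would look like:

```shell
docker run \
  -e PACKAGE_FOLDER_NAME="my_package" \
  -e NOTEBOOK_PARAMETERS="{'aggregate.ipynb': {'start_date': '2021-01-01'}}" \
  rental-metrics-aggs:latest
```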

I have tried a few alternatives, such as the concurrent.futures package, but even though the notebook execution finishes, it throws an error at the end.
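
The concurrent.futures variant I tried looked roughly like this (reconstructed from memory; `run_notebook` stands in for the real worker, which called `pm.execute_notebook(**kwargs)`, and the paths are placeholders):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_notebook(kwargs):
    # In the real code this called pm.execute_notebook(**kwargs) and
    # returned its result; here it just echoes the input path.
    return kwargs["input_path"]


if __name__ == "__main__":
    jobs = [
        {"input_path": "notebooks/a.ipynb",
         "output_path": "notebooks/output_notebooks/a_output.ipynb"},
        {"input_path": "notebooks/b.ipynb",
         "output_path": "notebooks/output_notebooks/b_output.ipynb"},
    ]
    # Two worker processes, mirroring mp.Pool(processes=2) above.
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(run_notebook, job) for job in jobs]
        for future in as_completed(futures):
            print(f"{future.result()} end successfully.")
```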

The odd part about this code is that it finishes executing all the notebooks when I run it locally (launching the script from both PyCharm and the terminal), but not inside the Docker container: the processes keep hanging even though the notebooks have finished, which I confirmed because I can see their results in API calls.

So, if there is anything I could do to improve the question, please let me know.


There are 0 answers