My setup is the following:
- A Docker container based on the python:3.7.8-stretch image
- A local environment on Ubuntu 20.04
My goal is to launch multiple Jupyter notebooks in parallel inside a Docker container, using Python and the papermill package.
The Dockerfile:
FROM python:3.7.8-stretch
WORKDIR /opt/app
COPY requirements.txt /opt/app/requirements.txt
RUN python3.7 -m pip install -r requirements.txt
COPY src/* /opt/app/
COPY src/notebooks /opt/app/notebooks
CMD ["python", "main.py"]
This is main.py, the script the container runs:
import os
import time

import multiprocess as mp
import papermill as pm


def callback(result):
    print(f"{result} end successfully.")


def main(*args, **kwargs):
    print(kwargs)
    print(args)
    result = pm.execute_notebook(**kwargs)
    return result


if __name__ == "__main__":
    package_name = os.getenv("PACKAGE_FOLDER_NAME")
    folder_name = f"notebooks/{package_name}"
    pool = mp.Pool(processes=2)
    # For every file in the folder, we execute the notebook.
    # A nice side effect: you can choose which notebooks run inside a package
    # folder by specifying parameters only for those notebooks.
    notebook_parameters = eval(os.getenv("NOTEBOOK_PARAMETERS"))
    start_time = time.time()
    for filename in os.listdir(folder_name):
        if filename in notebook_parameters.keys():
            output_filename = filename.replace(".ipynb", "_output.ipynb")
            result = pool.apply_async(
                main,
                kwds={"input_path": folder_name + "/" + filename,
                      "output_path": folder_name + "/output_notebooks" + "/" + output_filename,
                      "parameters": notebook_parameters[filename],
                      "log_output": True},
                callback=callback)
            print(f"notebook {filename} has started.")
            # This prevents race conditions while creating the Jupyter notebook kernels.
            time.sleep(2)
    pool.close()
    pool.join()
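For clarity, NOTEBOOK_PARAMETERS holds a dict keyed by notebook filename, and each value is the parameter dict passed to papermill. Below is a minimal sketch of how that variable could be parsed with ast.literal_eval instead of eval. The helper name load_notebook_parameters and the example value are hypothetical, used only for illustration; the real values are not shown in this question.

```python
import ast
import os


def load_notebook_parameters(env_var="NOTEBOOK_PARAMETERS"):
    """Parse a dict literal from the environment without executing code."""
    raw = os.getenv(env_var)
    if raw is None:
        raise RuntimeError(f"{env_var} is not set")
    # literal_eval only accepts Python literals (dicts, lists, strings,
    # numbers, ...); anything else raises ValueError instead of being run.
    parameters = ast.literal_eval(raw)
    if not isinstance(parameters, dict):
        raise TypeError(f"{env_var} must be a dict keyed by notebook filename")
    return parameters


if __name__ == "__main__":
    # Hypothetical value, for illustration only.
    os.environ["NOTEBOOK_PARAMETERS"] = "{'daily.ipynb': {'days': 7}}"
    print(load_notebook_parameters())
```

This is probably unrelated to the hang itself, but it fails early with a clear message when the variable is missing from the docker run command.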
Here is the docker command I use to run the container (I did not include all the environment variables because I lost my terminal history, sorry):
docker run -e ${environment_variable} rental-metrics-aggs:latest
I have tried a few things, such as the concurrent package, but even though the notebook execution finishes, it throws an error at the end.
The odd part is that the script finishes executing the notebooks when I run it locally (I launched it from both PyCharm and the terminal), but not in the Docker container: there, the processes keep hanging even though the notebooks have finished. I confirmed that they finished because I can see the results of the API calls they make.
If there is anything I could improve in the question, please let me know.
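One way the hang could be narrowed down is to keep the AsyncResult handles and call get() with a timeout, so that a stuck worker shows up by name and any exception raised in a worker is re-raised in the parent. The sketch below is only an illustration under assumptions: collect_results and fake_notebook are hypothetical names, fake_notebook stands in for pm.execute_notebook, and ThreadPool is used only so the example runs anywhere (it exposes the same apply_async / AsyncResult interface as a process pool).

```python
import time
from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool


def collect_results(pending, timeout_seconds):
    """Split {name: AsyncResult} into finished results and timed-out names."""
    finished, stuck = {}, []
    for name, async_result in pending.items():
        try:
            # get() re-raises exceptions from the worker, which are easy to
            # miss when only a success callback is registered.
            finished[name] = async_result.get(timeout=timeout_seconds)
        except TimeoutError:
            stuck.append(name)
    return finished, stuck


def fake_notebook(name, seconds):
    # Hypothetical stand-in for pm.execute_notebook(...).
    time.sleep(seconds)
    return name


def demo():
    pool = ThreadPool(processes=2)
    pending = {
        "fast.ipynb": pool.apply_async(fake_notebook, ("fast.ipynb", 0.05)),
        "slow.ipynb": pool.apply_async(fake_notebook, ("slow.ipynb", 1.0)),
    }
    finished, stuck = collect_results(pending, timeout_seconds=0.3)
    pool.close()
    pool.join()
    return finished, stuck


if __name__ == "__main__":
    print(demo())
```

In the real script the timeout would have to be longer than the slowest notebook; the names left in the second list are the ones whose workers never returned.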