Problem with Python environment and Slurm (srun/sbatch)


I'm running into a problem after setting up a virtual environment on Ubuntu with "virtualenv --system-site-packages myenv" and then trying to run my Python script with Slurm (srun/sbatch).

Although I have run my code without problems in the past, I am now getting a "ModuleNotFoundError" when trying to run it with my environment activated (source ./myenv/bin/activate).

I noticed that although "python foo.py" runs normally with my current environment activated, "srun python foo.py" fails. In fact, by printing sys.version, I can see that the Python version running under srun is different from the one used by the plain python command, which tells me that the environment is changed (and thus my packages cannot be found). "srun python --version" also confirms this.
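The mismatch described above can be checked with a few shell commands run with the venv activated (a minimal diagnostic sketch; it assumes access to a Slurm cluster):

```shell
# Which interpreter does the shell resolve locally vs. inside a Slurm step?
which python
python -c 'import sys; print(sys.version)'

# The same commands under srun; if the paths or versions differ,
# the job is not inheriting the activated environment.
srun which python
srun python -c 'import sys; print(sys.version)'
```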

Has anyone had a similar problem?

Thanks


There are 2 answers

Marcus Boden

The Python environment is set via environment variables, and Slurm does not always carry your current environment into your job. You can control this with the --export option, e.g. --export=ALL. This should be the default if nothing is specified, but your admins might have changed it via specific Slurm environment variables.
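For example, the current environment (including PATH and VIRTUAL_ENV, which the activated venv sets) can be forwarded explicitly (a sketch; foo.py stands in for your script):

```shell
# Explicitly propagate the caller's full environment into the job step
srun --export=ALL python foo.py
```

The same option works as an sbatch directive (#SBATCH --export=ALL) inside a job script.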

Another way around this would be to load the virtual environment in your jobscripts, if you use sbatch.
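A minimal sketch of that approach, assuming the venv lives at $HOME/myenv (path and job name are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=venv_test
#SBATCH --ntasks=1

# Activate the venv inside the job script itself, so the batch shell
# on the compute node picks up the venv's python first on PATH.
source "$HOME/myenv/bin/activate"
srun python foo.py
```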

clgp

I had a similar issue with Slurm version 20.11.7.
I had a virtual environment created with the system's python3, which was Python 3.6.8.
When activating the venv on the login node, calling an installed module worked fine, but from within the following shell script, for example, it did not and resulted in a ModuleNotFoundError:

#!/bin/bash

#SBATCH --partition=gpu         #use GPU partition
#SBATCH --nodes=1               #number of nodes 
#SBATCH --gres=gpu:2            #number of GPUs per node 
#SBATCH --job-name=joeynmt_test
#SBATCH --mail-user=email
#SBATCH --mail-type=all
#SBATCH --ntasks=1
#SBATCH --mem=24G
#SBATCH --time=08:00:00
#SBATCH --qos=standard


source /home/.../bin/activate   #activate venv
python3 --version
which python3


python3 -m myModule

Calling python3 --version directly after activating the venv reported the system's Python and its location instead of the Python from the venv.

What worked for me was loading a newer Python version (module add Python/3.8.6-GCCcore-10.2.0), recreating the venv with it, and then adjusting the shell script accordingly:

#!/bin/bash

#SBATCH --partition=gpu         #use GPU partition
#SBATCH --nodes=1               #number of nodes 
#SBATCH --gres=gpu:2            #number of GPUs per node 
#SBATCH --job-name=joeynmt_test
#SBATCH --mail-user=email
#SBATCH --mail-type=all
#SBATCH --ntasks=1
#SBATCH --mem=24G
#SBATCH --time=08:00:00
#SBATCH --qos=standard

module add Python/3.8.6-GCCcore-10.2.0

source /home/.../bin/activate   #activate venv
python3 --version
which python3


python3 -m myModule

Submitting this to Slurm with sbatch did not raise any errors, and the venv was successfully "transferred" to the worker node. Maybe this is helpful for others.
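The venv recreation step mentioned above (before submitting the job) might look like this; the venv path is illustrative, and the module name is taken from the answer:

```shell
# Load the newer Python first, so the venv is built on top of it
module add Python/3.8.6-GCCcore-10.2.0

# Create and activate a fresh venv with that interpreter
python3 -m venv "$HOME/myenv_py38"
source "$HOME/myenv_py38/bin/activate"

# Sanity check: this should now report the module's Python, not the system one
python3 --version
which python3
```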