How do I have condor automatically import my conda environment when running my python jobs?


I am submitting my jobs to Condor, but it says that tensorboard is not installed, which is false because I ran it in an interactive job, so it is installed.

How do I have condor use my current active conda environment?

My condor submit script:

####################
#
# Experiments script
# Simple HTCondor submit description file
#
# reference: https://gitlab.engr.illinois.edu/Vision/vision-gpu-servers/-/wikis/HTCondor-user-guide#submit-jobs
#
# chmod a+x test_condor.py
# chmod a+x experiments_meta_model_optimization.py
# chmod a+x meta_learning_experiments_submission.py
# chmod a+x download_miniImagenet.py
#
# condor_submit -i
# condor_submit job.sub
#
####################

# Executable   = meta_learning_experiments_submission.py
# Executable = automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py
# Executable = ~/automl-meta-learning/automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py
Executable = /home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py

## Output Files
Log          = condor_job.$(CLUSTER).log.out
Output       = condor_job.$(CLUSTER).stdout.out
Error        = condor_job.$(CLUSTER).err.out

# Use this to make sure 1 gpu is available. The key words are case insensitive.
REquest_gpus = 1
# requirements = ((CUDADeviceName = "Tesla K40m")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.gpus >= Requestgpus) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
# requirements = (CUDADeviceName == "Tesla K40m")
# requirements = (CUDADeviceName == "Quadro RTX 6000")
requirements = (CUDADeviceName != "Tesla K40m")

# Note: to use multiple CPUs instead of the default (one CPU), use request_cpus as well
Request_cpus = 8

# E-mail option
Notify_user = [email protected]
Notification = always

Environment = MY_CONDOR_JOB_ID=$(CLUSTER)

# "Queue" means add the setup until this line to the queue (needs to be at the end of script).
Queue

The first few lines of my submission script, up to the line that fails:

#!/home/miranda9/.conda/bin/python3.7

import torch
import torch.nn as nn
import torch.optim as optim
# import torch.functional as F
from torch.utils.tensorboard import SummaryWriter

Related comments:

I did see the question "how to run a python program on Condor?" and this page http://chtc.cs.wisc.edu/python-jobs.shtml, but I can't believe we have to do all that. Nobody else on the cluster does anything that complicated, and I have run my scripts before without it, so I am very skeptical that it is needed.


There are 2 answers

Answer by Charlie Parker:

I really do not understand how Condor works, but it seems that once I put the right path to the current environment's python at the top of my script, it started working. So check where your python command is:

(automl-meta-learning) miranda9~/automl-meta-learning $ which python
~/miniconda3/envs/automl-meta-learning/bin/python

then copy-paste that path into the shebang at the top of your Python submission script:

#!/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python

I wish I could include all of this in the job.sub. If you know how, please let me know.
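
One way to keep this inside job.sub (a sketch, untested, reusing the same paths as above) is to make the environment's python itself the executable and pass the script as an argument, so the shebang no longer matters:

# run the script through the conda env's interpreter directly
Executable = /home/miranda9/miniconda3/envs/automl-meta-learning/bin/python
Arguments  = /home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py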


Reference solution: https://stackoverflow.com/a/64484025/1601580


Echoing Christina's solution:

put this in your submit file (job.sub):

getenv = True

my current submission script:

####################
#
# Experiments script
# Simple HTCondor submit description file
#
# reference: https://gitlab.engr.illinois.edu/Vision/vision-gpu-servers/-/wikis/HTCondor-user-guide#submit-jobs
#
# chmod a+x test_condor.py
# chmod a+x experiments_meta_model_optimization.py
# chmod a+x meta_learning_experiments_submission.py
# chmod a+x download_miniImagenet.py
# chmod a+x ~/meta-learning-lstm-pytorch/main.py
# chmod a+x /home/miranda9/automl-meta-learning/automl-proj/meta_learning/datasets/rand_fc_nn_vec_mu_ls_gen.py
# chmod a+x /home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/supervised_experiments_submission.py
# chmod a+x /home/miranda9/automl-meta-learning/results_plots/is_rapid_learning_real.py
# chmod a+x /home/miranda9/automl-meta-learning/test_condor.py
# chmod a+x /home/miranda9/ML4Coq/main.sh
# chmod a+x /home/miranda9/ML4Coq/ml4coq-proj/PosEval/download_data.py
# chmod a+x /home/miranda9/ML4Coq/ml4coq-proj/pos_eval/create_pos_eval_dataset.sh
# chmod a+x /home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py
# chmod a+x /home/miranda9/ML4Coq/main.sh
# condor_submit -i
# condor_submit job.sub
#
####################

# Executable = /home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/supervised_experiments_submission.py

# Executable = /home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py
# SUBMIT_FILE = meta_learning_experiments_submission.py

# Executable = /home/miranda9/meta-learning-lstm-pytorch/main.py
# Executable = /home/miranda9/automl-meta-learning/automl-proj/meta_learning/datasets/rand_fc_nn_vec_mu_ls_gen.py

# Executable = /home/miranda9/automl-meta-learning/results_plots/is_rapid_learning_real.py
# SUBMIT_FILE = is_rapid_learning_real.py

# Executable = /home/miranda9/automl-meta-learning/test_condor.py

# Executable = /home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py
# SUBMIT_FILE = main_brando.py

# Executable = /home/miranda9/ML4Coq/ml4coq-proj/PosEval/download_data.py
# SUBMIT_FILE = ml4coq-proj/PosEval/download_data.py

# Executable = /home/miranda9/ML4Coq/ml4coq-proj/pos_eval/create_pos_eval_dataset.sh
# SUBMIT_FILE = create_pos_eval_dataset.sh

Executable = /home/miranda9/ML4Coq/main.sh
SUBMIT_FILE = main.sh

# Output Files
Log          = $(SUBMIT_FILE).log$(CLUSTER)
Output       = $(SUBMIT_FILE).o$(CLUSTER)
Error        = $(SUBMIT_FILE).e$(CLUSTER)

getenv = True
# cuda_version = 10.2
# cuda_version = 11.0

# Use this to make sure 1 gpu is available. The key words are case insensitive.
# REquest_gpus = 1
REquest_gpus = 2
requirements = (CUDADeviceName != "Tesla K40m")
requirements = (CUDADeviceName != "GeForce GTX TITAN X")
# requirements = (CUDADeviceName == "Quadro RTX 6000")
# requirements = ((CUDADeviceName != "Tesla K40m")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.gpus >= Requestgpus) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
# requirements = (CUDADeviceName == "Tesla K40m")
# requirements = (CUDADeviceName == "GeForce GTX TITAN X")

# Note: to use multiple CPUs instead of the default (one CPU), use request_cpus as well
# Request_cpus = 1
Request_cpus = 4
# Request_cpus = 5
# Request_cpus = 8
# Request_cpus = 16
# Request_cpus = 32

# E-mail option
Notify_user = [email protected]
Notification = always

Environment = MY_CONDOR_JOB_ID=$(CLUSTER)

# "Queue" means add the setup until this line to the queue (needs to be at the end of script).
Queue
Answer by Christina K:

HTCondor uses different default environments in interactive and batch jobs. Interactive jobs replicate the same shell environment as your login session (including the activated conda environment). Batch jobs begin with a VERY pared down environment (to see this in action, try running a test job with /usr/bin/env as the executable); an activated conda environment would not be carried forward into the batch job environment.
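
For example, a minimal test submit file along those lines (a sketch; the output file names are placeholders) that dumps the environment a batch job actually starts with:

# print the environment the batch job really sees
Executable = /usr/bin/env
Log        = env_test.log
Output     = env_test.out
Error      = env_test.err
Queue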

This behavior and potential submit file solutions are described here in the HTCondor manual: https://htcondor.readthedocs.io/en/latest/users-manual/services-for-jobs.html?highlight=environment#environment-variables
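
Concretely, the two submit-file mechanisms the manual describes are getenv (copy the submitter's environment into the job) and environment (set variables explicitly). A sketch of both, where the PATH value assumes the conda env location from the other answer:

# copy the whole submission-time environment, including the activated conda env's PATH
getenv = True

# or set only what the job needs explicitly (new syntax: quoted, space-separated name=value pairs)
environment = "PATH=/home/miranda9/miniconda3/envs/automl-meta-learning/bin:/usr/bin:/bin"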