I'm having a problem when my Python script runs a subprocess (with Popen) that executes a bash command (Slurm sbatch) to start a job on a different compute node.
The error happens during wandb.init():
wandb.sdk.wandb_manager.ManagerConnectionRefusedError: Connection to wandb service failed since the process is not available.
The sbatch command starts a job on a different node and looks like this:
p = Popen(
    [
        shutil.which("sbatch"),
        '--mem=40G',
        '--gres=gpu:titan_xp:1',
        '--nodelist=tikgpu02',
        '--cpus-per-task=2',
        '--output=/home/pschlaepfer/denselp/slt/log/%j.out',
        '--error=/home/pschlaepfer/denselp/slt/log/%j.err',
        '/home/pschlaepfer/denselp/slt/scripts/slt.sh',
        '--action=fine-tune-thf',
        '--max-length', '128',
        '--lr=4e-5',
        '--epochs=5',
        '--batch-size=16',
        '--task', task,
        '--pre-trained-path', checkpoint_path,
        '--wandb-mode=offline',
    ],
    start_new_session=True,
)
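One thing that may be worth ruling out here: by default sbatch exports the submitting process's environment to the job, so any wandb-related variables set in the launching script would also be visible on the other node. The following is only a sketch of launching sbatch with a filtered environment. The helper names `env_without_prefix` and `submit` are hypothetical, and the assumption that variables prefixed with `WANDB_` are the relevant ones has not been confirmed for this setup.

```python
import os
import shutil
from subprocess import Popen


def env_without_prefix(env, prefix="WANDB_"):
    """Return a copy of env without variables whose names start with prefix."""
    return {k: v for k, v in env.items() if not k.startswith(prefix)}


def submit(sbatch_args, env=None):
    """Hypothetical helper: run sbatch with a filtered copy of the environment."""
    sbatch = shutil.which("sbatch")
    if sbatch is None:
        raise FileNotFoundError("sbatch not found on PATH")
    # Assumption: dropping WANDB_* variables is enough to isolate the job.
    clean_env = env_without_prefix(os.environ if env is None else env)
    return Popen([sbatch, *sbatch_args], env=clean_env, start_new_session=True)
```

With this, the call above would become `submit(['--mem=40G', ..., '--wandb-mode=offline'])`, keeping the same argument list but passing `env=` explicitly to Popen.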
wandb.init() is called like this:
experiment_name = f"job-id:{meta_config.job_id}"
run = wandb.init(
    project=wandb_project_choice+("-proto" if meta_config.is_debug_instance else ""),
    name=experiment_name,
    tags=[
        "job_id:"+str(meta_config.job_id)
    ],
    settings=wandb.Settings(start_method='fork'),
    dir=wandb_logging_dir_path,
    config=dict(experiment_config._asdict()) if type(experiment_config).__name__ == 'ExperimentConfig' else dict(experiment_config._as_dict()),
    reinit=True,
    mode="offline",
)
And here's the whole stacktrace:
Traceback (most recent call last):
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 115, in _service_connect
    svc_iface._svc_connect(port=port)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/service/service_sock.py", line 30, in _svc_connect
    self._sock_client.connect(port=port)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 102, in connect
    s.connect(("localhost", port))
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/pschlaepfer/denselp/slt/main.py", line 107, in <module>
    run = wandb.init(
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1185, in init
    raise e
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1162, in init
    wi.setup(kwargs)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 189, in setup
    self._wl = wandb_setup.setup(settings=setup_settings)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
    ret = _setup(settings=settings)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
    wl = _WandbSetup(settings=settings)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
    self._setup()
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
    self._setup_manager()
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
    self._manager = wandb_manager._Manager(settings=self._settings)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 152, in __init__
    wandb._sentry.reraise(e)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 154, in reraise
    raise exc.with_traceback(sys.exc_info()[2])
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 150, in __init__
    self._service_connect()
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 124, in _service_connect
    raise ManagerConnectionRefusedError(message)
wandb.sdk.wandb_manager.ManagerConnectionRefusedError: Connection to wandb service failed since the process is not available.
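The innermost frame of the trace is a plain TCP connect to localhost on some port, which is refused. As a rough way to reproduce that step in isolation on the compute node, a minimal sketch follows. The function name `can_connect` is hypothetical, and the port to test would have to be taken from whatever value the wandb client is actually using, which the trace does not show.

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds, else False."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and resolution failures.
        return False
```

A result of False on the job's node for that port would at least be consistent with the service process not running on that machine.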
The wandb version used is 0.16.0.
Thank you very much for your help!