I’m having a problem when my Python script uses Popen to run a bash command (Slurm sbatch) that starts a job on a different compute node.

The error happens during wandb.init(): wandb.sdk.wandb_manager.ManagerConnectionRefusedError: Connection to wandb service failed since the process is not available.

The sbatch command starts a job on a different node and looks like this:

p = Popen(
  [
    shutil.which("sbatch"),
    '--mem=40G',
    '--gres=gpu:titan_xp:1',
    '--nodelist=tikgpu02',
    '--cpus-per-task=2',
    '--output=/home/pschlaepfer/denselp/slt/log/%j.out',
    '--error=/home/pschlaepfer/denselp/slt/log/%j.err',
    '/home/pschlaepfer/denselp/slt/scripts/slt.sh',
    '--action=fine-tune-thf',
    '--max-length', '128',
    '--lr=4e-5',
    '--epochs=5',
    '--batch-size=16',
    '--task', task,
    '--pre-trained-path', checkpoint_path,
    '--wandb-mode=offline',
  ],
  start_new_session=True,
)
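To make the structure of that call easier to inspect, here is a minimal sketch that separates the sbatch options from the arguments passed to the job script. The helper name `build_sbatch_command` is hypothetical and only a shortened subset of the options above is shown; the actual `Popen` call is left commented out so the snippet also runs on a machine without Slurm.

```python
import shutil
from subprocess import Popen  # only needed once the Popen line is uncommented


def build_sbatch_command(script, sbatch_options, script_args):
    """Return the argv list: sbatch, its options, the job script, the script's arguments."""
    # Hypothetical helper. Fall back to the bare name if sbatch is not on PATH.
    sbatch = shutil.which("sbatch") or "sbatch"
    return [sbatch, *sbatch_options, script, *script_args]


cmd = build_sbatch_command(
    "/home/pschlaepfer/denselp/slt/scripts/slt.sh",
    ["--mem=40G", "--nodelist=tikgpu02"],
    ["--action=fine-tune-thf", "--wandb-mode=offline"],
)
print(cmd)

# On a machine with Slurm installed:
# p = Popen(cmd, start_new_session=True)
```

Everything placed before the script path is interpreted by sbatch, and everything after it is handed to `slt.sh`, so printing `cmd` is a quick way to confirm that `--wandb-mode=offline` really reaches the job script.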

wandb.init() is called like this:

experiment_name = f"job-id:{meta_config.job_id}"
run = wandb.init(
  project=wandb_project_choice+("-proto" if meta_config.is_debug_instance else ""),
  name=experiment_name,
  tags=[
    "job_id:"+str(meta_config.job_id)
  ],
  settings=wandb.Settings(start_method='fork'),
  dir=wandb_logging_dir_path,
  config=dict(experiment_config._asdict()) if type(experiment_config).__name__ == 'ExperimentConfig' else dict(experiment_config._as_dict()),
  reinit=True,
  mode="offline",
)

And here's the whole stacktrace:

    Traceback (most recent call last):
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 115, in _service_connect
        svc_iface._svc_connect(port=port)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/service/service_sock.py", line 30, in _svc_connect
        self._sock_client.connect(port=port)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 102, in connect
        s.connect(("localhost", port))
    ConnectionRefusedError: [Errno 111] Connection refused

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/home/pschlaepfer/denselp/slt/main.py", line 107, in <module>
        run = wandb.init(
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1185, in init
        raise e
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1162, in init
        wi.setup(kwargs)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 189, in setup
        self._wl = wandb_setup.setup(settings=setup_settings)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
        ret = _setup(settings=settings)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
        wl = _WandbSetup(settings=settings)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
        _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
        self._setup()
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
        self._setup_manager()
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
        self._manager = wandb_manager._Manager(settings=self._settings)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 152, in __init__
        wandb._sentry.reraise(e)
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 154, in reraise
        raise exc.with_traceback(sys.exc_info()[2])
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 150, in __init__
        self._service_connect()
      File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 124, in _service_connect
        raise ManagerConnectionRefusedError(message)
    wandb.sdk.wandb_manager.ManagerConnectionRefusedError: Connection to wandb service failed since the process is not available.
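The first frame shows the client connecting to a port on "localhost". One possible explanation, which is an assumption and not confirmed for this setup: if the launching script has itself started wandb, it may have set the `WANDB_SERVICE` environment variable, and since sbatch exports the submitting environment to the job by default, the job on the other node would look for a service on its own localhost that does not exist there. The sketch below shows how the environment passed to `Popen` could be copied without that variable to test this idea. The helper name `env_without_wandb_service` and the example token value are made up for illustration.

```python
import os


def env_without_wandb_service(environ=None):
    """Return a copy of the environment with WANDB_SERVICE removed, if present."""
    # Hypothetical helper; assumes WANDB_SERVICE is the variable that carries
    # the parent's service address in this wandb version.
    env = dict(os.environ if environ is None else environ)
    env.pop("WANDB_SERVICE", None)
    return env


# Made-up example environment; the token value is illustrative only.
parent_env = {"PATH": "/usr/bin", "WANDB_SERVICE": "2-12345-tcp-localhost-40001"}
child_env = env_without_wandb_service(parent_env)
print(sorted(child_env))

# In the launching script, the cleaned copy would be passed to Popen:
# p = Popen(cmd, env=child_env, start_new_session=True)
```

Printing `os.environ.get("WANDB_SERVICE")` at the top of `main.py` inside the job would be a quick way to check whether the variable is actually inherited before changing anything.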

The wandb version used is 0.16.0.

Thank you very much for your help!
