Connection problems with dask scheduler


I've set up a Kubernetes cluster with GKE and installed the dask-kubernetes operator. When I try to start the cluster like this

cluster: KubeCluster = KubeCluster(custom_cluster_spec="cluster.yaml")
client = Client(cluster)
client

where the .yaml is essentially the cluster-spec.yaml example from the dask-kubernetes documentation, but with my own image (based on ghcr.io/dask/dask:2023.10.0-py3.10), I get the following error message, often multiple times in a row:

Task exception was never retrieved
future: <Task finished name='Task-822' coro=<PortForward._sync_sockets() done, defined at 
/opt/conda/lib/python3.10/site-packages/kr8s/_portforward.py:167> exception=ExceptionGroup('unhandled errors in a 
TaskGroup', [ConnectionClosedError('TCP socket closed')])>
  + Exception Group Traceback (most recent call last):
  |   File "/opt/conda/lib/python3.10/site-packages/kr8s/_portforward.py", line 171, in _sync_sockets
  |     async with anyio.create_task_group() as tg:
  |   File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 664, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/opt/conda/lib/python3.10/site-packages/kr8s/_portforward.py", line 183, in _tcp_to_ws
    |     raise ConnectionClosedError("TCP socket closed")
    | kr8s._exceptions.ConnectionClosedError: TCP socket closed
    +------------------------------------

In the scheduler pod logs it also says

distributed.comm.tcp - INFO - Connection from tcp://127.0.0.1:51484 closed before handshake completed

+ '[' '' ']'
+ '[' '' == true ']'
+ CONDA_BIN=/opt/conda/bin/conda
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
+ '[' '' ']'
+ '[' '' ']'
+ exec dask-scheduler
no environment.yml
/opt/conda/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py:142: FutureWarning: dask-scheduler is deprecated and will be removed in a future release; use `dask scheduler` instead
  warnings.warn(
2023-11-23 12:35:29,548 - distributed.scheduler - INFO - -----------------------------------------------
2023-11-23 12:35:30,422 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2023-11-23 12:35:30,473 - distributed.scheduler - INFO - State start
2023-11-23 12:35:30,478 - distributed.scheduler - INFO - -----------------------------------------------
2023-11-23 12:35:30,479 - distributed.scheduler - INFO -   Scheduler at:     tcp://10.12.0.35:8786
2023-11-23 12:35:30,480 - distributed.scheduler - INFO -   dashboard at:  http://10.12.0.35:8787/status
2023-11-23 12:35:30,480 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2023-11-23 12:41:27,053 - distributed.comm.tcp - INFO - Connection from tcp://127.0.0.1:51484 closed before handshake completed
2023-11-23 12:41:54,805 - distributed.scheduler - INFO - Receive client connection: Client-adc9e3c2-89fd-11ee-8284-0242ac120003
2023-11-23 12:41:54,806 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:38964

I've tried increasing initialDelaySeconds on the scheduler probes and checked that the Dask versions on the client and in the image match, but that didn't help. I couldn't find much else about this error online.
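For context, the initialDelaySeconds being tuned lives in the scheduler probes of the cluster spec. An abridged sketch along the lines of the documented cluster-spec.yaml example, with the image from above substituted and illustrative probe timings:

```yaml
# Abridged sketch of a DaskCluster custom resource; probe timings are
# illustrative, the image is the custom one described above.
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: example
spec:
  worker:
    replicas: 2
    spec:
      containers:
        - name: worker
          image: "ghcr.io/dask/dask:2023.10.0-py3.10"
          args: [dask-worker, --name, $(DASK_WORKER_NAME)]
  scheduler:
    spec:
      containers:
        - name: scheduler
          image: "ghcr.io/dask/dask:2023.10.0-py3.10"
          args: [dask-scheduler]
          ports:
            - name: tcp-comm
              containerPort: 8786
            - name: http-dashboard
              containerPort: 8787
          readinessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 15
            periodSeconds: 20
    service:
      type: ClusterIP
      selector:
        dask.org/cluster-name: example
        dask.org/component: scheduler
      ports:
        - name: tcp-comm
          port: 8786
          targetPort: "tcp-comm"
        - name: http-dashboard
          port: 8787
          targetPort: "http-dashboard"
```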

1 Answer

Answered by Jacob Tomlinson (best answer):

This is just a noisy warning and shouldn't stop your code from working.

The bug that caused the warning has since been fixed, so upgrading to a newer release should resolve it.
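To see which versions are installed client-side before upgrading, a small standard-library sketch (the package names listed are the usual Dask distribution names, assumed here):

```python
from importlib.metadata import version, PackageNotFoundError

def get_versions(packages):
    """Return the installed version of each distribution, or 'not installed'."""
    out = {}
    for name in packages:
        try:
            out[name] = version(name)
        except PackageNotFoundError:
            out[name] = "not installed"
    return out

# Distributions whose client-side versions should agree with the cluster image.
print(get_versions(["dask", "distributed", "dask-kubernetes", "kr8s"]))
```

Once connected, distributed's `Client.get_versions(check=True)` can also compare client, scheduler, and worker versions in a single call and raise on mismatch.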