I am running an Apache Flink 1.16.2 standalone session cluster on Kubernetes. When a job runs with parallelism greater than 1, it executes fine as long as all sub-tasks of a task are allocated to the same Task Manager, but when the sub-tasks are distributed across different Task Managers the job fails immediately on launch. I see the errors below in the logs.
Caused by: org.apache.flink.runtime.io.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'xx.xxx.xxx.x/xx.xxx.xxx.x:36687 [yy.yyy.yyy.y:6122-0f75ed]. This might indicate that the remote task manager is lost. ....
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error obtaining sorted input: Thread 'sortmerger reading thread' terminated due to exception: connection unexpectedly closed by remote task manager 'xx.xxx.xxx.x/xx.xxx.xxx.x:36687 [yy.yyy.yyy.y:6122-0f75ed]'. This might indicated that the remote task manager is lost.
Every stack trace for the failed job points to CreditBasedPartitionRequestClientHandler, which fails to get the partition from the other Task Manager.
I tried changing various memory options (managed memory and network memory) and increasing timeouts, but without success.
Note: communication between the Task Managers on the data port is fine. The job runs perfectly when all sub-tasks of a given task are allocated to task slots on the same Task Manager, even if different tasks are distributed across multiple Task Managers in the cluster. It fails only when the parallel sub-tasks of a single task are spread across task slots on different Task Managers.
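For reference, a data-port check of the kind mentioned above can be sketched as follows. This is only a minimal illustration, assuming Python is available inside the Task Manager container; the function name can_connect is hypothetical, and the host and port passed to it are placeholders to be replaced with the remote Task Manager's pod IP and data port. A successful TCP connect only shows that the port is reachable; it does not prove that Flink's partition requests succeed over that connection.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from inside one Task Manager pod, a call such as can_connect("<remote pod IP>", <data port>) returning True indicates the remote data port accepts TCP connections from that pod.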