YARN job fails due to a connection issue


I have a Hadoop 3.3.6 setup in a Kubernetes cluster. All the Hadoop components are exposed via ClusterIP services, and I'm able to telnet to the ports exposed from the respective pods. But when I run the example job from the datanode pod (I tried from the resourcemanager pod as well), I get the following error:

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /README.txt /MROutput

2023-12-08 21:18:57,722 INFO [main] org.apache.hadoop.security.SecurityUtil: Updating Configuration
2023-12-08 21:18:58,323 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2023-12-08 21:18:58,544 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2023-12-08 21:18:58,544 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2023-12-08 21:18:58,714 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens: [Kind: mapreduce.job, Service: job_1702069922844_0002, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@71c3b41)]
2023-12-08 21:18:58,827 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2023-12-08 21:18:59,992 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: nodemanager.hadoop.svc.cluster.local.hadoop/10.233.51.169:37127. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2023-12-08 21:19:00,994 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: nodemanager.hadoop.svc.cluster.local.hadoop/10.233.51.169:37127. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2023-12-08 21:19:01,996 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: nodemanager.hadoop.svc.cluster.local.hadoop/10.233.51.169:37127. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2023-12-08 21:19:02,003 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From nodemanager-8fc5cdf9d-q9kwx/10.233.74.107 to nodemanager.hadoop.svc.cluster.local.hadoop:37127 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:930)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:845)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1571)
    at org.apache.hadoop.ipc.Client.call(Client.java:1513)
    at org.apache.hadoop.ipc.Client.call(Client.java:1410)
    at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
    at com.sun.proxy.$Proxy8.getTask(Unknown Source)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:140)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:205)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:600)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:652)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:773)
    at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:347)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1632)
    at org.apache.hadoop.ipc.Client.call(Client.java:1457)
    ... 4 more

2023-12-08 21:19:02,004 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2023-12-08 21:19:02,005 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2023-12-08 21:19:02,006 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.

nodemanager.hadoop.svc.cluster.local.hadoop is my service hostname, which is correct, but port 37127 is not opened in that service, and it is random every time I run the job, so I cannot expose that port from my service.

I'm running the above job from my datanode, which is able to connect to the nodemanager.hadoop.svc.cluster.local.hadoop service, though via the different IPs that are exposed. I could also connect from the nodemanager-8fc5cdf9d-q9kwx/10.233.74.107 pod to that service via the different IPs exposed by it.

Also, when I checked for the port with netstat inside the nodemanager while the job was running, the tcp6 port 37127 did exist for as long as the job ran.
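For reference, that listener can be checked from outside the pod with something along these lines (the pod name is the one from the error above; kubectl access and net-tools being present in the image are assumptions):

kubectl exec -n hadoop nodemanager-8fc5cdf9d-q9kwx -- netstat -tlnp | grep 37127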

It looks like I'm missing some configuration setting. Could someone please help me? (I've been struggling with this for a couple of days.)

The job finishes with failure. [Image: job status after completion]

There is 1 answer

Answer by nobso:

I fixed the port range using yarn.app.mapreduce.am.job.client.port-range and then exposed those ports via the Kubernetes service; it works now.
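For anyone hitting the same issue, here is a minimal sketch of the two pieces involved. The range 50100-50102, the service name, the namespace, and the selector label are illustrative assumptions, not values from the setup above. In mapred-site.xml, pinning the MRAppMaster's client port range stops it from being random:

<property>
  <name>yarn.app.mapreduce.am.job.client.port-range</name>
  <value>50100-50102</value>
</property>

A Kubernetes Service cannot expose a numeric range, so each port in the range has to be listed explicitly, for example:

apiVersion: v1
kind: Service
metadata:
  name: nodemanager
  namespace: hadoop
spec:
  selector:
    app: nodemanager        # assumed pod label
  ports:
    - name: am-client-50100
      port: 50100
    - name: am-client-50101
      port: 50101
    - name: am-client-50102
      port: 50102

Keep the range small enough to enumerate in the Service, but large enough that concurrent application masters on the same node don't run out of ports.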