Incorrect hostname resolution for pods in the same StatefulSet in an AKS cluster


I am at my wits' end here. I have been trying, without success, to solve an issue that suddenly started happening with my recent deployments to AKS. I have gone through a lot of documentation, as well as various Stack Overflow questions and answers. I'm not a Kubernetes expert, but I'm trying.

My main issue is:

I have a JBoss application that I am deploying to AKS. The application is deployed as a StatefulSet (replicas=2) into the default namespace. The deployment creates the following services (all in the default namespace):

  • demo-app-hs (headless service)
    • Has no ClusterIP, and shows 2 pods (demo-app-depl-0 and 1) when I drill in.
  • demo-app-service (non-headless service)
    • Has a ClusterIP and an ExternalIP, and shows 2 pods (demo-app-depl-0 and 1) when I drill in.
  • demo-app-service-lb (default lb using the Azure LoadBalancer)
    • Has a ClusterIP and an ExternalIP, and shows 2 pods (demo-app-depl-0 and 1) when I drill in.

The first pod comes up as 'demo-app-depl-0' and works perfectly fine; I can access it with no errors. The second pod comes up as 'demo-app-depl-1', and in its logs I see the following error, which leads me to believe that this pod cannot connect to the master pod in the cluster:

[exec] 2023-11-24 04:45:13.378+0000 ERROR [org.apache.activemq.artemis.core.client:877] {} (Thread-28 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@21918ea4)) AMQ214016: Failed to create netty connection: java.net.UnknownHostException: demo-app-depl-0
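
To my understanding, StatefulSet pods governed by a headless service get per-pod DNS records of the form <pod>.<service>.<namespace>.svc.cluster.local, so the name that should resolve here is 'demo-app-depl-0.demo-app-hs.default.svc.cluster.local' (which can be checked with 'kubectl exec -i -t demo-app-depl-1 -- nslookup demo-app-depl-0.demo-app-hs.default.svc.cluster.local'), whereas the error above shows Artemis trying to reach the bare pod name 'demo-app-depl-0'.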

When I hop into the pod (demo-app-depl-1) and check the /etc/resolv.conf file, I see the following:

search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.0.0.10
options ndots:5
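
If I'm reading the resolver options right (ndots:5 means any name with fewer than five dots is tried against each search suffix before being looked up as-is), the bare name 'demo-app-depl-0' gets expanded to 'demo-app-depl-0.default.svc.cluster.local', 'demo-app-depl-0.svc.cluster.local' and 'demo-app-depl-0.cluster.local', none of which exist, because the per-pod records for a StatefulSet live under the headless service subdomain ('demo-app-depl-0.demo-app-hs.default.svc.cluster.local').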

When I run 'kubectl exec -i -t demo-app-depl-1 -- nslookup default.svc.cluster.local', I am returned:

Server: 10.0.0.10
Address: 10.0.0.10#53

*** Can't find default.svc.cluster.local: No answer

When I run 'kubectl exec -i -t demo-app-depl-1 -- nslookup demo-app-hs.default.svc.cluster.local', everything can be resolved fine:

Server: 10.0.0.10
Address: 10.0.0.10#53

Name: demo-app-hs.default.svc.cluster.local
Address: 10.244.2.6
Name: demo-app-hs.default.svc.cluster.local
Address: 10.244.2.7

To my knowledge, it has always been like this, and the hosts have always resolved fine. I haven't changed my deployment method, which has used Helm for the last year; just recently, I started running into this issue. I'm not sure what to do at this point.

Any help would be appreciated, thank you.

Manually going into the pod 'demo-app-depl-1' and updating the '/etc/resolv.conf' file to include 'demo-app-hs.default.svc.cluster.local' resolved the issue. However, this isn't ideal, as I've never needed to add this line; in the past, the name resolved fine without this step.

The yaml for the headless service looks like this:

apiVersion: v1
kind: Service
metadata:
  name: {{ include "deploy.fullname" . }}-hs
  labels:
    {{- include "deploy.labels" . | nindent 4 }}
spec:
  selector:
    app: {{ include "deploy.fullname" . }}-app-label
#  type: LoadBalancer
  clusterIP: None
  ports:
    ...

The yaml for the JBoss deployment (the StatefulSet) looks like this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ include "deploy.fullname" . }}-depl
  namespace: {{ .Values.application.namespace }}
  labels:
    {{- include "deploy.labels" . | nindent 4 }}
    app: {{ include "deploy.fullname" . }}-app-label
    date: "{{ now | unixEpoch }}"
spec:
  replicas: {{ .Values.application.replicas }}
  serviceName: {{ include "deploy.fullname" . }}-hs
  selector:
    matchLabels:
      {{- include "deploy.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "deploy.selectorLabels" . | nindent 8 }}
        app: {{ include "deploy.fullname" . }}-app-label
    spec:
      serviceAccountName: {{ include "deploy.fullname" . }}-clustering-service-account
      securityContext:
        fsGroup: 1067
      containers:
        - image: {{ .Values.application.image }}:{{ .Values.application.version }}
          name: {{ .Values.application.containerName }}
          command: ["ant"]
          args: ['run', '-Dcontext-root={{ include "deploy.contextRoot" . }}', '-Dfile.encoding=utf8', '-Denv-name={{ .Values.application.envName }}', '-Dcacerts=/opt/jboss/standalone/configuration/cacerts']
          imagePullPolicy: {{ .Values.application.imagePullPolicy }}
          {{ if .Values.application.processLarge }}
          resources:
            requests:
              cpu: 2000m
              memory: 4096Mi
            limits:
              cpu: 3000m
              memory: 8192Mi
          {{ end }}
          ports:
            ...
          env:
            - name: ENV_NAME
              valueFrom:
                configMapKeyRef:
                  name: {{ include "deploy.fullname" . }}-config-map
                  key: env-name
            # ... other env variables ...
          volumeMounts:
            - name: jboss-data-vol
              mountPath: /opt/jboss/standalone/data
            - name: jboss-log-vol
              mountPath: /opt/jboss/standalone/log
         ...
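
For reference, here is roughly what the relevant rendered pair looks like once Helm fills in the templates (illustrative values only: I'm assuming "deploy.fullname" renders to "demo-app", which matches the pod and service names above, and the port and image are just placeholders). The important relationship is that spec.serviceName on the StatefulSet matches the headless Service's name, since that is what publishes the per-pod DNS records under 'demo-app-hs':

apiVersion: v1
kind: Service
metadata:
  name: demo-app-hs                  # headless service that governs the StatefulSet
spec:
  clusterIP: None
  selector:
    app: demo-app-app-label
  ports:
    - name: artemis                  # placeholder port, the real chart exposes more
      port: 61616
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-app-depl
spec:
  replicas: 2
  serviceName: demo-app-hs           # must match the headless Service name exactly
  selector:
    matchLabels:
      app: demo-app-app-label
  template:
    metadata:
      labels:
        app: demo-app-app-label
    spec:
      containers:
        - name: jboss                # placeholder container, trimmed for brevity
          image: example.azurecr.io/demo-app:latest
          ports:
            - containerPort: 61616

With these names, the pods are demo-app-depl-0 and demo-app-depl-1, and their stable DNS records are demo-app-depl-0.demo-app-hs.default.svc.cluster.local and demo-app-depl-1.demo-app-hs.default.svc.cluster.local.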

1 Answer

Arko (Best Answer)

As per our discussion, I don't see anything wrong with the headless service. Since you have already confirmed that updating the '/etc/resolv.conf' file to include 'demo-app-hs.default.svc.cluster.local' works, you can add an instruction to the app.yaml that amends the pods' /etc/resolv.conf on startup so that the hostname is resolved with the service path, i.e. "<service>.default.svc.cluster.local".
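
A sketch of the Kubernetes-native way to do the same thing (untested against your chart; the search domain below assumes the default namespace and the service name from your question) is to let the kubelet append the extra search domain for you through the pod template's dnsConfig, rather than editing the file in a startup script:

spec:
  template:
    spec:
      dnsPolicy: ClusterFirst                         # the default; shown here only for clarity
      dnsConfig:
        searches:
          - demo-app-hs.default.svc.cluster.local     # extra search domain written into /etc/resolv.conf

With that in place, the resolver expands the bare name 'demo-app-depl-0' through the added search domain to 'demo-app-depl-0.demo-app-hs.default.svc.cluster.local', which is exactly the record the headless service publishes, so no manual change inside the running pod is needed.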