I am at my wits' end here. I have been trying to solve an issue that suddenly started happening with my recent deployments to AKS, without finding a solution. I have gone through a lot of documentation, as well as various Stack Overflow questions and answers. I'm not a Kubernetes expert, but I'm trying.
My main issue is:
I have a JBoss application that I am deploying to AKS. The application is deployed as a StatefulSet (replicas=2) into the default namespace. The deployment creates the following services (all in the default namespace):
- demo-app-hs (headless service)
  - Has no ClusterIP, and shows 2 pods (demo-app-depl-0 and 1) when I drill in.
- demo-app-service (non-headless service)
  - Has a ClusterIP and an ExternalIP, and shows 2 pods (demo-app-depl-0 and 1) when I drill in.
- demo-app-service-lb (default lb using the Azure LoadBalancer)
  - Has a ClusterIP and an ExternalIP, and shows 2 pods (demo-app-depl-0 and 1) when I drill in.
The first pod comes up as 'demo-app-depl-0' and works perfectly fine: I can access it, and there are no errors. The second pod comes up as 'demo-app-depl-1', and in its logs I see the following error, which leads me to believe that this pod cannot connect to the master pod in the cluster:
[exec] 2023-11-24 04:45:13.378+0000 ERROR [org.apache.activemq.artemis.core.client:877] {} (Thread-28 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@21918ea4)) AMQ214016: Failed to create netty connection: java.net.UnknownHostException: demo-app-depl-0
When I hop into the pod (demo-app-depl-1) and check the /etc/resolv.conf file, I see the following:
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.0.0.10
options ndots:5
When I run 'kubectl exec -i -t demo-app-depl-1 -- nslookup default.svc.cluster.local', I am returned:
Server: 10.0.0.10
Address: 10.0.0.10#53
*** Can't find default.svc.cluster.local: No answer
When I run 'kubectl exec -i -t demo-app-depl-1 -- nslookup demo-app-hs.default.svc.cluster.local', everything can be resolved fine:
Server: 10.0.0.10
Address: 10.0.0.10#53
Name: demo-app-hs.default.svc.cluster.local
Address: 10.244.2.6
Name: demo-app-hs.default.svc.cluster.local
Address: 10.244.2.7
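(For reference, the per-pod record behind the headless service, which is what should make 'demo-app-depl-0' resolvable, follows the usual StatefulSet pattern <pod>.<headless-service>.<namespace>.svc.cluster.local, so it can be checked with something like 'kubectl exec -i -t demo-app-depl-1 -- nslookup demo-app-depl-0.demo-app-hs.default.svc.cluster.local'.)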
To my knowledge, it has always been like this, and the host has always resolved fine. I haven't changed my method of deploying, which has used Helm for the last year; however, just recently I started running into this issue. I'm not sure what to do at this point.
Any help would be appreciated, thank you.
Manually going into the pod 'demo-app-depl-1' and updating the '/etc/resolv.conf' file to include 'demo-app-hs.default.svc.cluster.local' resolved the issue. However, this isn't ideal, as I've never needed to add this line before, and in the past the hostname resolved fine without this step.
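For reference, after the manual fix the search line in '/etc/resolv.conf' presumably ends up looking something like this (the added domain is the headless service's FQDN, so a bare pod hostname such as 'demo-app-depl-0' gets expanded against it):
search demo-app-hs.default.svc.cluster.local default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.0.0.10
options ndots:5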
The YAML for the headless service looks like this:
apiVersion: v1
kind: Service
metadata:
  name: {{ include "deploy.fullname" . }}-hs
  labels:
    {{- include "deploy.labels" . | nindent 4 }}
spec:
  selector:
    app: {{ include "deploy.fullname" . }}-app-label
  # type: LoadBalancer
  clusterIP: None
  ports:
    ...
The YAML for the JBoss deployment looks like this:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ include "deploy.fullname" . }}-depl
  namespace: {{ .Values.application.namespace }}
  labels:
    {{- include "deploy.labels" . | nindent 4 }}
    app: {{ include "deploy.fullname" . }}-app-label
    date: "{{ now | unixEpoch }}"
spec:
  replicas: {{ .Values.application.replicas }}
  serviceName: {{ include "deploy.fullname" . }}-hs
  selector:
    matchLabels:
      {{- include "deploy.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "deploy.selectorLabels" . | nindent 8 }}
        app: {{ include "deploy.fullname" . }}-app-label
    spec:
      serviceAccountName: {{ include "deploy.fullname" . }}-clustering-service-account
      securityContext:
        fsGroup: 1067
      containers:
        - image: {{ .Values.application.image }}:{{ .Values.application.version }}
          name: {{ .Values.application.containerName }}
          command: ["ant"]
          args: ['run', '-Dcontext-root={{ include "deploy.contextRoot" . }}', '-Dfile.encoding=utf8', '-Denv-name={{ .Values.application.envName }}', '-Dcacerts=/opt/jboss/standalone/configuration/cacerts']
          imagePullPolicy: {{ .Values.application.imagePullPolicy }}
          {{ if .Values.application.processLarge }}
          resources:
            requests:
              cpu: 2000m
              memory: 4096Mi
            limits:
              cpu: 3000m
              memory: 8192Mi
          {{ end }}
          ports:
            ...
          env:
            - name: ENV_NAME
              valueFrom:
                configMapKeyRef:
                  name: {{ include "deploy.fullname" . }}-config-map
                  key: env-name
            # ... other env variables ...
          volumeMounts:
            - name: jboss-data-vol
              mountPath: /opt/jboss/standalone/data
            - name: jboss-log-vol
              mountPath: /opt/jboss/standalone/log
      ...
As per our discussion, I don't see anything wrong with the headless service. Since you have already confirmed that updating the '/etc/resolv.conf' file to include 'demo-app-hs.default.svc.cluster.local' works, you can add an instruction to the app.yaml that amends the /etc/resolv.conf file on startup to add the hostname with the service path, such as "<service>.default.svc.cluster.local".