transport: Error while dialing dial tcp xx.xx.xx.xx:15012: i/o timeout with AWS-EKS + Terraform + Istio


I set up what I think is a bog-standard EKS cluster using terraform-aws-eks, like so:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "my-test-cluster"
  cluster_version = "1.21"

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {}
    vpc-cni = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  eks_managed_node_group_defaults = {
    disk_size      = 50
    instance_types = ["m5.large"]
  }

  eks_managed_node_groups = {
    green_test = {
      min_size     = 1
      max_size     = 2
      desired_size = 2

      instance_types = ["t3.large"]
      capacity_type  = "SPOT"
    }
  }
}
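(Not shown above: after terraform apply, kubectl and istioctl need a kubeconfig for the new cluster; with the standard AWS CLI tooling that's something like the following, with the region filled in.)

aws eks update-kubeconfig --name my-test-cluster --region <your-region>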

Then I tried to install Istio following the install docs:

istioctl install

which resulted in this:

✔ Istio core installed
✔ Istiod installed
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
  Deployment/istio-system/istio-ingressgateway (containers with unready status: [istio-proxy])
- Pruning removed resources
Error: failed to install manifests: errors occurred during operation

so I did a bit of digging:

kubectl logs istio-ingressgateway-7fd568fc99-6ql8h -n istio-system

led to

2022-04-17T13:51:14.540346Z warn    ca  ca request failed, starting attempt 1 in 90.275446ms
2022-04-17T13:51:14.631695Z warn    ca  ca request failed, starting attempt 2 in 195.118437ms
2022-04-17T13:51:14.827286Z warn    ca  ca request failed, starting attempt 3 in 394.627125ms
2022-04-17T13:51:15.222738Z warn    ca  ca request failed, starting attempt 4 in 816.437569ms
2022-04-17T13:51:16.039427Z warn    sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:51:33.941084Z warning envoy config    StreamAggregatedResources gRPC config stream closed since 318s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:52:05.830859Z warning envoy config    StreamAggregatedResources gRPC config stream closed since 350s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:52:26.232441Z warning envoy config    StreamAggregatedResources gRPC config stream closed since 370s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"

So from a lot of reading, it seems like maybe the istio-ingressgateway pod is not able to connect to istiod?
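A quick sanity check here (assuming the default istio-system namespace from istioctl install) is to confirm that the IP in the dial error really is the istiod ClusterIP, and that the service has endpoints behind it:

kubectl get svc istiod -n istio-system
kubectl get endpoints istiod -n istio-system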

Google time, I find this: https://istio.io/latest/docs/ops/diagnostic-tools/proxy-cmd/#verifying-connectivity-to-istiod

kubectl create namespace foo
kubectl apply -f <(istioctl kube-inject -f samples/sleep/sleep.yaml) -n foo

kubectl exec $(kubectl get pod -l app=sleep -n foo -o jsonpath={.items..metadata.name}) -c sleep -n foo -- curl -sS istiod.istio-system:15014/version

which gives me:

curl: (7) Failed to connect to istiod.istio-system port 15014 after 4 ms: Connection refused
command terminated with exit code 7

So I think this problem is not specific to the istio-ingressgateway, but a more general networking issue in a standard EKS cluster?
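One way to narrow that down (a sketch, reusing the same istiod version endpoint): kubectl port-forward tunnels through the API server rather than over the node-to-node network, so if this succeeds while the in-cluster curl fails, the problem is most likely on the node/security-group path:

kubectl -n istio-system port-forward deploy/istiod 15014:15014 &
curl -sS localhost:15014/version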

  1. How would I go about debugging from here, to figure out what the problem is? Are there good resources to understand the networking model of kubernetes and istio?
  2. How come the istio platform docs leave off EKS? Does the istio team not want istio to run on AWS-EKS?
  3. Does this seem like an issue that should be filed against EKS? The aws-eks Terraform module? Istio? I'm not sure exactly where it lands, and it seems that if I ask for help from one team, another team would almost certainly need to be involved.
  4. Are there known incompatibilities with Istio and EKS that I should be aware of?

Thanks in advance!

[22-04-18] Update 1:

Ok, so the test with the sleep pod in the foo namespace leads me to believe that the connection timeout has to do with AWS security group rules. The theory is: if the required security group ports are not open, you'd see exactly the sort of "connection refused" / "i/o timeout" messages that I'm seeing. To test the theory, I took the 4 security groups that are created by this module

  1. k8s/EKS/Amazon SG
  2. EKS ENI SG
  3. EKS Cluster SG
  4. EKS Shared node group SG

and opened up all inbound/outbound traffic on all of them.
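(For reference, that temporary "open everything" test can also be scripted; a rough sketch with the AWS CLI, where sg-XXXX stands in for each of the groups above, and some rules may already exist and be reported as duplicates:)

aws ec2 authorize-security-group-ingress --group-id sg-XXXX \
  --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
aws ec2 authorize-security-group-egress --group-id sg-XXXX \
  --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'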

istioctl install
This will install the Istio 1.13.2 default profile with ["Istio core" "Istiod" "Ingress gateways"] components into the cluster. Proceed? (y/N) y
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Installation complete
Making this installation the default for injection and validation.

Et voilà! Ok, now I think I need to work backwards and isolate which ports need to be open, which security group to apply them to, and whether the rules belong on the inbound or outbound side. Once I have those, I can PR it back to terraform-aws-eks and save someone else hours of headache.
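A crude way to test one port at a time is to reuse the sleep pod from earlier; curl's telnet:// scheme just opens a TCP connection, so "Connected to" in the verbose output versus a connect timeout tells you whether the security groups pass that port (a sketch, assuming the sleep sample's curl-based image):

for p in 15010 15012 15014 15017; do
  echo "--- istiod port $p"
  kubectl exec deploy/sleep -n foo -c sleep -- curl -v -m 2 "telnet://istiod.istio-system:$p" || true
done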

[22-04-22] Update 2:

Ultimately, I solved this issue, but ran into one more very common problem that I saw many others run into. The answer was out there, but not in a usable format for the terraform-aws-eks module.

After I was able to get the istioctl install to work correctly:

istioctl install --set profile=demo
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Installation complete
Making this installation the default for injection and validation.

kubectl label namespace default istio-injection=enabled

kubectl apply -f istio-1.13.2/samples/bookinfo/platform/kube/bookinfo.yaml

I saw all the bookinfo pods/deployments fail to start with this:

Internal error occurred: failed calling 
webhook "namespace.sidecar-injector.istio.io": failed to 
call webhook: Post "https://istiod.istio-system.svc:443
/inject?timeout=10s": context deadline exceeded

The answer to this problem is similar to the original one: the right firewall ports / security group rules. I've added a separate answer below for clarity. It contains a complete working solution for AWS-EKS + Terraform + Istio.


There are 4 answers

mitchellmc (BEST ANSWER)

BLUF: Installing Istio on terraform-aws-eks requires you to add security group rules allowing communication within the node group. You need:

  1. To add security group rules (ingress/egress) within the shared node security group that open the Istio ports, so Istio installs correctly.
  2. To add one ingress rule on the node security group, from the control plane (EKS) security group, for port 15017, to resolve the failed calling webhook "namespace.sidecar-injector.istio.io" error.

Unfortunately, I still don't know why this works, since I don't yet understand the order of operations when an Istio-injected pod comes up in a Kubernetes cluster, and who tries to talk to whom.

Research resources

  1. A diagram of the security group architecture for an EKS cluster created by terraform-aws-eks
  2. The ports Istio needs open
  3. A youtube video explaining CNI
  4. The ports Kubernetes uses

Working Example

Please see the comments for which set of rules solves which of the two problems from the original question.

# Ports needed to correctly install Istio, fixing the error message: transport: Error while dialing dial tcp xx.xx.xx.xx:15012: i/o timeout
locals {
  istio_ports = [
    {
      description = "Envoy admin port / outbound"
      from_port   = 15000
      to_port     = 15001
    },
    {
      description = "Debug port"
      from_port   = 15004
      to_port     = 15004
    },
    {
      description = "Envoy inbound"
      from_port   = 15006
      to_port     = 15006
    },
    {
      description = "HBONE mTLS tunnel port / secure networks XDS and CA services (Plaintext)"
      from_port   = 15008
      to_port     = 15010
    },
    {
      description = "XDS and CA services (TLS and mTLS)"
      from_port   = 15012
      to_port     = 15012
    },
    {
      description = "Control plane monitoring"
      from_port   = 15014
      to_port     = 15014
    },
    {
      description = "Webhook container port, forwarded from 443"
      from_port   = 15017
      to_port     = 15017
    },
    {
      description = "Merged Prometheus telemetry from Istio agent, Envoy, and application, Health checks"
      from_port   = 15020
      to_port     = 15021
    },
    {
      description = "DNS port"
      from_port   = 15053
      to_port     = 15053
    },
    {
      description = "Envoy Prometheus telemetry"
      from_port   = 15090
      to_port     = 15090
    },
    {
      description = "aws-load-balancer-controller"
      from_port   = 9443
      to_port     = 9443
    }
  ]

  ingress_rules = {
    for ikey, ivalue in local.istio_ports :
    "${ikey}_ingress" => {
      description = ivalue.description
      protocol    = "tcp"
      from_port   = ivalue.from_port
      to_port     = ivalue.to_port
      type        = "ingress"
      self        = true
    }
  }

  egress_rules = {
    for ekey, evalue in local.istio_ports :
    "${ekey}_egress" => {
      description = evalue.description
      protocol    = "tcp"
      from_port   = evalue.from_port
      to_port     = evalue.to_port
      type        = "egress"
      self        = true
    }
  }
}

# The AWS-EKS Module definition
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "my-test-cluster"
  cluster_version = "1.21"

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {}
    vpc-cni = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  eks_managed_node_group_defaults = {
    disk_size      = 50
    instance_types = ["m5.large"]
  }

  # IMPORTANT
  node_security_group_additional_rules = merge(
    local.ingress_rules,
    local.egress_rules
  )

  eks_managed_node_groups = {
    green_test = {
      min_size     = 1
      max_size     = 2
      desired_size = 2

      instance_types = ["t3.large"]
      capacity_type  = "SPOT"
    }
  }
}

# Port needed to solve the error:
# Internal error occurred: failed calling
# webhook "namespace.sidecar-injector.istio.io": failed to
# call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s":
# context deadline exceeded
resource "aws_security_group_rule" "allow_sidecar_injection" {
  description = "Webhook container port, From Control Plane"
  protocol    = "tcp"
  type        = "ingress"
  from_port   = 15017
  to_port     = 15017

  security_group_id        = module.eks.node_security_group_id
  source_security_group_id = module.eks.cluster_primary_security_group_id
}
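With those rules in place, the two failures above can be re-checked with the same commands from the question, e.g.:

istioctl install --set profile=demo
kubectl label namespace default istio-injection=enabled
kubectl apply -f istio-1.13.2/samples/bookinfo/platform/kube/bookinfo.yaml
kubectl get pods    # bookinfo pods should come up 2/2 (application container + istio-proxy sidecar)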

Please excuse my possibly terrible Terraform syntax usage. Happy Kuberneteing!

noodlemind

This is a common error on bare-metal clusters as well. In most cases it is due to memory (RAM) constraints. To isolate the issue, try the minimal profile rather than demo:

istioctl install --set profile=minimal -y
confiq

@mitchellmc did a great job asking the question and an even better job answering it!

As they said, terraform-aws-eks by default does not allow network communication between the nodes. To allow it, and to avoid problems like these, add this to your module inputs:

  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    egress_all = {
      description      = "Node all egress"
      protocol         = "-1"
      from_port        = 0
      to_port          = 0
      type             = "egress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    }
  }

If you use Istio, AWS security groups are a little redundant, but you should know what you are doing.

Happy Istioing :)

Daniel Andrzejewski

In case you use a self-managed Kubernetes cluster: open TCP traffic on port 15012 on all nodes.

# iptables -I INPUT -p tcp -m tcp --dport 15012 -m state --state NEW -j ACCEPT
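A quick way to confirm from another node that the port now accepts connections (assuming nc is available; substitute the istiod address from the dial error):

nc -vz 10.235.39.26 15012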

After some time it should start working, and the istio-ingressgateway and istio-egressgateway pods should go into the Running state.

This is what I saw in the logs of istio-ingressgateway after I opened port 15012 on the last node:

2023-02-16T08:59:52.164140Z     warn    ca      ca request failed, starting attempt 1 in 101.771077ms
2023-02-16T08:59:52.266661Z     warn    ca      ca request failed, starting attempt 2 in 203.277481ms
2023-02-16T08:59:52.470338Z     warn    ca      ca request failed, starting attempt 3 in 414.02262ms
2023-02-16T08:59:52.885009Z     warn    ca      ca request failed, starting attempt 4 in 802.104302ms
2023-02-16T08:59:53.687642Z     warn    sds     failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.235.39.26:15012: i/o timeout"
2023-02-16T08:59:54.299130Z     warn    ca      ca request failed, starting attempt 1 in 94.651471ms
2023-02-16T08:59:54.394576Z     warn    ca      ca request failed, starting attempt 2 in 208.8855ms
2023-02-16T08:59:54.604578Z     warn    ca      ca request failed, starting attempt 3 in 370.613815ms
2023-02-16T08:59:55.002692Z     info    cache   generated new workload certificate      latency=1.314865678s ttl=23h59m58.997320315s
2023-02-16T08:59:55.002722Z     info    cache   Root cert has changed, start rotating root cert
2023-02-16T08:59:55.002735Z     info    ads     XDS: Incremental Pushing:0 ConnectedEndpoints:0 Version:
2023-02-16T08:59:55.002839Z     info    cache   returned workload trust anchor from cache       ttl=23h59m58.997167622s
2023-02-16T08:59:59.102005Z     info    ads     ADS: new connection for node:istio-ingressgateway-78c66865cc-7prwb.istio-system-15
2023-02-16T08:59:59.102294Z     info    cache   returned workload certificate from cache        ttl=23h59m54.897716667s
2023-02-16T08:59:59.102755Z     info    ads     SDS: PUSH request for node:istio-ingressgateway-78c66865cc-7prwb.istio-system resources:1 size:4.0kB resource:default
2023-02-16T09:00:03.584613Z     info    ads     ADS: new connection for node:istio-ingressgateway-78c66865cc-7prwb.istio-system-16
2023-02-16T09:00:03.585645Z     info    cache   returned workload trust anchor from cache       ttl=23h59m50.414378468s
2023-02-16T09:00:03.586295Z     info    ads     SDS: PUSH request for node:istio-ingressgateway-78c66865cc-7prwb.istio-system resources:1 size:1.1kB resource:ROOTCA
2023-02-16T09:00:04.523791Z     info    Readiness succeeded in 25m10.842678325s
2023-02-16T09:00:04.524790Z     info    Envoy proxy is ready