I set up what I think is a bog-standard EKS cluster using terraform-aws-eks, like so:
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "my-test-cluster"
  cluster_version = "1.21"

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {}
    vpc-cni = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  eks_managed_node_group_defaults = {
    disk_size      = 50
    instance_types = ["m5.large"]
  }

  eks_managed_node_groups = {
    green_test = {
      min_size       = 1
      max_size       = 2
      desired_size   = 2
      instance_types = ["t3.large"]
      capacity_type  = "SPOT"
    }
  }
}
Then I tried to install Istio following the install docs:
istioctl install
which resulted in this:
✔ Istio core installed
✔ Istiod installed
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
Deployment/istio-system/istio-ingressgateway (containers with unready status: [istio-proxy])
- Pruning removed resources Error: failed to install manifests: errors occurred during operation
so I did a bit of digging:
kubectl logs istio-ingressgateway-7fd568fc99-6ql8h -n istio-system
led to
2022-04-17T13:51:14.540346Z warn ca ca request failed, starting attempt 1 in 90.275446ms
2022-04-17T13:51:14.631695Z warn ca ca request failed, starting attempt 2 in 195.118437ms
2022-04-17T13:51:14.827286Z warn ca ca request failed, starting attempt 3 in 394.627125ms
2022-04-17T13:51:15.222738Z warn ca ca request failed, starting attempt 4 in 816.437569ms
2022-04-17T13:51:16.039427Z warn sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:51:33.941084Z warning envoy config StreamAggregatedResources gRPC config stream closed since 318s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:52:05.830859Z warning envoy config StreamAggregatedResources gRPC config stream closed since 350s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:52:26.232441Z warning envoy config StreamAggregatedResources gRPC config stream closed since 370s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
So from a lot of reading, it seems like maybe the istio-ingressgateway pod is not able to connect to istiod?
After some googling, I found this: https://istio.io/latest/docs/ops/diagnostic-tools/proxy-cmd/#verifying-connectivity-to-istiod
kubectl create namespace foo
kubectl apply -f <(istioctl kube-inject -f samples/sleep/sleep.yaml) -n foo
kubectl exec $(kubectl get pod -l app=sleep -n foo -o jsonpath={.items..metadata.name}) -c sleep -n foo -- curl -sS istiod.istio-system:15014/version
which gives me:
curl: (7) Failed to connect to istiod.istio-system port 15014 after 4 ms: Connection refused
command terminated with exit code 7
So I think this problem is not specific to the istio-ingressgateway, but a more general networking issue in a standard EKS cluster?
- How would I go about debugging from here, to figure out what the problem is? Are there good resources to understand the networking model of kubernetes and istio?
- How come the istio platform docs leave off EKS? Does the istio team not want istio to run on AWS-EKS?
- Does this seem like an issue that should be filed against EKS? The aws-eks Terraform module? Istio? I'm not sure exactly where it lands, and it seems that if I ask one team for help, another team would almost certainly need to be involved.
- Are there known incompatibilities with Istio and EKS that I should be aware of?
Thanks in advance!
[22-04-18] Update 1:
OK, so the test with the sleep pod in the foo namespace leads me to believe that the connection timeout has to do with AWS security group rules. The theory is that if the required ports are not opened in the security groups, you'd see the sort of "connection refused" and "i/o timeout" messages that I see. To test the theory, I took the 4 security groups that are created by this module:
- k8s/EKS/Amazon SG
- EKS ENI SG
- EKS Cluster SG
- EKS Shared node group SG
and opened all traffic up inbound/outbound on all of them.
istioctl install
This will install the Istio 1.13.2 default profile with ["Istio core" "Istiod" "Ingress gateways"] components into the cluster. Proceed? (y/N) y
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Installation complete Making this installation the default for injection and validation.
Et voilà! OK, now I think I need to work backwards and isolate -which- ports are needed, which security group to apply them to, and whether they belong on the inbound or outbound side. Once I have those, I can PR it back to terraform-aws-eks and save someone else hours of headache.
[22-04-22] Update 2:
Ultimately, I solved this issue, but then ran into one more very common problem. Many others had hit it and posted answers, but none in a format that was directly usable with the terraform-aws-eks module.
After I was able to get the istioctl install to work correctly:
istioctl install --set profile=demo
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Installation complete Making this installation the default for injection and validation.
kubectl label namespace default istio-injection=enabled
kubectl apply -f istio-1.13.2/samples/bookinfo/platform/kube/bookinfo.yaml
I saw all the bookinfo pods/deployments fail to start with this:
Internal error occurred: failed calling webhook "namespace.sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s": context deadline exceeded
The answer to this problem is similar to the original problem: working firewall ports / security group rules. I've added a separate answer below for clarity. It contains a complete working solution for AWS EKS + Terraform + Istio.
BLUF: Installing Istio on a cluster built with terraform-aws-eks requires you to add security group rules allowing communication within the node group. You also need rules to fix the
failed calling webhook "namespace.sidecar-injector.istio.io"
error.
Unfortunately, I still don't know why this works, since I don't yet understand the order of operations when an Istio-injected pod comes up in a Kubernetes cluster, and who tries to talk to whom.
Research resources
Working Example
Please see the comments for which set of rules solves which of the two problems from the original question.
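As a hedged sketch of what such rules can look like (not necessarily the exact rules used here): the v18 module exposes a `node_security_group_additional_rules` input that adds rules to the shared node security group. The rule keys below are arbitrary labels I picked for illustration. Port 15017 is an assumption based on istiod's default webhook port behind the `istiod.istio-system.svc:443` Service. This is a fragment to merge into the existing `module "eks"` block, not a standalone configuration.

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  # ... same arguments as in the question (cluster_name, vpc_id, node groups, etc.) ...

  node_security_group_additional_rules = {
    # Problem 1 (istioctl install timing out on the ingress gateway):
    # allow nodes in the node security group to reach each other, so that
    # pods on one node can reach istiod (e.g. port 15012) on another node.
    ingress_self_all = {
      description = "Node to node, all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    egress_self_all = {
      description = "Node to node, all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "egress"
      self        = true
    }

    # Problem 2 (failed calling webhook "namespace.sidecar-injector.istio.io"):
    # allow the EKS control plane to reach the istiod webhook port on the nodes.
    # 15017 is assumed to be the istiod webhook container port (Istio default).
    ingress_cluster_15017 = {
      description                   = "Control plane to istiod sidecar injector webhook"
      protocol                      = "tcp"
      from_port                     = 15017
      to_port                       = 15017
      type                          = "ingress"
      source_cluster_security_group = true
    }
  }
}
```

Opening all node-to-node traffic is the broad version of the fix. Narrowing it to specific Istio ports (15012 is the one that times out in the logs above) should also be possible, but I have not verified the minimal set.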
Please excuse my possibly terrible Terraform syntax usage. Happy Kuberneteing!