EC2 instance can't access Amazon Linux repos (e.g. amazon-linux-extras install docker) through S3 gateway endpoint


I'm having S3 endpoint grief. When my instances initialize they cannot install docker. Details:

I have ASG instances sitting in a VPC with public and private subnets. Appropriate routing and EIP/NAT is all stitched up. Instances in private subnets have outbound 0.0.0.0/0 routed to the NAT in their respective public subnets. NACLs for the public subnets allow internet traffic in and out; the NACLs around the private subnets allow traffic from the public subnets in and out, traffic out to the internet, and traffic from the S3 CIDRs in and out. I want it pretty locked down.

  • I have DNS and hostnames enabled in my VPC
  • I understand NACLs are stateless and have enabled IN and OUTBOUND rules for s3 amazon IP cidr blocks on ephemeral port ranges (yes I have also enabled traffic between pub and private subnets)
  • yes I have checked a route was provisioned for my s3 endpoint in my private route tables
  • yes I know for sure it is the S3 endpoint causing me grief and not another blunder -> when I delete it and open up my NACLs I can yum update and install docker (as expected). I am not looking for suggestions that require opening up my NACLs; I'm using a VPC gateway endpoint because I want to keep things locked down in the private subnets. I mention this because similar discussions seem to say 'I opened 0.0.0.0/0 on all ports and now x works'
  • Should I just bake an AMI with docker installed? That's what I'll do if I can't resolve this. I really wanted to set up my networking so everything is nicely locked down, and it feels like it should be pretty straightforward using endpoints. This is largely a networking exercise, so I would rather not bake an AMI, because that avoids solving and understanding the problem.
  • I know my other VPC endpoints work perfectly -> the Auto Scaling interface endpoint is performing (I can see it scaling down instances as per the policy), the SSM interface endpoint is allowing me to use Session Manager, and the ECR endpoint(s) are working in conjunction with the S3 gateway endpoint (the S3 gateway endpoint is required because image layers are in S3). I know this because if I open up my NACLs, delete my S3 endpoint and install docker, then lock everything down again and bring back my S3 gateway endpoint, I can successfully pull my ECR images. So the S3 gateway endpoint is fine for accessing ECR image layers, but not the amazon-linux-extras repos.
  • SGs attached to instances are not the problem (instances have default outbound rule)
  • I have tried adding increasingly generous policies to my S3 endpoint, as I have seen in this 7-year-old thread, and thought that had to do the trick (yes, I subbed in my region correctly)
  • I strongly feel the solution lies with the S3 gateway endpoint policy as discussed in this thread, however I have had little luck with my increasingly desperate policies.

Amazon EC2 instance can't update or use yum

Another S3 struggle, with a resolution:

https://blog.saieva.com/2020/08/17/aws-s3-endpoint-gateway-access-for-linux-2-amis-resolving-http-403-forbidden-error/

I have tried:

  S3Endpoint:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - 's3:GetObject'
            Resource:
              - 'arn:aws:s3:::prod-ap-southeast-2-starport-layer-bucket/*'
              - 'arn:aws:s3:::packages.*.amazonaws.com/*'
              - 'arn:aws:s3:::repo.*.amazonaws.com/*'
              - 'arn:aws:s3:::amazonlinux-2-repos-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/*'
              - 'arn:aws:s3:::amazonlinux.*.amazonaws.com/*'
              - 'arn:aws:s3:::*.amazonaws.com'
              - 'arn:aws:s3:::*.amazonaws.com/*'
              - 'arn:aws:s3:::*.ap-southeast-2.amazonaws.com/*'
              - 'arn:aws:s3:::*.ap-southeast-2.amazonaws.com/'
              - 'arn:aws:s3:::*repos.ap-southeast-2-.amazonaws.com'
              - 'arn:aws:s3:::*repos.ap-southeast-2.amazonaws.com/*'
              - 'arn:aws:s3:::repo.ap-southeast-2-.amazonaws.com'
              - 'arn:aws:s3:::repo.ap-southeast-2.amazonaws.com/*'
      RouteTableIds:
        - !Ref PrivateRouteTableA
        - !Ref PrivateRouteTableB
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.s3'
      VpcId: !Ref BasicVpc
      VpcEndpointType: Gateway

(As you can see: very desperate.) The first Resource entry is required for the ECR interface endpoints to pull image layers from S3; all of the others are attempts to reach the amazon-linux-extras repos.

Below is the behavior on initialization, which I have recreated by connecting with Session Manager through the SSM endpoint:

https://aws.amazon.com/premiumsupport/knowledge-center/connect-s3-vpc-endpoint/

I cannot yum install or update:

[root@ip-10-0-3-120 bin]# yum install docker -y

Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Could not retrieve mirrorlist https://amazonlinux-2-repos-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/2/core/latest/x86_64/mirror.list error was
14: HTTPS Error 403 - Forbidden

One of the configured repositories failed (Unknown), and yum doesn't have enough cached data to continue. At this point the only safe thing yum can do is fail. There are a few ways to work "fix" this:

 1. Contact the upstream for the repository and get them to fix the problem.

 2. Reconfigure the baseurl/etc. for the repository, to point to a working
    upstream. This is most often useful if you are using a newer
    distribution release than is supported by the repository (and the
    packages for the previous distribution release still work).

 3. Run the command with the repository temporarily disabled
        yum --disablerepo=<repoid> ...

 4. Disable the repository permanently, so yum won't use it by default. Yum
    will then just ignore the repository until you permanently enable it
    again or use --enablerepo for temporary usage:

        yum-config-manager --disable <repoid>
    or
        subscription-manager repos --disable=<repoid>

 5. Configure the failing repository to be skipped, if it is unavailable.
    Note that yum will try to contact the repo. when it runs most commands,
    so will have to try and fail each time (and thus. yum will be be much
    slower). If it is a very temporary problem though, this is often a nice
    compromise:

        yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true

Cannot find a valid baseurl for repo: amzn2-core/2/x86_64

and cannot run:

amazon-linux-extras install docker

Catalog is not reachable. Try again later.

catalogs at https://amazonlinux-2-repos-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/2/extras-catalog-x86_64-v2.json, https://amazonlinux-2-repos-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/2/extras-catalog-x86_64.json
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/amazon_linux_extras/software_catalog.py", line 131, in fetch_new_catalog
    request = urlopen(url)
  File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden

Any gotchas I've missed? I'm very stuck here. I am familiar with basic VPC networking, NACLs and VPC endpoints (the ones I've used, at least), and I have followed the troubleshooting guide (although I already had everything set up as outlined).

I feel the s3 policy is the problem here OR the mirror list. Many thanks if you bothered to read all that! Thoughts?


There are 3 answers

Nick (BEST ANSWER)

By the looks of it, you are well aware of what you are trying to achieve. Even though you are saying that it is not the NACLs, I would check them one more time, as sometimes one can easily overlook something minor. Take into account the snippet below taken from this AWS troubleshooting article and make sure that you have the right S3 CIDRs in your rules for the respective region:

Make sure that the network ACLs associated with your EC2 instance's subnet allow the following:
  • Egress on port 80 (HTTP) and 443 (HTTPS) to the Regional S3 service.
  • Ingress on ephemeral TCP ports from the Regional S3 service. Ephemeral ports are 1024-65535.
The Regional S3 service is the CIDR for the subnet containing your S3 interface endpoint. Or, if you're using an S3 gateway, the Regional S3 service is the public IP CIDR for the S3 service. Network ACLs don't support prefix lists. To add the S3 CIDR to your network ACL, use 0.0.0.0/0 as the S3 CIDR. You can also add the actual S3 CIDRs into the ACL. However, keep in mind that the S3 CIDRs can change at any time.

Your S3 endpoint policy looks good to me on first look, but you are right that it is very likely that the policy or the endpoint configuration in general could be the cause, so I would re-check it one more time too.

One additional thing that I have observed before is that, depending on the AMI you use and your VPC settings (DHCP options set, DNS, etc.), sometimes the EC2 instance cannot properly set its default region in the yum config. Please check whether the files awsregion and awsdomain exist within the /etc/yum/vars directory and what their contents are. In your use case, awsregion should have:

$ cat /etc/yum/vars/awsregion
ap-southeast-2

You can check whether the DNS resolving on your instance is working properly with:

dig amazonlinux.ap-southeast-2.amazonaws.com

If DNS seems to be working fine, you can compare whether the IP in the output resides within the ranges you have allowed in your NACLs.

EDIT:

After having a second look, this line is a bit stricter than it should be: arn:aws:s3:::amazonlinux-2-repos-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/*

According to the docs it should be something like:

arn:aws:s3:::amazonlinux-2-repos-ap-southeast-2/*
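The distinction matters because S3 object ARNs are built from the bucket name alone, never from the bucket's DNS hostname. As a rough sanity check (a sketch, assuming plain bash), you can recover the bucket name from the failing mirror-list URL and form the ARN yourself:

```shell
# S3 ARNs use the bucket NAME, not the bucket's URL/hostname.
# Recover the bucket name from the mirror-list URL that yum printed.
url="https://amazonlinux-2-repos-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/2/core/latest/x86_64/mirror.list"
host_and_path="${url#https://}"      # drop the scheme
bucket="${host_and_path%%.s3.*}"     # keep everything before ".s3.<region>..."
echo "arn:aws:s3:::${bucket}/*"
# → arn:aws:s3:::amazonlinux-2-repos-ap-southeast-2/*
```

The result matches the ARN from the docs; the hostname-style Resource entries in the original policy can never match a real object ARN, which is why they had no effect.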

GorginZ

Hi @nick (https://stackoverflow.com/users/9405602/nick), these are excellent suggestions. I'm writing an answer because the troubleshooting will be valuable for others, plus the character limit in comments.

The problem is definitely the policy.


sh-4.2$ cat /etc/yum/vars/awsregion
ap-southeast-2

dig:


sh-4.2$ dig amazonlinux.ap-southeast-2.amazonaws.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.5.2 <<>> amazonlinux.ap-southeast-2.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 598
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;amazonlinux.ap-southeast-2.amazonaws.com. IN A

;; ANSWER SECTION:
amazonlinux.ap-southeast-2.amazonaws.com. 278 IN CNAME s3.dualstack.ap-southeast-2.amazonaws.com.
s3.dualstack.ap-southeast-2.amazonaws.com. 2 IN A 52.95.134.91

;; Query time: 4 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Mon Sep 20 00:03:36 UTC 2021
;; MSG SIZE  rcvd: 112


let's check in on the NACLs:

NACL OUTBOUND RULES:

Rule #  Type         Protocol  Port  Destination      Allow/Deny
100     All traffic  All       All   0.0.0.0/0        Allow
101     All traffic  All       All   52.95.128.0/21   Allow
150     All traffic  All       All   3.5.164.0/22     Allow
200     All traffic  All       All   3.5.168.0/23     Allow
250     All traffic  All       All   3.26.88.0/28     Allow
300     All traffic  All       All   3.26.88.16/28    Allow
*       All traffic  All       All   0.0.0.0/0        Deny

NACL INBOUND RULES:

Rule #  Type         Protocol  Port  Source           Allow/Deny
100     All traffic  All       All   10.0.0.0/24      Allow
150     All traffic  All       All   10.0.1.0/24      Allow
200     All traffic  All       All   10.0.2.0/24      Allow
250     All traffic  All       All   10.0.3.0/24      Allow
400     All traffic  All       All   52.95.128.0/21   Allow
450     All traffic  All       All   3.5.164.0/22     Allow
500     All traffic  All       All   3.5.168.0/23     Allow
550     All traffic  All       All   3.26.88.0/28     Allow
600     All traffic  All       All   3.26.88.16/28    Allow
*       All traffic  All       All   0.0.0.0/0        Deny

So '52.95.134.91' is captured by outbound rule 101 and inbound rule 400, so that looks good NACL-wise. (Future people troubleshooting: this is what you should look for.)
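For anyone repeating that check, the "is this address inside this rule's CIDR" test can be scripted rather than eyeballed. A minimal sketch, assuming plain bash and IPv4 only (ip_in_cidr is a hypothetical helper written here, not an AWS tool):

```shell
# Pure-bash IPv4 check: does $1 fall inside CIDR $2?
ip_in_cidr() {
  local ip=$1 cidr=$2
  local net=${cidr%/*} bits=${cidr#*/}
  local IFS=.
  local -a a=($ip) b=($net)
  # Pack both dotted quads into 32-bit integers.
  local ipn=$(( (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | a[3] ))
  local netn=$(( (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3] ))
  # Build the netmask; /0 means "match everything".
  local mask=$(( bits == 0 ? 0 : (0xffffffff << (32 - bits)) & 0xffffffff ))
  (( (ipn & mask) == (netn & mask) ))
}

# The address dig returned, against outbound rule 101 / inbound rule 400:
ip_in_cidr 52.95.134.91 52.95.128.0/21 && echo matched   # prints "matched"
```

Handy when the S3 answer section rotates through different A records and you want to confirm each one is still covered by a rule.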

Also, regarding those CIDR blocks: my deploy script pulls them from the current AWS IP ranges list, grabs out the S3 ones for ap-southeast-2 with jq, and passes them as parameters to the CloudFormation deploy.

docs on how to do that for others: https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html#aws-ip-download
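For others following along, the jq filter below follows the approach in those docs. The JSON here is a trimmed, illustrative sample with the same shape as ip-ranges.json (the real file lives at the URL in the docs above; these prefix values are examples only):

```shell
# Trimmed stand-in for https://ip-ranges.amazonaws.com/ip-ranges.json
cat <<'EOF' > /tmp/ip-ranges-sample.json
{"prefixes":[
  {"ip_prefix":"52.95.128.0/21","region":"ap-southeast-2","service":"S3"},
  {"ip_prefix":"3.5.164.0/22","region":"ap-southeast-2","service":"S3"},
  {"ip_prefix":"13.210.0.0/15","region":"ap-southeast-2","service":"EC2"}
]}
EOF

# Keep only the S3 prefixes for ap-southeast-2.
jq -r '.prefixes[] | select(.service=="S3" and .region=="ap-southeast-2") | .ip_prefix' \
  /tmp/ip-ranges-sample.json
```

Against the real file, point the same filter at the downloaded JSON; the output is the list of CIDRs to feed into the NACL rules.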

Another note: you might notice the outbound 0.0.0.0/0 rule. I realize (and for other people looking, please note) this makes the other outbound rules redundant; I just put it in 'in case' while fiddling (and removed the outbound rules to the public subnets). Private subnet traffic outbound to 0.0.0.0/0 is routed to the respective NATs in the public subnets. I'll add outbound rules for my public subnets and remove this rule at some point.

Subnetting at the moment is simply:
  VPC:    10.0.0.0/16
  pub a:  10.0.0.0/24
  pub b:  10.0.1.0/24
  priv a: 10.0.2.0/24
  priv b: 10.0.3.0/24

So the outbound rules for the pub a and pub b blocks will be re-introduced so I can remove the allow on 0.0.0.0/0.


I am now sure it is the policy.

I just click-ops amended the policy in the console to 'full access' to give that a crack, and had success.

My guess is the mirror list makes it hard to pin down what to explicitly allow, so even though I cast the net broadly I wasn't capturing the required bucket. But I don't know much about how the AWS mirrors work, so that's a guess.

I probably don't want a super duper permissive policy, so this isn't really a fix but it confirms where the issue is.
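For anyone tightening this back down afterwards: given Nick's corrected ARN, a minimal endpoint policy only needs the ECR layer bucket plus the regional Amazon Linux 2 repo bucket. A sketch (bucket names as used in this thread; verify them for your own region):

```yaml
PolicyDocument:
  Version: 2012-10-17
  Statement:
    - Effect: Allow
      Principal: '*'
      Action:
        - 's3:GetObject'
      Resource:
        # ECR image layers (required by the ECR interface endpoints):
        - 'arn:aws:s3:::prod-ap-southeast-2-starport-layer-bucket/*'
        # Amazon Linux 2 yum/extras repos (bucket NAME, not hostname):
        - 'arn:aws:s3:::amazonlinux-2-repos-ap-southeast-2/*'
```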

Technobeats

I had a similar issue: running "amazon-linux-extras" wasn't doing anything at all.

The problem was that the instance had both IPv4 and IPv6, and IPv6 wasn't working properly in our outbound network path. Disabling IPv6 solved it.