I have a question about Google Cloud DNS:

  • I would like to redirect traffic from one GCP project to another, keeping both underlying domains working and with zero downtime during the migration. I've read that DNS authorization is a good way to do this, but first I'll add some context to better explain the issue.

Some years ago, a cloud service was developed and deployed in a GCP project with a certain domain (let's call it the legacy domain); a rough Terraform sketch of its DNS records follows the list:

Infrastructure 1:

  • Project: legacyenv-mycloudservice
  • Managed zone:
    • name: legacydomain.com
    • A: legacydomain.com (IP pointing to the Nginx load balancer in GKE)
    • CNAME: myservice1.legacydomain.com
    • CNAME: myservice2.legacydomain.com
    • CNAME: myservice3.legacydomain.com
  • GKE:
    • Nginx controller
    • Cert Manager
      • One Ingress per service listed in the managed zone CNAMEs
      • One certificate per service listed in the managed zone CNAMEs

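For reference, this is roughly how the legacy DNS setup would look in Terraform terms. It was actually created manually, so this is only a sketch; the zone name, TTL and IP below are placeholders:

```hcl
# Hypothetical sketch of the legacy zone (the real one was created manually, not with Terraform).
resource "google_dns_managed_zone" "legacy" {
  project  = "legacyenv-mycloudservice"
  name     = "legacydomain-zone"       # placeholder zone name
  dns_name = "legacydomain.com."
}

# Apex A record pointing to the IP of the Nginx load balancer in GKE (placeholder IP).
resource "google_dns_record_set" "apex" {
  project      = "legacyenv-mycloudservice"
  managed_zone = google_dns_managed_zone.legacy.name
  name         = "legacydomain.com."
  type         = "A"
  ttl          = 300
  rrdatas      = ["203.0.113.10"]
}

# One CNAME per service, all resolving through the apex record.
resource "google_dns_record_set" "legacy_cname" {
  for_each     = toset(["myservice1", "myservice2", "myservice3"])
  project      = "legacyenv-mycloudservice"
  managed_zone = google_dns_managed_zone.legacy.name
  name         = "${each.key}.legacydomain.com."
  type         = "CNAME"
  ttl          = 300
  rrdatas      = ["legacydomain.com."]
}
```
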
Infrastructure 2:

After building the above infrastructure manually, I started developing other apps following an infrastructure-as-code approach with Terraform. I recently created a new version of the infrastructure from scratch, and it is currently working with the new domain and the same underlying cloud services. However, it also has some small changes (a Terraform sketch of the relevant pieces follows the list):

  • Project: newenv-mycloudservice
  • Managed zone:
    • name: newdomain.com
    • A: myservice1.newdomain.com (IP pointing to Google's load balancer)
    • CAA: myservice1.newdomain.com
    • A: myservice2.newdomain.com (IP pointing to Google's load balancer)
    • CAA: myservice2.newdomain.com
    • A: myservice3.newdomain.com (IP pointing to Google's load balancer)
    • CAA: myservice3.newdomain.com
  • Load balancer:
    • A backend service is defined and attached to each GKE service using NEGs
    • Each host/path is matched against its corresponding backend service
    • Each domain has a Google-managed SSL certificate
  • GKE:
    • Each service defined by using NEG

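Sketch of the new environment's DNS records and managed certificates. Resource names, the zone name, the IP and the CAA value are placeholders, and the backend services/NEGs are omitted:

```hcl
locals {
  services = toset(["myservice1", "myservice2", "myservice3"])
}

# A record per service, all pointing at the global IP of Google's load balancer (placeholder IP).
resource "google_dns_record_set" "new_a" {
  for_each     = local.services
  project      = "newenv-mycloudservice"
  managed_zone = "newdomain-zone"            # placeholder zone name
  name         = "${each.key}.newdomain.com."
  type         = "A"
  ttl          = 300
  rrdatas      = ["203.0.113.20"]
}

# CAA record per service (assumed value restricting issuance to Google's CA).
resource "google_dns_record_set" "new_caa" {
  for_each     = local.services
  project      = "newenv-mycloudservice"
  managed_zone = "newdomain-zone"
  name         = "${each.key}.newdomain.com."
  type         = "CAA"
  ttl          = 300
  rrdatas      = ["0 issue \"pki.goog\""]
}

# One Google-managed certificate per domain, referenced from the HTTPS proxy.
resource "google_compute_managed_ssl_certificate" "new" {
  for_each = local.services
  project  = "newenv-mycloudservice"
  name     = "${each.key}-newdomain-cert"
  managed {
    domains = ["${each.key}.newdomain.com"]
  }
}
```
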
As I said, this environment works, but it receives no real traffic, since systems around the world are still pointing to the legacy domain.

I would now like to migrate so that {service}.legacydomain.com points to the same IP as {service}.newdomain.com. To do so, I first used a test GCP project (let's call it test-legacyenv-mycloudservice) to emulate legacyenv-mycloudservice without interfering with production traffic:

  • In test-legacyenv-mycloudservice I created an exact replica of the GKE cluster in legacyenv-mycloudservice, including Nginx, Cert-Manager and all the services
  • Additionally, I added 3 extra {service}.testenvironment.legacydomain.com CNAME record sets in legacyenv-mycloudservice pointing to the Nginx load balancer of the testing GKE cluster (see the sketch after this list)
  • I issued 3 certificates via Cert-Manager for the {service}.testenvironment.legacydomain.com Ingresses
  • This way I could attempt to emulate the domain migration without any risk to production
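
The extra test records were added roughly like this. This is only a sketch: the zone name is a placeholder, and I'm assuming the CNAMEs point at a hostname that resolves to the test cluster's Nginx load-balancer IP:

```hcl
# Temporary test CNAMEs in the production legacydomain.com zone,
# sending traffic to the Nginx load balancer of the test GKE cluster.
resource "google_dns_record_set" "test_cname" {
  for_each     = toset(["myservice1", "myservice2", "myservice3"])
  project      = "legacyenv-mycloudservice"
  managed_zone = "legacydomain-zone"                           # placeholder zone name
  name         = "${each.key}.testenvironment.legacydomain.com."
  type         = "CNAME"
  ttl          = 300
  rrdatas      = ["nginx.testenvironment.legacydomain.com."]   # assumed hostname of the test Nginx LB
}
```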

I attempted the migration by following these steps (a Terraform sketch of the relevant changes follows the list):

  • In newenv-mycloudservice: modify the google_compute_url_map resource by adding a host_rule and a path_matcher for each of the 3 {service}.testenvironment.legacydomain.com domains, matched with their corresponding backend services.
  • In test-legacyenv-mycloudservice: scale down the GKE deployments to cut traffic to all the services
  • In legacyenv-mycloudservice: delete all {service}.testenvironment.legacydomain.com CNAME record sets
  • In legacyenv-mycloudservice: create additional A/CAA record sets for each of the 3 {service}.testenvironment.legacydomain.com domains.
    • All the A records point to Google's load balancer in newenv-mycloudservice
  • In newenv-mycloudservice: create a google_compute_managed_ssl_certificate resource for each of the 3 {service}.testenvironment.legacydomain.com domains and add them to the google_compute_target_https_proxy

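The Terraform side of those steps looked roughly like this (a sketch with placeholder names; the backend service and existing certificate IDs are passed in as variables, and the A/CAA record sets added in legacyenv-mycloudservice mirror the ones shown earlier for newdomain.com):

```hcl
variable "backend_service_ids" {
  # Map of service name -> backend service self_link (defined elsewhere in the config).
  type = map(string)
}

variable "existing_certificate_ids" {
  # IDs of the newdomain certificates already attached to the HTTPS proxy.
  type = list(string)
}

locals {
  test_services = toset(["myservice1", "myservice2", "myservice3"])
}

# newenv-mycloudservice: extra host rules and path matchers on the existing URL map,
# one per {service}.testenvironment.legacydomain.com domain.
resource "google_compute_url_map" "main" {
  project         = "newenv-mycloudservice"
  name            = "main-url-map"
  default_service = var.backend_service_ids["myservice1"]

  dynamic "host_rule" {
    for_each = local.test_services
    content {
      hosts        = ["${host_rule.value}.testenvironment.legacydomain.com"]
      path_matcher = "${host_rule.value}-test"
    }
  }

  dynamic "path_matcher" {
    for_each = local.test_services
    content {
      name            = "${path_matcher.value}-test"
      default_service = var.backend_service_ids[path_matcher.value]
    }
  }
}

# newenv-mycloudservice: one Google-managed certificate per test domain.
resource "google_compute_managed_ssl_certificate" "test" {
  for_each = local.test_services
  project  = "newenv-mycloudservice"
  name     = "${each.key}-testenvironment-cert"
  managed {
    domains = ["${each.key}.testenvironment.legacydomain.com"]
  }
}

# newenv-mycloudservice: keep the existing newdomain certificates and add the test ones.
resource "google_compute_target_https_proxy" "main" {
  project = "newenv-mycloudservice"
  name    = "main-https-proxy"
  url_map = google_compute_url_map.main.id
  ssl_certificates = concat(
    var.existing_certificate_ids,
    [for c in google_compute_managed_ssl_certificate.test : c.id]
  )
}
```
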
The above steps took between 15 and 25 minutes until the IP change propagated, the certificates were issued, and traffic was correctly and securely reaching the new environment. Although this works, I would like to accomplish the same task with zero downtime.

I have been following this documentation about the DNS authorization approach, which looks like a good option when migrating from other providers such as AWS. I attempted to reproduce the steps but got stuck at the certificate issuance step: after a few minutes a CONFIG error was raised, which according to the docs indicates a DNS configuration error. In my opinion the conflict comes from creating a legacydomain.com managed zone inside newenv-mycloudservice (one of the prerequisites for using DNS authorization), while that zone already exists in legacyenv-mycloudservice. Below is roughly what I tried.
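
This is approximately the Terraform equivalent of what I attempted (names are placeholders; I'm assuming Certificate Manager DNS authorizations, which return a CNAME validation record that, as far as I understand, must be resolvable from the public internet, i.e. published in the zone the registrar actually delegates to):

```hcl
# newenv-mycloudservice: one DNS authorization per domain. Each authorization exposes a
# CNAME validation record (_acme-challenge.<domain> -> <token>.authorize.certificatemanager.goog)
# that must be resolvable on the public internet.
resource "google_certificate_manager_dns_authorization" "legacy" {
  for_each = toset(["myservice1", "myservice2", "myservice3"])
  project  = "newenv-mycloudservice"
  name     = "${each.key}-legacydomain-authz"
  domain   = "${each.key}.legacydomain.com"
}

# Certificate Manager certificate validated through the DNS authorizations above.
resource "google_certificate_manager_certificate" "legacy" {
  for_each = toset(["myservice1", "myservice2", "myservice3"])
  project  = "newenv-mycloudservice"
  name     = "${each.key}-legacydomain-cert"
  managed {
    domains            = ["${each.key}.legacydomain.com"]
    dns_authorizations = [google_certificate_manager_dns_authorization.legacy[each.key].id]
  }
}

# The validation CNAMEs: I published them in the overlapping legacydomain.com zone inside
# newenv-mycloudservice, but my suspicion is that they need to live in the legacydomain.com
# zone in legacyenv-mycloudservice, since that is the zone the registrar actually delegates to.
resource "google_dns_record_set" "authz_cname" {
  for_each     = google_certificate_manager_dns_authorization.legacy
  project      = "newenv-mycloudservice"
  managed_zone = "legacydomain-overlap-zone"   # the duplicated zone, placeholder name
  name         = each.value.dns_resource_record[0].name
  type         = each.value.dns_resource_record[0].type
  ttl          = 300
  rrdatas      = [each.value.dns_resource_record[0].data]
}
```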

In another Stack Overflow thread, redirecting the name servers from one project to the other is suggested, but I'm not sure it is the solution. The original managed zone and name server configuration live in legacyenv-mycloudservice, so I would prefer not to touch them without being 100% sure this approach will work without interfering with production traffic.

Could someone please tell me whether the overlapping-DNS-zones approach mentioned above could allow the certificate issuance to complete and, as a consequence, the domain migration to be performed with no downtime? If not, what would be the steps to follow?

Thanks in advance.
