How to debug hashicorp vault timeouts?


Every night I run a large Ansible playbook (30-40 minutes) that uses the hashi_vault lookup plugin to read some variables from Vault. Sometimes (not every day) I get this error:

Error was a <class 'ansible.errors.AnsibleError'>, original message: An unhandled exception occurred while running the lookup plugin 'hashi_vault'. Error was a <class 'requests.exceptions.ConnectTimeout'>, original message: HTTPSConnectionPool(host='vault.totalbattle.tech', port=443): Max retries exceeded with url: /v1/auth/approle/login (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0707e26dc0>, 'Connection to xxx timed out. (connect timeout=30)')). HTTPSConnectionPool(host='xxx', port=443): Max retries exceeded with url: /v1/auth/approle/login (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0707e26dc0>, 'Connection to xxx timed out. (connect timeout=30)'))"}

If I run small playbooks that use the hashi_vault plugin only a couple of times, everything is fine.

My Vault cluster is deployed on 5 hosts behind an external Google load balancer.

I checked the Google LB logs but didn't find anything interesting. I can't tell where exactly the problem is: on the LB or in Vault.


There are 2 answers

0
jakub-zieba On

One option would be to raise the timeout set for the Ansible task; there may be some network lag at the place you run the playbook from. According to the official documentation, the timeout is customizable.

Another option would be running:

curl -I -vvv <your_vault_url>

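This traces the HTTP connection behavior. Either way, the timeout most likely occurs between your Ansible client and the load balancer: the error is a ConnectTimeout, which means the TCP connection was never established in the first place.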
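Since the failure is intermittent, a single curl run may not catch it. A minimal sketch of a probe that repeatedly times the TCP connect and logs each attempt is shown here; the function names, the interval, and the host `vault.example.com` are placeholders I made up for illustration, not anything taken from the question:

```python
import socket
import time
from typing import List, Optional


def time_tcp_connect(host: str, port: int = 443, timeout: float = 30.0) -> Optional[float]:
    """Return the seconds needed to open a TCP connection, or None if it failed or timed out."""
    start = time.monotonic()
    try:
        # Covers DNS resolution plus the TCP handshake; TLS is not negotiated here.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # socket.timeout, ConnectionRefusedError and DNS errors are all OSError subclasses.
        return None


def probe(host: str, port: int = 443, count: int = 5,
          interval: float = 1.0, timeout: float = 30.0) -> List[Optional[float]]:
    """Run `count` connect attempts, print one timestamped line per attempt, return the timings."""
    results = []
    for i in range(count):
        elapsed = time_tcp_connect(host, port, timeout)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        status = "FAILED" if elapsed is None else f"{elapsed * 1000:.1f} ms"
        print(f"{stamp} {host}:{port} connect {status}")
        results.append(elapsed)
        if i < count - 1:
            time.sleep(interval)
    return results


# Example with a hypothetical host: one attempt every 10 seconds for 10 minutes.
# probe("vault.example.com", count=60, interval=10.0)
```

Running something like this from the Ansible control host during the nightly window, and comparing the timestamps of failed attempts with the load balancer logs, may help show whether the connection even reaches the LB.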

0
ixe013 On

This is hard to debug, and even harder over Stack Overflow. But you tagged Google Cloud, so maybe you are running your instance in Kubernetes?

I would look into the livenessProbe of your Vault nodes. Maybe Vault is running just fine, but something that monitors its health fails and declares the instance dead. You can translate this advice to the equivalent concept on the platform you use to run Vault. Your load balancer might do a similar liveness test, but I would start at the instance level.

When that happens, the instance restarts, and if a request comes in while the cluster is stuck in an election loop (or while every node happens to be down at the same time), you will get a timeout.
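To see what a health monitor would see, you can query each node's `/v1/sys/health` endpoint directly, bypassing the load balancer. The sketch below maps Vault's default health status codes to their meaning; the function names and the node URLs in the comment are placeholders I made up, and whether your probe or LB actually uses this endpoint is an assumption to verify:

```python
import urllib.error
import urllib.request

# Default status codes returned by Vault's /v1/sys/health endpoint.
HEALTH_STATUS = {
    200: "initialized, unsealed, active",
    429: "unsealed, standby",
    472: "disaster recovery secondary, active",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}


def describe_health_status(code: int) -> str:
    """Translate a /v1/sys/health HTTP status code into a readable state."""
    return HEALTH_STATUS.get(code, f"unexpected status {code}")


def check_node(base_url: str, timeout: float = 5.0) -> str:
    """Query one Vault node's health endpoint and describe the result."""
    url = base_url.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return describe_health_status(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises for non-2xx responses; the status code is still meaningful.
        return describe_health_status(exc.code)
    except OSError as exc:
        # Connection refused, DNS failure, timeout (URLError is an OSError subclass).
        return f"unreachable: {exc}"


# Example with hypothetical node addresses:
# for node in ["https://vault-1.example.com:8200", "https://vault-2.example.com:8200"]:
#     print(node, "->", check_node(node))
```

Note that standby nodes answer 429 by default, so a health check that only accepts 200 would consider every node except the active one unhealthy. The endpoint accepts a `standbyok=true` query parameter to make standbys return 200 instead, which may be worth checking in your probe or LB configuration.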