We have a Compute Engine instance hosted by Google that locks up seemingly at random. SSH will not connect, and the only way to recover appears to be forcing a restart of the machine.
When I checked the logs in the monitor, I could see that right at 2 AM, when my application monitoring tool says things went down, the CPU spiked and then dropped, disk I/O spiked, and memory and disk space utilization appear to have stopped being reported.
A few high-severity log entries also appeared.
Looking them up, I see a connection with fluent-bit. That issue was fixed in version 2.15.0 of the google-cloud-ops-agent, but my version is 2.27.0, so that should be fine.
dpkg -l | grep google-cloud-ops-agent
ii google-cloud-ops-agent 2.27.0~debian11 amd64 Google Cloud Ops Agent
What would the right next step be to investigate the root cause of these lockups?


