I'm looking for a general way to collect performance metrics on several Linux VM instances (Azure, GCP, other) and monitor the metrics in GCP.
On an Ubuntu VM in Azure, I have installed the Google Cloud Ops Agent, which uses Fluent Bit (to collect logs) and OpenTelemetry (to collect performance metrics) behind the scenes.
I added overrides for the two services to set environment variables so that they pick up the service account JSON credentials file, as follows:
google-cloud-ops-agent-fluent-bit.service: GOOGLE_SERVICE_CREDENTIALS
google-cloud-ops-agent-opentelemetry-collector.service: GOOGLE_APPLICATION_CREDENTIALS
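For reference, the overrides can be sketched roughly as below. This is a minimal sketch rather than the exact files on my VM: the credentials path in CREDS is a hypothetical placeholder, write_override is a helper name made up for this example, and the drop-ins are written to a local staging directory (STAGE, also an assumption) so they can be reviewed before being copied into /etc/systemd/system.

```shell
#!/bin/sh
# Staging directory for the generated drop-ins (assumption: reviewed by hand
# and then copied into /etc/systemd/system).
STAGE="${STAGE:-./ops-agent-overrides}"

# Hypothetical location of the service account JSON key file.
CREDS="${CREDS:-/etc/google/auth/service-account.json}"

# write_override UNIT VARIABLE VALUE
# Creates STAGE/UNIT.d/override.conf containing a single Environment= line.
write_override() {
    unit=$1
    var=$2
    value=$3
    dir="$STAGE/$unit.d"
    mkdir -p "$dir"
    printf '[Service]\nEnvironment="%s=%s"\n' "$var" "$value" > "$dir/override.conf"
}

# One variable per unit, as listed above.
write_override google-cloud-ops-agent-fluent-bit.service \
    GOOGLE_SERVICE_CREDENTIALS "$CREDS"
write_override google-cloud-ops-agent-opentelemetry-collector.service \
    GOOGLE_APPLICATION_CREDENTIALS "$CREDS"
```

After copying the two .d directories into /etc/systemd/system, running sudo systemctl daemon-reload and restarting both units should make the variables visible to the services.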
See this post for more details on authentication.
I could see log messages appearing in Google Cloud Logging, which must have been scraped and sent by google-cloud-ops-agent-fluent-bit.service. However, I couldn't find any performance metrics from google-cloud-ops-agent-opentelemetry-collector. Where should I expect to find these in GCP? I'm convinced that there is some additional configuration I need to get this working, but the documentation seems to cover only running the Ops Agent on GCP Compute Engine instances.
Update 1:
I can see that the service is running (sudo systemctl status google-cloud-ops-agent-opentelemetry-collector.service), but I now see errors that I hadn't noticed before, which might explain why metrics are not making it to Google Cloud:
exporterhelper/queued_retry.go:215 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0]\nerror details: name = Unknown desc = total_point_count:1 errors:{status:{code:9} point_count:1}", "interval": "5.52330144s"}
I don't know where to find the logs for the service other than the excerpt printed by systemctl status.
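To get a rough idea of how many points are being rejected, the exporter errors can be tallied with a small filter. This is only a sketch: count_rejected_points is a name invented for this example, and it assumes the error lines always carry a total_point_count:N field like the one in the excerpt above.

```shell
#!/bin/sh
# Read log lines on stdin and print the sum of every total_point_count:N
# field found in them (0 if there are none).
count_rejected_points() {
    grep -o 'total_point_count:[0-9]*' \
        | cut -d: -f2 \
        | awk '{ sum += $1 } END { print sum + 0 }'
}
```

Assuming the unit logs to the systemd journal (the timestamped lines shown by systemctl status suggest it does), the full log could presumably be piped through it:

sudo journalctl -u google-cloud-ops-agent-opentelemetry-collector.service --no-pager | count_rejected_points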
The command line for the service is /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml. I took a look at the config file and see a few mentions of googlecloud as an exporter, e.g.
exporters:
  googlecloud:
    metric:
      prefix: ""
    user_agent: Google-Cloud-Ops-Agent-Metrics/2.11.0 (BuildDistro=focal;Platform=linux;ShortName=ubuntu;ShortVersion=20.04)
Update 2: Output of service status
● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
    Drop-In: /etc/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service.d
             └─override.conf
     Active: active (running) since Tue 2022-03-15 06:36:44 UTC; 1 day 17h ago
    Process: 1053790 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
   Main PID: 1053796 (otelopscol)
      Tasks: 10 (limit: 19198)
     Memory: 381.2M
     CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
             └─1053796 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml

Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: /root/go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/metrics.go:134
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: /root/go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:105
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: /root/go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:99
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: /root/go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:78
Mar 16 23:47:37 HOSTNAME otelopscol[1053796]: 2022-03-16T23:47:37.980Z info exporterhelper/queued_retry.go:215 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "googlecloud", "error": "[rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0-199]\nerror details: name = Unknown desc = total_point_count:200 errors:{status:{code:9} point_count:200}; rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0-199]\nerror details: name = Unknown desc = total_point_count:200 errors:{status:{code:9} point_count:200}; rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0-199]\nerror details: name = Unknown desc = total_point_count:200 errors:{status:{code:9} point_count:200}; rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0-111]\nerror details: name = Unknown desc = total_point_count:112 errors:{status:{code:9} point_count:112}]", "interval": "10.435795045s"}
Mar 16 23:47:49 HOSTNAME otelopscol[1053796]: 2022-03-16T23:47:49.299Z info exporterhelper/queued_retry.go:215 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: No matching retention policy was found for one or more points.: timeSeries[0-4]\nerror details: name = Unknown desc = total_point_count:5 errors:{status:{code:9} point_count:5}", "interval": "44.913550864s"}