[cmd/opampsupervisor] Supervisor reports last collector STDERR message #39954

dpaasman00 · 2025-05-08T18:38:36Z

Description

If the supervisor receives a "bad" remote config (one with which the collector cannot start, or fails shortly after starting) and starts the collector with it, the supervisor reports a "Failed" RemoteConfigStatus and an error. This error is usually either "Config apply timeout exceeded" or "Agent process PID=1234 exited unexpectedly, exit code=1. Will restart in a bit...".

However, this error does not say why the collector failed, so determining the root cause requires retrieving the collector's log. When those logs aren't accessible, debugging becomes very difficult, if not impossible.

This PR changes how the collector process is run so that the supervisor can keep track of the last message the collector writes to STDERR. Whenever the collector process fails, the supervisor includes this last error message in its description of the issue.
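The idea of tracking the last STDERR message can be illustrated with a minimal Go sketch. This is not the code from this PR: `lastLineWriter`, `describeFailure`, and `runAndDescribe` are hypothetical names, and the `": \n"` separator is only inferred from the example error shown in this description. The sketch assumes STDERR output is newline-delimited and keeps no bound on the length of an unterminated line.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"sync"
)

// lastLineWriter is a hypothetical io.Writer that remembers only the most
// recent non-empty line written to it. It is meant to be assigned to
// exec.Cmd.Stderr.
type lastLineWriter struct {
	mu      sync.Mutex
	partial []byte // bytes of the current line that has not been terminated yet
	last    string // most recent complete, non-empty line
}

// Write buffers p and records every complete non-empty line it finds.
func (w *lastLineWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.partial = append(w.partial, p...)
	for {
		i := bytes.IndexByte(w.partial, '\n')
		if i < 0 {
			break
		}
		if line := bytes.TrimSpace(w.partial[:i]); len(line) > 0 {
			w.last = string(line)
		}
		w.partial = w.partial[i+1:]
	}
	return len(p), nil
}

// Last returns the latest non-empty line. An unterminated trailing line,
// if present, takes precedence because it was written most recently.
func (w *lastLineWriter) Last() string {
	w.mu.Lock()
	defer w.mu.Unlock()
	if line := bytes.TrimSpace(w.partial); len(line) > 0 {
		return string(line)
	}
	return w.last
}

// describeFailure appends the last STDERR line, if any, to a base message.
// The separator is an assumption based on the example error in this PR.
func describeFailure(base, lastStderr string) string {
	if lastStderr == "" {
		return base
	}
	return fmt.Sprintf("%s: \n%s", base, lastStderr)
}

// runAndDescribe runs a command to completion. It returns an empty string
// on success, or a failure description that includes the last STDERR line.
func runAndDescribe(name string, args ...string) string {
	stderr := &lastLineWriter{}
	cmd := exec.Command(name, args...)
	cmd.Stderr = stderr
	if err := cmd.Run(); err != nil {
		return describeFailure(err.Error(), stderr.Last())
	}
	return ""
}
```

Under these assumptions, a caller would invoke `runAndDescribe` with the collector binary and its arguments, then forward any non-empty result as the error text. A real supervisor keeps the process running rather than waiting on `cmd.Run`, so it would read `Last()` at the point where it detects the failure.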

For example, if the failure is an unrecognized component in the config, this is the error reported to the OpAMP server:

"Config apply timeout exceeded: \nerror decoding 'exporters': unknown type: \"doesntexist\" for id: \"doesntexist\" (valid values: [file opensearch rabbitmq sapm signalfx splunk_hec nop alertmanager alibabacloud_logservice datadog elasticsearch googlecloud googlecloudpubsub sumologic azureblob influxdb sentry syslog zipkin otlphttp dataset stef debug awss3 awsxray azuredataexplorer honeycombmarker kafka logzio opencensus awscloudwatchlogs awsemf azuremonitor bmchelix loki mezmo prometheus pulsar carbon clickhouse tencentcloud_logservice otlp awskinesis doris googlemanagedprometheus loadbalancing logicmonitor otelarrow prometheusremotewrite cassandra coralogix])"

Testing

The E2E test for restarting after a bad config is updated to also check for the error message.

Documentation

dpaasman00 added 2 commits May 8, 2025 13:07

report collectors last err from stderr

6fb54a8

update e2e test

cea4afe

dpaasman00 requested review from evan-bradley, atoulme, tigrannajaryan and a team as code owners May 8, 2025 18:38

github-actions bot assigned codeboten May 8, 2025

github-actions bot added the cmd/opampsupervisor label May 8, 2025

chlog

3b19a58
