[cmd/opampsupervisor] Supervisor reports last collector STDERR message #39954

Open: 3 commits to merge into base: main

dpaasman00
Contributor

Description

If the supervisor receives a "bad" remote config (one that prevents the collector from starting or causes it to fail shortly after startup) and starts the collector with it, the supervisor reports a "Failed" RemoteConfigStatus along with an error message. This message is usually either "Config apply timeout exceeded" or "Agent process PID=1234 exited unexpectedly, exit code=1. Will restart in a bit...".

However, this error does not explain why the collector failed, so finding the root cause requires retrieving the collector's logs. When those logs aren't accessible, debugging becomes very difficult, if not impossible.

This PR changes how the collector process is run so that the supervisor can keep track of the last message the collector writes to STDERR. Whenever the collector process fails, the supervisor includes this last message in its description of the failure.
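The idea can be illustrated with a minimal Go sketch. This is not the code from this PR: `lastLineWriter`, `describeFailure`, and the `sh` command used as a stand-in for the collector are all hypothetical names chosen for illustration, and the sketch assumes a Unix-like system where `sh` is available. It attaches a small `io.Writer` to the child process's STDERR, remembers the last non-empty line, and appends it to the base error text using the same `": \n"` separator that appears in the example message in this description.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
	"sync"
)

// lastLineWriter is a hypothetical io.Writer that remembers the most
// recent non-empty line written to it. It is not the type used in the PR.
type lastLineWriter struct {
	mu      sync.Mutex
	partial []byte // bytes received after the most recent newline
	last    string // most recent complete, non-empty line
}

// Write buffers p and records every complete non-empty line it contains.
func (w *lastLineWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.partial = append(w.partial, p...)
	for {
		i := bytes.IndexByte(w.partial, '\n')
		if i < 0 {
			break
		}
		if line := strings.TrimSpace(string(w.partial[:i])); line != "" {
			w.last = line
		}
		w.partial = w.partial[i+1:]
	}
	return len(p), nil
}

// Last returns the unterminated trailing text if there is any,
// otherwise the last complete non-empty line.
func (w *lastLineWriter) Last() string {
	w.mu.Lock()
	defer w.mu.Unlock()
	if tail := strings.TrimSpace(string(w.partial)); tail != "" {
		return tail
	}
	return w.last
}

// describeFailure appends the last STDERR message, if one was captured,
// to the base failure description.
func describeFailure(base string, stderr *lastLineWriter) string {
	if msg := stderr.Last(); msg != "" {
		return base + ": \n" + msg
	}
	return base
}

func main() {
	stderr := &lastLineWriter{}
	// Stand-in for the collector: writes two lines to STDERR, then fails.
	cmd := exec.Command("sh", "-c",
		`echo "starting" >&2; echo "unknown type: doesntexist" >&2; exit 1`)
	cmd.Stderr = stderr
	if err := cmd.Run(); err != nil {
		fmt.Println(describeFailure("Agent process exited unexpectedly", stderr))
	}
}
```

Keeping only the last line bounds memory use, at the cost of losing earlier context when the collector prints a multi-line error; a real implementation might keep a small ring of recent lines instead.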

For example, if the failure is an unrecognized component in the config, this is the error reported to the OpAMP server:

"Config apply timeout exceeded: \nerror decoding 'exporters': unknown type: \"doesntexist\" for id: \"doesntexist\" (valid values: [file opensearch rabbitmq sapm signalfx splunk_hec nop alertmanager alibabacloud_logservice datadog elasticsearch googlecloud googlecloudpubsub sumologic azureblob influxdb sentry syslog zipkin otlphttp dataset stef debug awss3 awsxray azuredataexplorer honeycombmarker kafka logzio opencensus awscloudwatchlogs awsemf azuremonitor bmchelix loki mezmo prometheus pulsar carbon clickhouse tencentcloud_logservice otlp awskinesis doris googlemanagedprometheus loadbalancing logicmonitor otelarrow prometheusremotewrite cassandra coralogix])"

Testing

The E2E test for restarting after a bad config has been updated to check for an error message.

Documentation


2 participants