[cmd/opampsupervisor] Supervisor reports last collector STDERR message #39954
Description
If the supervisor receives a "bad" remote config (collector is unable to start or fails shortly after) and starts the collector with it, the supervisor reports a "Failed" RemoteConfigStatus and an error. This error is usually either "Config apply timeout exceeded" or "Agent process PID=1234 exited unexpectedly, exit code=1. Will restart in a bit...".
This error does not explain why the collector failed, so finding the root cause requires retrieving the collector's logs. When those logs aren't accessible, debugging becomes very difficult, if not impossible.
This PR changes how the collector process is run so that the supervisor can keep track of the last message the collector writes to STDERR. Whenever the collector process fails, the supervisor includes this last error message in its description of the issue.
For example, if the failure is an unrecognized component in the config, this is the error reported to the OpAMP server:
Testing
The E2E test for restarting after a bad config is updated to also check for an error message.
Documentation