-
Notifications
You must be signed in to change notification settings - Fork 2.7k
[cmd/supervisor] Fix config merge and opamp server start order #39949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[cmd/supervisor] Fix config merge and opamp server start order #39949
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for digging into this. The issue you point out and your fix both seem valid to me, but I'm not understanding why this doesn't happen when I run the Supervisor against the example server, or in any of the other E2E tests. What conditions trigger this error?
@@ -273,6 +280,42 @@ func TestSupervisorStartsCollectorWithRemoteConfig(t *testing.T) { | |||
}, 10*time.Second, 500*time.Millisecond, "Log never appeared in output") | |||
} | |||
|
|||
func TestSupervisorStartsCollectorWithLocalConfigOnly(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is related to starting the Supervisor with only a local config and not with a remote config, could we also check the pipeline stared by the Collector by reading the files returned by createSimplePipelineCollectorConf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 54ee558
(#39949)
cmd/opampsupervisor/e2e_test.go
Outdated
require.True(t, connected.Load(), "Supervisor failed to connect") | ||
|
||
require.EventuallyWithTf(t, func(c *assert.CollectT) { | ||
require.NotContains(c, getAgentLogs(t, storageDir), "error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there another way we can verify that the Collector has successfully connected to the Supervisor? I'm worried this check is a bit too broad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@evan-bradley I thought the same as I was developing it. The only alternative way I could think to check whether the Collector connected back would require me to export a method in the Supervisor struct only to be able to use it in the e2e test. If that's okay for you I am happy to go this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could improve this test without adding an extra method to the Supervisor
struct in 5c21464
(#39949).
Now it's checking for the presence of the line Connected to the OpAMP server
, which is there if the Collector log level is set to debug
(which I now configure in the simple pipeline conf used by the test).
Let me know please whether this is good or if you prefer the exported method in the Supervisor
struct.
@evan-bradley I think what triggers this is the presence of a local configuration that is enough for the Supervisor to start the Collector on the first attempt, inside This bug doesn't happen in the scenario where no local configuration is present because the Supervisor would have to receive some remote configuration first and this path correctly composes and writes the effective configuration to the file before starting the Collector process (see lines 1342 and 1343 below): opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go Lines 1330 to 1343 in c2c11e6
|
Description
This PR fixes a bug that happens because of broken logic in the start up process when there is no explicit OpAMP server in the Supervisor's agent config. Currently it works like this:
opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go
Lines 432 to 435 in a0044fb
The initial merged configuration is written to a file. It uses the opamp server port already saved in the Supervisor.
The "real" opamp server is started using a different random port, leading the Collector to be constantly restarted by the Supervisor because it never connects back via OpAMP.
opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go
Lines 713 to 720 in 929656d
Testing
Added an e2e to cover this.