[cmd/supervisor] Fix config merge and opamp server start order #39949


Open
douglascamata wants to merge 5 commits into main

Conversation

Contributor

douglascamata commented May 8, 2025

Description

This PR fixes a bug caused by broken logic in the startup process when there is no explicit OpAMP server in the Supervisor's agent config. Currently the startup works like this:

  1. The bootstrap happens, which starts the OpAMP server in the Supervisor on a random port. The port is saved in the Supervisor:

func (s *Supervisor) getBootstrapInfo() (err error) {
	_, span := s.getTracer().Start(context.Background(), "GetBootstrapInfo")
	defer span.End()
	s.opampServerPort, err = s.getSupervisorOpAMPServerPort()

  2. The initial merged configuration is written to a file. It uses the OpAMP server port already saved in the Supervisor.

  3. The "real" OpAMP server is started using a different random port, leading the Collector to be constantly restarted by the Supervisor because it never connects back via OpAMP:

func (s *Supervisor) startOpAMPServer() error {
	s.opampServer = server.New(newLoggerFromZap(s.telemetrySettings.Logger, "opamp-server"))
	var err error
	s.opampServerPort, err = s.getSupervisorOpAMPServerPort()
	if err != nil {
		return err
	}
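
In other words, every call to getSupervisorOpAMPServerPort can return a different random port, while the merged configuration has already been written with the first one. For illustration only (this is a sketch, not the Supervisor's actual helper), a common way to pick a random free port in Go is to bind to port 0 and let the OS choose:

package main

import (
	"fmt"
	"net"
)

// getRandomFreePort is a sketch of how a "random port" is typically obtained:
// bind to port 0, let the OS pick a free port, then release it. Two calls will
// almost always return two different ports, which is why the port written into
// the merged config (step 2) no longer matches the port the "real" OpAMP server
// binds to (step 3).
func getRandomFreePort() (int, error) {
	l, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	p1, _ := getRandomFreePort()
	p2, _ := getRandomFreePort()
	fmt.Println(p1, p2) // almost always prints two different ports
}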

Testing

Added an e2e test to cover this.

Contributor

evan-bradley left a comment


Thank you for digging into this. The issue you point out and your fix both seem valid to me, but I'm not understanding why this doesn't happen when I run the Supervisor against the example server, or in any of the other E2E tests. What conditions trigger this error?

@@ -273,6 +280,42 @@ func TestSupervisorStartsCollectorWithRemoteConfig(t *testing.T) {
}, 10*time.Second, 500*time.Millisecond, "Log never appeared in output")
}

func TestSupervisorStartsCollectorWithLocalConfigOnly(t *testing.T) {
Contributor


If this is related to starting the Supervisor with only a local config and not with a remote config, could we also check the pipeline started by the Collector by reading the files returned by createSimplePipelineCollectorConf?

Contributor Author


Sure, will fix this.

Contributor Author


Fixed in 54ee558 (#39949)

require.True(t, connected.Load(), "Supervisor failed to connect")

require.EventuallyWithTf(t, func(c *assert.CollectT) {
	require.NotContains(c, getAgentLogs(t, storageDir), "error")
Contributor


Is there another way we can verify that the Collector has successfully connected to the Supervisor? I'm worried this check is a bit too broad.

Contributor Author

douglascamata commented May 9, 2025


@evan-bradley I thought the same while developing it. The only alternative I could think of to check whether the Collector connected back would require exporting a method on the Supervisor struct just so the e2e test can use it. If that's okay with you, I'm happy to go this way.

Contributor Author

douglascamata commented May 9, 2025


I was able to improve this test without adding an extra method to the Supervisor struct in 5c21464 (#39949).

Now it checks for the presence of the line Connected to the OpAMP server, which appears when the Collector log level is set to debug (which I now configure in the simple pipeline conf used by the test).

Please let me know whether this works for you or if you'd prefer the exported method on the Supervisor struct.
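
For reference, a check along these lines matches what is described above. This is only a sketch that reuses require.EventuallyWithTf, getAgentLogs, and storageDir from the test snippet quoted earlier in this thread; the exact assertion in the PR may differ:

// Sketch only: poll the agent logs until the debug-level connection line shows up.
require.EventuallyWithTf(t, func(c *assert.CollectT) {
	logs := getAgentLogs(t, storageDir)
	require.Contains(c, logs, "Connected to the OpAMP server")
}, 10*time.Second, 500*time.Millisecond, "Collector never connected back to the Supervisor")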

Contributor Author

douglascamata commented May 9, 2025

Thank you for digging into this. The issue you point out and your fix both seem valid to me, but I'm not understanding why this doesn't happen when I run the Supervisor against the example server, or in any of the other E2E tests. What conditions trigger this error?

@evan-bradley I think what triggers this is the presence of a local configuration that is enough for the Supervisor to start the Collector on the first attempt, inside Supervisor.Start. This scenario wasn't covered before by any other e2e test. Even if it were covered, we didn't have any check confirming that the Collector could connect back to the Supervisor.

This bug doesn't happen in the scenario where no local configuration is present, because the Supervisor would have to receive some remote configuration first, and that path correctly composes and writes the effective configuration to the file before starting the Collector process (see lines 1342 and 1343 below):

case <-s.hasNewConfig:
	s.lastHealthFromClient = nil
	if !configApplyTimeoutTimer.Stop() {
		select {
		case <-configApplyTimeoutTimer.C: // Try to drain the channel
		default:
		}
	}
	configApplyTimeoutTimer.Reset(s.config.Agent.ConfigApplyTimeout)
	s.telemetrySettings.Logger.Debug("Restarting agent due to new config")
	restartTimer.Stop()
	s.stopAgentApplyConfig()
	status, err := s.startAgent()
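
For contrast, the local-config path in Supervisor.Start needs the same ordering guarantee. Below is a sketch of the order this fix aims for; of the method names used, only startOpAMPServer and startAgent appear in the snippets above, and the config-writing helper is a hypothetical placeholder:

// Sketch of the intended ordering, not the exact code in this PR.
func (s *Supervisor) Start() error {
	// 1. Start the OpAMP server first, so s.opampServerPort has its final value.
	if err := s.startOpAMPServer(); err != nil {
		return err
	}
	// 2. Compose and write the merged configuration using that (now final) port.
	if err := s.writeMergedConfigToFile(); err != nil { // hypothetical helper name
		return err
	}
	// 3. Only then start the Collector, so it connects back on the correct port.
	if _, err := s.startAgent(); err != nil {
		return err
	}
	return nil
}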
