Skip to content

⚠️ [Warm Replicas] Implement warm replica support for controllers. #3192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

godwinpang
Copy link
Contributor

This change implements the proposal for warm replicas as proposed in #3121.
It adds a NeedWarmUp option for controllers to optionally start as warmed replicas.

Note for reviewers: This draft PR is feature complete with tests but the main purpose is to make sure that things are on the right track. Some of the naming / comments are a bit inconsistent, I will address them in a followup cleanup tomorrow.

Builds upon #3190.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: godwinpang
Once this PR has been reviewed and has the lgtm label, please assign sbueringer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 9, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 9, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @godwinpang. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@godwinpang godwinpang marked this pull request as draft April 9, 2025 07:12
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 9, 2025
@sbueringer
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2025
@godwinpang
Copy link
Contributor Author

/retest

@@ -439,6 +439,11 @@ func (cm *controllerManager) Start(ctx context.Context) (err error) {
return fmt.Errorf("failed to start other runnables: %w", err)
}

// Start and wait for sources to start.
if err := cm.runnables.Warmup.Start(cm.internalCtx); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should come right after the caches and the comment shouldn't make assumptions about what the Warmup internally does.

The other issue: This needs to block until the Warmup has terminated, otherwise we may end up starting the controller before the sources are started, as the check we added only checks if we started to start the sources, not if we finished doing so

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change to coming right after the caches, but don't completely understand the significance of having it after vs. before the non-leader election runnables; mind explaining a bit?

Copy link
Contributor Author

@godwinpang godwinpang Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to block until the Warmup has terminated, otherwise we may end up starting the controller before the sources are started, as the check we added only checks if we started to start the sources, not if we finished doing so

Is this a big problem? Replicas will only start the controller after they win leader election so I don't see an issue in the leader election failover case; are you saying that in non-leader election cases the behavior of warmup should be that it completely blocks controller startup?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change to coming right after the caches, but don't completely understand the significance of having it after vs. before the non-leader election runnables; mind explaining a bit?

Just so it starts asap

Is this a big problem? Replicas will only start the controller after they win leader election so I don't see an issue in the leader election failover case; are you saying that in non-leader election cases the behavior of warmup should be that it completely blocks controller startup?

Yes. The expectation in the controller is that the sources are fully started before the rest of the controller starts. If we don't ensure this actually finished before we start the controller, we violate that expectation.

Additionally, there may be other cases where ppl implement this interface and it seems surprising to me that we wouldn't let the warmup finish before we start.

If we wanted to be really fancy, we could do this per-runnable (i.E. not block all runnables on Warmup having finished, just the ones whose Warmup didn't finish) but that is probably more involved.

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang godwinpang force-pushed the warm-replica-impl branch from e05677a to 1d07efc Compare May 2, 2025 05:56
@godwinpang
Copy link
Contributor Author

/retest

1 similar comment
@godwinpang
Copy link
Contributor Author

/retest

@godwinpang godwinpang force-pushed the warm-replica-impl branch from 1c32d37 to 7cc29dc Compare May 2, 2025 06:26
@godwinpang
Copy link
Contributor Author

/retest

@godwinpang godwinpang force-pushed the warm-replica-impl branch from 7cc29dc to 43118a3 Compare May 2, 2025 09:45
@godwinpang
Copy link
Contributor Author

/retest

1 similar comment
@godwinpang
Copy link
Contributor Author

/retest

@godwinpang godwinpang marked this pull request as ready for review May 2, 2025 18:18
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 2, 2025
@k8s-ci-robot k8s-ci-robot requested review from FillZpp and JoelSpeed May 2, 2025 18:18
@@ -93,6 +93,13 @@ type TypedOptions[request comparable] struct {
//
// Note: This flag is disabled by default until a future version. It's currently in beta.
UsePriorityQueue *bool

// NeedWarmup specifies whether the controller should start its sources when the manager is not
// the leader. When set to true, the controller will not restart its sources when/if it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain what reasons there might be to use this and the restart bit is incorrect, it would be start. Please also use the same explanation in pkg/config/controller

// Warmup implements the manager.WarmupRunnable interface.
func (c *Controller[request]) Warmup(ctx context.Context) error {
if c.NeedWarmup == nil || !*c.NeedWarmup {
c.didEventSourcesFinishSync.Store(ptr.To(true))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't finish sync, why are we storing true in here?

@@ -1014,6 +1014,82 @@ var _ = Describe("controller", func() {
})
})
})

Describe("Warmup", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test where Warmup is called but NeedsWarmup is disabled that validates that nothing happens in Warmup and that the sources do get started when Start is called


// WaitForWarmupComplete is a blocking function that waits for the warmup to be completed. It
// returns false if it could not successfully finish warmup.
WaitForWarmupComplete(context.Context) bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, why can't we instead wait for Warmup to return for this check?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants