-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Configure HealthCheck with podman update
#24442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Configure HealthCheck with podman update
#24442
Conversation
Ephemeral COPR build failed. @containers/packit-build please check. |
2f18a35
to
475e16f
Compare
475e16f
to
e052162
Compare
e052162
to
432b6d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really a full review just a quick look:
Can you remove all the flag description from the commit? I do not see how they add any value there. If I want to know what a flag does one should look at the docs or I can see that already in the diff here anyway.
libpod/container_internal.go
Outdated
noHealthCheck := false | ||
for k, v := range *changedHealthCheckConfig { | ||
switch k { | ||
case "NoHealthCheck": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all these strings seem to be used in many places so can you define them as constants somewhere, ie. libpod/define and then use the constants over the string duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I have an idea of how to simplify it overall.
621c46c
to
96def55
Compare
cmd/podman/common/create.go
Outdated
@@ -665,6 +548,141 @@ func DefineCreateFlags(cmd *cobra.Command, cf *entities.ContainerCreateOptions, | |||
`If a container with the same name exists, replace it`, | |||
) | |||
} | |||
if mode == entities.CreateMode || mode == entities.UpdateMode { | |||
// TODO: Focus on disable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My note, I'll delete it.
cmd/podman/containers/update.go
Outdated
@@ -17,7 +17,7 @@ import ( | |||
) | |||
|
|||
var ( | |||
updateDescription = `Updates the cgroup configuration of a given container` | |||
updateDescription = `Updates the configuration of an already existing container, allowing different resource limits to be set, and HealthCheck configuration. The currently supported options are a subset of the podman create/run.` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Updates the configuration of an existing container, allowing changes to resource limits and healthchecks"
libpod/container_internal.go
Outdated
@@ -2738,3 +2741,327 @@ func (c *Container) update(resources *spec.LinuxResources, restartPolicy *string | |||
|
|||
return nil | |||
} | |||
|
|||
func (c *Container) getCopyOfHealthCheckAndStartupHelathCheck() (*manifest.Schema2HealthConfig, *define.StartupHealthCheck) { | |||
var healthCheck manifest.Schema2HealthConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and startupHealthCheck should be pointers - will make returning nil a lot easier
libpod/container_internal.go
Outdated
func (c *Container) getCopyOfHealthCheckAndStartupHelathCheck() (*manifest.Schema2HealthConfig, *define.StartupHealthCheck) { | ||
var healthCheck manifest.Schema2HealthConfig | ||
if c.config.HealthCheckConfig != nil { | ||
healthCheck = manifest.Schema2HealthConfig{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can JSONDeepCopy this and save a lot of code
libpod/container_internal.go
Outdated
|
||
var startupHealthCheck define.StartupHealthCheck | ||
if c.config.StartupHealthCheckConfig != nil { | ||
startupHealthCheck = define.StartupHealthCheck{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, investigate JSONDeepCopy, I don't think this is performance critical at all.
libpod/container_internal.go
Outdated
return &healthCheck, &startupHealthCheck | ||
} | ||
|
||
func (c *Container) changeHealthCheckConfiguration(changedHealthCheckConfig *entities.UpdateHealthCheckConfig) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like leaking entities into Libpod. Maybe move UpdateHealthCheckConfig into libpod and make the entities version just contain it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, no. We should probably move this parsing code out of Libpod. Instead we should accept a manifest.Schema2HealthConfig
and define.StartupHealthCheck
in Update in libpod, and do all the validation in the frontend.
libpod/container_internal.go
Outdated
} | ||
|
||
logrus.Debugf("HealthCheck updated for container %s", c.ID()) | ||
c.newContainerEvent(events.Update) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're calling this from Update() this is unnecessary, we only want one and it should be in Update itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment here still seems to be open, we should not be creating two events
74eae00
to
4471330
Compare
libpod/container_api.go
Outdated
@@ -136,6 +137,61 @@ func (c *Container) Update(resources *spec.LinuxResources, restartPolicy *string | |||
return c.update(resources, restartPolicy, restartRetries) | |||
} | |||
|
|||
// UpdateHealthCheckConfig updates HealthCheck configuration the given container. | |||
func (c *Container) UpdateHealthCheckConfig(healthCheckConfig *manifest.Schema2HealthConfig, changedTimer bool, noHealthCheck bool) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noHealthCheck can probably be healthCheckConfig == nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think we might want to compute changedTimer
in here, instead of requiring the user pass it in - moving the rest of the parsing out of libpod was good but this was maybe a step too far.
libpod/container_api.go
Outdated
|
||
// UpdateGlobalHealthCheckConfig updates global HealthCheck configuration the given container. | ||
// If value is nil then value will be not changed. | ||
func (c *Container) UpdateGlobalHealthCheckConfig(healthLogDestination *string, healthMaxLogCount *uint, healthMaxLogSize *uint, healthCheckOnFailureAction *define.HealthCheckOnFailureAction) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make a GlobalHealthCheckOptions
struct that contains all of these? Arguments list is a little long
libpod/container_internal.go
Outdated
@@ -2738,3 +2739,122 @@ func (c *Container) update(resources *spec.LinuxResources, restartPolicy *string | |||
|
|||
return nil | |||
} | |||
|
|||
func (c *Container) resetHealthCheckTimers(noHealthCheck bool, changedTimer bool, isStartup bool) error { | |||
if !c.ensureState(define.ContainerStateCreated, define.ContainerStateRunning) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Paused as well? Do we stop the timers when we pause a container?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't found any code that can stop the timer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting. We should talk about this at standup on Monday. Not something we should solve in this PR but it does sound like a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timer runs even after the container is paused. Systemctl displays an error for a few seconds:
Nov 15 18:08:33 fedora podman[42]: error: container <id> is not running
We can discuss it on standup, but on Monday, I am going to midterm. I can create an Issue so we don't forget about it and then we can discuss it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating and rebasing and for your initiative in cleaning up more than I asked for.
Tests LGTM. One comment inline, not a blocker.
run_podman healthcheck run $ctrname | ||
is "$output" "" "output from 'podman healthcheck run'" | ||
|
||
# Run podman update in two separate runs to make sure HealthCheck is overwritten correctly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this new action (splitting into two commands) nor its comment. I will assume it makes sense to everyone else, and won't block on it but will point it out in case others are puzzled by it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I found a bug. When you try to change any resource using podman update, it overwrites the current resource configuration. Here is an example:
$ ./bin/podman run -dt --replace --name hc1 quay.io/libpod/alpine:latest top
$ ./bin/podman inspect hc1 --format {{.HostConfig.PidsLimit}}
2048
$ ./bin/podman update hc1 --pids-limit 1024
$ ./bin/podman inspect hc1 --format {{.HostConfig.PidsLimit}}
1024
$ ./bin/podman update hc1 --cpus 2 # Change any other resource.
$ ./bin/podman inspect hc1 --format {{.HostConfig.PidsLimit}}
0 <— this value should be 1024
I want to make sure this is not happening for Healtcheck as well.
@mheon @TomSweeneyRedHat @Luap99 I have updated the PR according to your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need some more time to undersatnd this.
libpod/container_internal.go
Outdated
} | ||
|
||
logrus.Debugf("HealthCheck updated for container %s", c.ID()) | ||
c.newContainerEvent(events.Update) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment here still seems to be open, we should not be creating two events
libpod/container_internal.go
Outdated
return nil | ||
} | ||
|
||
func (c *Container) updateHealthCheck(newHealthCheckConfig interface{}, currentHealthCheckConfig interface{}, setConfigFunc func(config interface{})) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this accepting interfaces, this makes API design super fragile and error prone. We should not be mixing so many types. reflection is difficult to understand and use correctly.
libpod/healthcheck.go
Outdated
// NOTE: This is only safe while being called from "podman healthcheck run" which we know | ||
// is the case here as we should not alter the exit code of another process that just | ||
// happened to call this. | ||
shutdown.SetExitCode(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait this is no longer true now, if you call this from outside podman healtchcheck run
we have big problems espically for the system service as this overwrites the global exit code on signals.
libpod/container_internal.go
Outdated
c.valid = false | ||
err := WithHealthCheckLogDestination(*globalOptions.HealthLogDestination)(c) | ||
if err != nil { | ||
return err | ||
} | ||
c.valid = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is extremely ugly, setting a container to not valid like this is not sane if the pointer is shared acorrs several threads another container may see a invalid container and bail out. I don't think we that is a huge risk as we not really share these structs at the same time but still.
And calling CtrCreateOption function somehwere else is just not nice to read. Please just split out the validation bits into another function and then call it from here and WithHealthCheckLogDestination()
utils.Error(w, http.StatusInternalServerError, fmt.Errorf("decode(): %w", err)) | ||
return | ||
} | ||
err = ctr.Update(options.Resources, restartPolicy, restartRetries) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just call (ic *ContainerEngine) ContainerUpdate() from like many other endpoints do to avoid duplication
pkg/domain/infra/abi/containers.go
Outdated
healthCheckConfig, changedHealthCheck, err := specgenutil.GetNewHealthCheckConfig(container.HealthCheckConfig(), *updateOptions.ChangedHealthCheckConfiguration) | ||
if err != nil { | ||
return "", err | ||
} | ||
if changedHealthCheck { | ||
if err := container.UpdateHealthCheckConfig(healthCheckConfig); err != nil { | ||
return "", err | ||
} | ||
} | ||
|
||
startupHealthCheckConfig, changedStartupHealthCheck, err := specgenutil.GetNewStartupHealthCheckConfig(container.Config().StartupHealthCheckConfig, *updateOptions.ChangedHealthCheckConfiguration) | ||
if err != nil { | ||
return "", err | ||
} | ||
if changedStartupHealthCheck { | ||
if err := container.UpdateStartupHealthCheckConfig(startupHealthCheckConfig); err != nil { | ||
return "", err | ||
} | ||
} | ||
|
||
globalHealthCheckOptions, err := specgenutil.GetNewGlobalHealthCheck(*updateOptions.ChangedHealthCheckConfiguration) | ||
if err != nil { | ||
return "", err | ||
} | ||
if err := container.UpdateGlobalHealthCheckConfig(globalHealthCheckOptions); err != nil { | ||
return "", err | ||
} | ||
|
||
if err = container.Update(updateOptions.Specgen.ResourceLimits, restartPolicy, updateOptions.Specgen.RestartRetries); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this entire flow makes updates not atomic which seems bad. Every time you call into libpod we have to lock again. sync the state again and then do whatever we do save the db config and state again. Doing this multiple times is slow and not atomic which means ugly race conditions most likely.
I rather have this moved into one libpod function that acts under one lock.
pkg/specgenutil/specgen.go
Outdated
return uint(valUint), nil | ||
} | ||
|
||
func GetChangedHealthCheckConfiguration(cmd *cobra.Command) define.UpdateHealthCheckConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cobra imports should not be in specgen, cli parts should all stay under cmd/podman.
I See noe reason why the function is here.
pkg/specgenutil/specgen.go
Outdated
case "health-max-log-size": | ||
val, err := parseUint(valStr) | ||
if err != nil { | ||
logrus.Error(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all these errors must be handled and returned.
pkg/specgenutil/specgen.go
Outdated
|
||
func GetChangedHealthCheckConfiguration(cmd *cobra.Command) define.UpdateHealthCheckConfig { | ||
updateHealthCheckConfig := define.UpdateHealthCheckConfig{} | ||
cmd.Flags().Visit(func(f *pflag.Flag) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get why we loop here at all over all flags? Can't we just access all the flags directly if we have to list out all name anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Visit visits only those flags that have been set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use .Changed() on a flag to know if was set or not
9145618
to
7ec1665
Compare
/packit retest-failed |
If @Luap99 is happy with the answers to your conversations, then LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this looks pretty good now but I think we can get rid of the last interface{} type casting as well.
*define.StartupHealthCheck | ||
} | ||
|
||
type IHealthCheckConfig interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I like this abstraction over the interface{}/any casting with reflect.
pkg/specgenutil/specgen.go
Outdated
return globalOptions, nil | ||
} | ||
|
||
func getHealthCheckOptions(originalHealthCheckConfig interface{}) define.HealthCheckOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can use the new IHealthCheckConfig type here instead the reflect logic in this function?
You might need to move the IHealthCheckConfig definition out of libpod for that so it can be imported on the remote client. (i.e. move it to libpod/define)
New flags in a `podman update` can change the configuration of HealthCheck when the container is started, without having to restart or recreate the container. This can help determine why a given container suddenly started failing HealthCheck without interfering with the services it provides. For example, reconfigure HealthCheck to keep logs longer than the usual last X results, store logs to other destinations, etc. Fixes: https://issues.redhat.com/browse/RHEL-60561 Signed-off-by: Jan Rodák <[email protected]>
7ec1665
to
a124942
Compare
@Luap99 I reworked the code to reduce code duplication, which removed the use of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: edsantiago, Honny1, Luap99 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@mheon PTAL |
/lgtm |
New flags in a
podman update
can change the configuration of HealthCheck when the container is started, without having to restart or recreate the container.This can help determine why a given container suddenly started failing HealthCheck without interfering with the services it provides. For example, reconfigure HealthCheck to keep logs longer than the usual last X results, store logs to other destinations, etc.
These flags are added to the
podman update
command:--health-cmd
string
: set a healthcheck command for the container ('none' disables the existing healthcheck)--health-interval
string
: set an interval for the healthcheck (a value of disable results in no automatic timer setup)(Changing this setting resets timer.) (default "30s")--health-log-destination
string
: set the destination of the HealthCheck log. Directory path, local or events_logger (local use container state file)(Warning: Changing this setting may cause the loss of previous logs.) (default "local")--health-max-log-count
uint
: set maximum number of attempts in the HealthCheck log file. ('0' value means an infinite number of attempts in the log file) (default 5)--health-max-log-size
uint
: set maximum length in characters of stored HealthCheck log. ('0' value means an infinite log length) (default 500)--health-on-failure
string
: action to take once the container turns unhealthy (default "none")--health-retries
uint
: the number of retries allowed before a healthcheck is considered to be unhealthy (default 3)--health-start-period
string
: the initialization time needed for a container to bootstrap (default "0s")--health-startup-cmd
string
: Set a startup healthcheck command for the container--health-startup-interval
string
: Set an interval for the startup healthcheck. Changing this setting resets the timer, depending on the state of the container. (default "30s")--health-startup-retries
uint
: Set the maximum number of retries before the startup healthcheck will restart the container--health-startup-success
uint
: Set the number of consecutive successes before the startup healthcheck is marked as successful and the normal healthcheck begins (0 indicates any success will start the regular healthcheck)--health-startup-timeout
string
: Set the maximum amount of time that the startup healthcheck may take before it is considered failed (default "30s")--health-timeout
string
: the maximum time allowed to complete the healthcheck before an interval is considered failed (default "30s")--no-healthcheck
: Disable healthchecks on containerFixes: https://issues.redhat.com/browse/RHEL-60561
Does this PR introduce a user-facing change?