
Configure HealthCheck with podman update #24442

Conversation

@Honny1 (Member) commented Nov 1, 2024

New flags for `podman update` can change a container's HealthCheck configuration while the container is running, without having to restart or recreate it.

This can help determine why a given container suddenly started failing its HealthCheck, without interfering with the services it provides. For example, HealthCheck can be reconfigured to keep logs longer than the usual last X results, to store logs at another destination, etc.

These flags are added to the podman update command:

  • --health-cmd string: set a healthcheck command for the container ('none' disables the existing healthcheck)
  • --health-interval string: set an interval for the healthcheck; a value of 'disable' results in no automatic timer setup, and changing this setting resets the timer (default "30s")
  • --health-log-destination string: set the destination of the HealthCheck log: a directory path, 'local', or 'events_logger' ('local' uses the container state file; warning: changing this setting may cause the loss of previous logs) (default "local")
  • --health-max-log-count uint: set the maximum number of attempts in the HealthCheck log file ('0' means an unlimited number of attempts) (default 5)
  • --health-max-log-size uint: set the maximum length in characters of the stored HealthCheck log ('0' means an unlimited log length) (default 500)
  • --health-on-failure string: action to take once the container turns unhealthy (default "none")
  • --health-retries uint: the number of retries allowed before a healthcheck is considered unhealthy (default 3)
  • --health-start-period string: the initialization time needed for a container to bootstrap (default "0s")
  • --health-startup-cmd string: set a startup healthcheck command for the container
  • --health-startup-interval string: set an interval for the startup healthcheck; changing this setting resets the timer, depending on the state of the container (default "30s")
  • --health-startup-retries uint: set the maximum number of retries before the startup healthcheck restarts the container
  • --health-startup-success uint: set the number of consecutive successes before the startup healthcheck is marked successful and the normal healthcheck begins ('0' means any success starts the regular healthcheck)
  • --health-startup-timeout string: set the maximum amount of time the startup healthcheck may take before it is considered failed (default "30s")
  • --health-timeout string: the maximum time allowed to complete the healthcheck before an interval is considered failed (default "30s")
  • --no-healthcheck: disable healthchecks on the container

Fixes: https://issues.redhat.com/browse/RHEL-60561

Does this PR introduce a user-facing change?

Configure HealthCheck with podman update

@openshift-ci bot added the do-not-merge/work-in-progress and do-not-merge/release-note-label-needed labels (Nov 1, 2024)
@openshift-merge-robot added the needs-rebase label (Nov 1, 2024)
@github-actions bot added the kind/api-change label (Nov 1, 2024)

Ephemeral COPR build failed. @containers/packit-build please check.

@Honny1 force-pushed the change-healthcheck-config-via-podman-update branch from 2f18a35 to 475e16f (November 1, 2024 07:35)
@openshift-merge-robot removed the needs-rebase label (Nov 1, 2024)
@Honny1 force-pushed the change-healthcheck-config-via-podman-update branch from 475e16f to e052162 (November 1, 2024 15:20)
@openshift-ci bot added the release-note label and removed the do-not-merge/release-note-label-needed label (Nov 1, 2024)
@Honny1 force-pushed the change-healthcheck-config-via-podman-update branch from e052162 to 432b6d3 (November 1, 2024 15:52)
@Luap99 (Member) left a comment

Not really a full review, just a quick look:

Can you remove all the flag descriptions from the commit message? I do not see how they add any value there. If I want to know what a flag does I should look at the docs, and I can see them in the diff here anyway.

noHealthCheck := false
for k, v := range *changedHealthCheckConfig {
switch k {
case "NoHealthCheck":
Member

All these strings seem to be used in many places, so can you define them as constants somewhere (i.e. libpod/define) and then use the constants instead of duplicating the strings?
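
For illustration, a minimal sketch of such shared constants, assuming the update code switches on per-field keys; the identifier names below are hypothetical, not the ones the PR finally used:

```go
package define

// Keys for the healthcheck fields that podman update can change. Defining
// them once lets the CLI and libpod share a single spelling instead of
// duplicating raw string literals. (Names are illustrative.)
const (
	UpdateKeyNoHealthCheck  = "NoHealthCheck"
	UpdateKeyHealthCmd      = "HealthCmd"
	UpdateKeyHealthInterval = "HealthInterval"
	UpdateKeyHealthRetries  = "HealthRetries"
)
```

The switch above would then read case define.UpdateKeyNoHealthCheck: instead of a bare string literal.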

Member Author

Thanks, I have an idea of how to simplify it overall.

@Honny1 force-pushed the change-healthcheck-config-via-podman-update branch 2 times, most recently from 621c46c to 96def55 (November 1, 2024 19:25)
@@ -665,6 +548,141 @@ func DefineCreateFlags(cmd *cobra.Command, cf *entities.ContainerCreateOptions,
`If a container with the same name exists, replace it`,
)
}
if mode == entities.CreateMode || mode == entities.UpdateMode {
// TODO: Focus on disable
Member

What does this mean?

Member Author

My note, I'll delete it.

@@ -17,7 +17,7 @@ import (
)

var (
updateDescription = `Updates the cgroup configuration of a given container`
updateDescription = `Updates the configuration of an already existing container, allowing different resource limits to be set, and HealthCheck configuration. The currently supported options are a subset of the podman create/run.`
Member

"Updates the configuration of an existing container, allowing changes to resource limits and healthchecks"

@@ -2738,3 +2741,327 @@ func (c *Container) update(resources *spec.LinuxResources, restartPolicy *string

return nil
}

func (c *Container) getCopyOfHealthCheckAndStartupHelathCheck() (*manifest.Schema2HealthConfig, *define.StartupHealthCheck) {
var healthCheck manifest.Schema2HealthConfig
Member

This and startupHealthCheck should be pointers - will make returning nil a lot easier

func (c *Container) getCopyOfHealthCheckAndStartupHelathCheck() (*manifest.Schema2HealthConfig, *define.StartupHealthCheck) {
var healthCheck manifest.Schema2HealthConfig
if c.config.HealthCheckConfig != nil {
healthCheck = manifest.Schema2HealthConfig{
Member

I think you can JSONDeepCopy this and save a lot of code


var startupHealthCheck define.StartupHealthCheck
if c.config.StartupHealthCheckConfig != nil {
startupHealthCheck = define.StartupHealthCheck{
Member

Same here, investigate JSONDeepCopy, I don't think this is performance critical at all.

return &healthCheck, &startupHealthCheck
}
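
As an illustration of the JSONDeepCopy suggestion, a marshal/unmarshal round trip copies every nested field in a few lines; whether podman's own helper has exactly this generic shape is an assumption, so a self-contained sketch is shown:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonDeepCopy returns an independent copy of src by round-tripping it
// through JSON, so nested slices and pointers are not shared with the
// original. A nil source yields a nil copy.
func jsonDeepCopy[T any](src *T) (*T, error) {
	if src == nil {
		return nil, nil
	}
	data, err := json.Marshal(src)
	if err != nil {
		return nil, err
	}
	dst := new(T)
	if err := json.Unmarshal(data, dst); err != nil {
		return nil, err
	}
	return dst, nil
}

func main() {
	type hc struct{ Test []string }
	orig := &hc{Test: []string{"CMD-SHELL", "curl -f localhost"}}
	copied, _ := jsonDeepCopy(orig)
	copied.Test[0] = "NONE"
	fmt.Println(orig.Test[0]) // CMD-SHELL: the original is untouched
}
```

One such call each for the regular and the startup config would replace the field-by-field copies above.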

func (c *Container) changeHealthCheckConfiguration(changedHealthCheckConfig *entities.UpdateHealthCheckConfig) (bool, error) {
Member

I don't like leaking entities into Libpod. Maybe move UpdateHealthCheckConfig into libpod and make the entities version just contain it?

Member

Actually, no. We should probably move this parsing code out of Libpod. Instead we should accept a manifest.Schema2HealthConfig and define.StartupHealthCheck in Update in libpod, and do all the validation in the frontend.

}

logrus.Debugf("HealthCheck updated for container %s", c.ID())
c.newContainerEvent(events.Update)
Member

If you're calling this from Update() this is unnecessary, we only want one and it should be in Update itself

Member

This comment here still seems to be open, we should not be creating two events

@Honny1 force-pushed the change-healthcheck-config-via-podman-update branch 3 times, most recently from 74eae00 to 4471330 (November 6, 2024 16:46)
@Honny1 marked this pull request as ready for review (November 6, 2024 18:39)
@openshift-ci bot removed the do-not-merge/work-in-progress label (Nov 6, 2024)
@Honny1 (Member, Author) commented Nov 6, 2024

@mheon and @Luap99 I have edited the PR according to your comments.

@@ -136,6 +137,61 @@ func (c *Container) Update(resources *spec.LinuxResources, restartPolicy *string
return c.update(resources, restartPolicy, restartRetries)
}

// UpdateHealthCheckConfig updates the HealthCheck configuration of the given container.
func (c *Container) UpdateHealthCheckConfig(healthCheckConfig *manifest.Schema2HealthConfig, changedTimer bool, noHealthCheck bool) error {
Member

noHealthCheck can probably be healthCheckConfig == nil

Member

And I think we might want to compute changedTimer in here, instead of requiring the user pass it in - moving the rest of the parsing out of libpod was good but this was maybe a step too far.
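
Both points can be sketched together, with simplified stand-ins for manifest.Schema2HealthConfig and the container's stored config (all names here are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// healthConfig stands in for manifest.Schema2HealthConfig.
type healthConfig struct {
	Interval time.Duration
}

type container struct {
	healthCheckConfig *healthConfig
}

// updateHealthCheckConfig treats a nil config as "disable the healthcheck"
// and computes whether the timer interval changed itself, instead of
// requiring callers to pass noHealthCheck and changedTimer in.
func (c *container) updateHealthCheckConfig(newCfg *healthConfig) (changedTimer bool) {
	if newCfg != nil && c.healthCheckConfig != nil {
		changedTimer = newCfg.Interval != c.healthCheckConfig.Interval
	}
	c.healthCheckConfig = newCfg // nil disables the healthcheck
	return changedTimer
}

func main() {
	c := &container{healthCheckConfig: &healthConfig{Interval: 30 * time.Second}}
	fmt.Println(c.updateHealthCheckConfig(&healthConfig{Interval: 10 * time.Second})) // true
	fmt.Println(c.updateHealthCheckConfig(nil))                                       // false, healthcheck disabled
}
```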


// UpdateGlobalHealthCheckConfig updates the global HealthCheck configuration of the given container.
// If a value is nil, it is left unchanged.
func (c *Container) UpdateGlobalHealthCheckConfig(healthLogDestination *string, healthMaxLogCount *uint, healthMaxLogSize *uint, healthCheckOnFailureAction *define.HealthCheckOnFailureAction) error {
Member

Maybe make a GlobalHealthCheckOptions struct that contains all of these? The argument list is a little long.
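
A sketch of such an options struct, where a nil field means "leave the current value unchanged"; the field names and the stand-in config type are assumptions:

```go
package main

import "fmt"

// GlobalHealthCheckOptions bundles the former four arguments.
type GlobalHealthCheckOptions struct {
	LogDestination  *string
	MaxLogCount     *uint
	MaxLogSize      *uint
	OnFailureAction *string // stand-in for define.HealthCheckOnFailureAction
}

// containerConfig is a simplified stand-in for the container's stored config.
type containerConfig struct {
	HealthLogDestination string
	HealthMaxLogCount    uint
	HealthMaxLogSize     uint
	OnFailureAction      string
}

// apply copies only the fields that were actually set.
func (cfg *containerConfig) apply(o GlobalHealthCheckOptions) {
	if o.LogDestination != nil {
		cfg.HealthLogDestination = *o.LogDestination
	}
	if o.MaxLogCount != nil {
		cfg.HealthMaxLogCount = *o.MaxLogCount
	}
	if o.MaxLogSize != nil {
		cfg.HealthMaxLogSize = *o.MaxLogSize
	}
	if o.OnFailureAction != nil {
		cfg.OnFailureAction = *o.OnFailureAction
	}
}

func main() {
	cfg := containerConfig{HealthLogDestination: "local", HealthMaxLogCount: 5, HealthMaxLogSize: 500}
	dest := "events_logger"
	cfg.apply(GlobalHealthCheckOptions{LogDestination: &dest})
	fmt.Printf("%+v\n", cfg) // only the log destination changed
}
```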

@@ -2738,3 +2739,122 @@ func (c *Container) update(resources *spec.LinuxResources, restartPolicy *string

return nil
}

func (c *Container) resetHealthCheckTimers(noHealthCheck bool, changedTimer bool, isStartup bool) error {
if !c.ensureState(define.ContainerStateCreated, define.ContainerStateRunning) {
Member

Maybe Paused as well? Do we stop the timers when we pause a container?

Member Author

I haven't found any code that can stop the timer.

Member

That's interesting. We should talk about this at standup on Monday. Not something we should solve in this PR but it does sound like a bug.

@Honny1 (Member, Author) commented Nov 15, 2024

The timer runs even after the container is paused; systemctl displays an error for a few seconds:

Nov 15 18:08:33 fedora podman[42]: error: container <id> is not running

We can discuss it at standup, but on Monday I have a midterm. I can create an issue so we don't forget about it, and then we can discuss it.

@edsantiago (Member) left a comment

Thank you for updating and rebasing and for your initiative in cleaning up more than I asked for.

Tests LGTM. One comment inline, not a blocker.

run_podman healthcheck run $ctrname
is "$output" "" "output from 'podman healthcheck run'"

# Run podman update in two separate runs to make sure HealthCheck is overwritten correctly.
Member

I don't understand this new action (splitting into two commands) nor its comment. I will assume it makes sense to everyone else, and won't block on it but will point it out in case others are puzzled by it too.

Member Author

Because I found a bug: when you change any one resource using podman update, it overwrites the rest of the current resource configuration. Here is an example:

$ ./bin/podman run -dt --replace --name hc1 quay.io/libpod/alpine:latest top
$ ./bin/podman inspect hc1 --format {{.HostConfig.PidsLimit}}
2048
$ ./bin/podman update hc1 --pids-limit 1024
$ ./bin/podman inspect hc1 --format {{.HostConfig.PidsLimit}}
1024
$ ./bin/podman update hc1 --cpus 2 # Change any other resource.
$ ./bin/podman inspect hc1 --format {{.HostConfig.PidsLimit}}
0  <-- this value should be 1024

I want to make sure this is not happening for HealthCheck as well.

@openshift-ci bot added the approved label (Nov 13, 2024)
@Honny1 (Member, Author) commented Nov 13, 2024

@mheon @TomSweeneyRedHat @Luap99 I have updated the PR according to your comments.

@Luap99 (Member) left a comment

I need some more time to understand this.


return nil
}

func (c *Container) updateHealthCheck(newHealthCheckConfig interface{}, currentHealthCheckConfig interface{}, setConfigFunc func(config interface{})) error {
Member

Why is this accepting interfaces? It makes the API design super fragile and error-prone. We should not be mixing so many types; reflection is difficult to understand and use correctly.

// NOTE: This is only safe while being called from "podman healthcheck run" which we know
// is the case here as we should not alter the exit code of another process that just
// happened to call this.
shutdown.SetExitCode(0)
Member

Wait, this is no longer true now. If you call this from outside podman healthcheck run we have big problems, especially for the system service, as this overwrites the global exit code on signals.

Comment on lines 2849 to 2854
c.valid = false
err := WithHealthCheckLogDestination(*globalOptions.HealthLogDestination)(c)
if err != nil {
return err
}
c.valid = true
Member

That is extremely ugly. Setting a container to not valid like this is not sane: if the pointer is shared across several threads, another caller may see an invalid container and bail out. I don't think that is a huge risk, as we do not really share these structs at the same time, but still.

And calling a CtrCreateOption function somewhere else is just not nice to read. Please split out the validation bits into another function and then call it from both here and WithHealthCheckLogDestination().
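
The split might look like this minimal sketch: a plain validation helper that both WithHealthCheckLogDestination() and the update path can call, with no c.valid toggling. The accepted values mirror the flag's documented ones; the helper name is hypothetical:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// validateHealthLogDestination checks a --health-log-destination value:
// the keywords "local" and "events_logger", or an absolute directory path.
func validateHealthLogDestination(dest string) error {
	switch dest {
	case "local", "events_logger":
		return nil
	}
	if !filepath.IsAbs(dest) {
		return fmt.Errorf("health log destination %q is neither a known keyword nor an absolute path", dest)
	}
	return nil
}

func main() {
	for _, d := range []string{"local", "/var/log/hc", "relative/path"} {
		fmt.Println(d, "->", validateHealthLogDestination(d))
	}
}
```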

utils.Error(w, http.StatusInternalServerError, fmt.Errorf("decode(): %w", err))
return
}
err = ctr.Update(options.Resources, restartPolicy, restartRetries)

Member

Can you just call (ic *ContainerEngine) ContainerUpdate() from here, like many other endpoints do, to avoid the duplication?

Comment on lines 1816 to 1844
healthCheckConfig, changedHealthCheck, err := specgenutil.GetNewHealthCheckConfig(container.HealthCheckConfig(), *updateOptions.ChangedHealthCheckConfiguration)
if err != nil {
return "", err
}
if changedHealthCheck {
if err := container.UpdateHealthCheckConfig(healthCheckConfig); err != nil {
return "", err
}
}

startupHealthCheckConfig, changedStartupHealthCheck, err := specgenutil.GetNewStartupHealthCheckConfig(container.Config().StartupHealthCheckConfig, *updateOptions.ChangedHealthCheckConfiguration)
if err != nil {
return "", err
}
if changedStartupHealthCheck {
if err := container.UpdateStartupHealthCheckConfig(startupHealthCheckConfig); err != nil {
return "", err
}
}

globalHealthCheckOptions, err := specgenutil.GetNewGlobalHealthCheck(*updateOptions.ChangedHealthCheckConfiguration)
if err != nil {
return "", err
}
if err := container.UpdateGlobalHealthCheckConfig(globalHealthCheckOptions); err != nil {
return "", err
}

if err = container.Update(updateOptions.Specgen.ResourceLimits, restartPolicy, updateOptions.Specgen.RestartRetries); err != nil {
Member

This entire flow makes updates non-atomic, which seems bad. Every time you call into libpod we have to lock again, sync the state again, do whatever we do, and then save the DB config and state again. Doing this multiple times is slow and not atomic, which most likely means ugly race conditions.

I would rather have this moved into one libpod function that acts under one lock.
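
In outline, the one-lock idea looks like the sketch below, with simplified stand-ins for libpod's container, lock, and database: every change is applied inside a single critical section and persisted once.

```go
package main

import (
	"fmt"
	"sync"
)

type container struct {
	mu     sync.Mutex
	config map[string]string
}

// update applies all changes and persists once inside one critical
// section, so concurrent readers never observe a half-applied update.
func (c *container) update(changes map[string]string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, v := range changes {
		c.config[k] = v
	}
	return c.save() // one DB write instead of one per libpod call
}

// save stands in for writing the config back to the database.
func (c *container) save() error { return nil }

func main() {
	c := &container{config: map[string]string{"HealthCmd": "curl -f localhost"}}
	_ = c.update(map[string]string{"HealthInterval": "10s", "HealthRetries": "5"})
	fmt.Println(c.config)
}
```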

return uint(valUint), nil
}

func GetChangedHealthCheckConfiguration(cmd *cobra.Command) define.UpdateHealthCheckConfig {
Member

Cobra imports should not be in specgen; CLI parts should all stay under cmd/podman. I see no reason why the function is here.

case "health-max-log-size":
val, err := parseUint(valStr)
if err != nil {
logrus.Error(err)
Member

All these errors must be handled and returned.


func GetChangedHealthCheckConfiguration(cmd *cobra.Command) define.UpdateHealthCheckConfig {
updateHealthCheckConfig := define.UpdateHealthCheckConfig{}
cmd.Flags().Visit(func(f *pflag.Flag) {
Member

I don't get why we loop over all the flags here at all. Can't we just access the flags directly, since we have to list out all the names anyway?

Member Author

Visit visits only those flags that have been set.

Member

You can use .Changed() on a flag to know whether it was set.
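
With pflag's Changed(), the per-flag check reads like the sketch below; the target struct and its fields are simplified stand-ins for the PR's actual config type:

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

type updateHealthCheckConfig struct {
	HealthCmd      *string
	HealthInterval *string
}

// fromFlags reads only the flags the user actually set: Changed() reports
// whether a flag was passed on the command line, so unset flags stay nil.
func fromFlags(cmd *cobra.Command) (*updateHealthCheckConfig, error) {
	cfg := &updateHealthCheckConfig{}
	if cmd.Flags().Changed("health-cmd") {
		v, err := cmd.Flags().GetString("health-cmd")
		if err != nil {
			return nil, err
		}
		cfg.HealthCmd = &v
	}
	if cmd.Flags().Changed("health-interval") {
		v, err := cmd.Flags().GetString("health-interval")
		if err != nil {
			return nil, err
		}
		cfg.HealthInterval = &v
	}
	return cfg, nil
}

func main() {
	cmd := &cobra.Command{Use: "update"}
	cmd.Flags().String("health-cmd", "", "healthcheck command")
	cmd.Flags().String("health-interval", "30s", "healthcheck interval")
	_ = cmd.Flags().Parse([]string{"--health-cmd", "curl -f localhost"})
	cfg, _ := fromFlags(cmd)
	fmt.Println(cfg.HealthCmd != nil, cfg.HealthInterval != nil) // true false
}
```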

@Honny1 force-pushed the change-healthcheck-config-via-podman-update branch 5 times, most recently from 9145618 to 7ec1665 (November 15, 2024 17:00)
@Honny1 (Member, Author) commented Nov 18, 2024

/packit retest-failed

@TomSweeneyRedHat (Member):
If @Luap99 is happy with the answers to your conversations, then LGTM
Happy Green Test Buttons.

@Luap99 (Member) left a comment

Thanks, this looks pretty good now, but I think we can get rid of the last interface{} type casting as well.

*define.StartupHealthCheck
}

type IHealthCheckConfig interface {
Member

Thanks, I like this abstraction over the interface{}/any casting with reflect.
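
For readers without the full diff, a guess at the shape of such an interface: each concrete config type implements the accessors the shared update logic needs, so no interface{} or reflect is involved. The method set is illustrative, not the PR's exact one:

```go
package main

import "fmt"

// IHealthCheckConfig abstracts over the regular and startup healthcheck
// config types so shared update logic can operate on either without casting.
type IHealthCheckConfig interface {
	GetTest() []string
	SetTest([]string)
}

type regularHC struct{ Test []string }

func (h *regularHC) GetTest() []string  { return h.Test }
func (h *regularHC) SetTest(t []string) { h.Test = t }

type startupHC struct{ Test []string }

func (h *startupHC) GetTest() []string  { return h.Test }
func (h *startupHC) SetTest(t []string) { h.Test = t }

// disableCheck works on either config type through the interface.
func disableCheck(c IHealthCheckConfig) { c.SetTest([]string{"NONE"}) }

func main() {
	r, s := &regularHC{}, &startupHC{}
	disableCheck(r)
	disableCheck(s)
	fmt.Println(r.Test, s.Test) // [NONE] [NONE]
}
```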

return globalOptions, nil
}

func getHealthCheckOptions(originalHealthCheckConfig interface{}) define.HealthCheckOptions {
Member

I think this can use the new IHealthCheckConfig type here instead of the reflect logic in this function. You might need to move the IHealthCheckConfig definition out of libpod for that, so it can be imported by the remote client (i.e. move it to libpod/define).

New flags for `podman update` can change a container's HealthCheck configuration while the container is running, without having to restart or recreate it.

This can help determine why a given container suddenly started failing its HealthCheck, without interfering with the services it provides. For example, HealthCheck can be reconfigured to keep logs longer than the usual last X results, to store logs at another destination, etc.

Fixes: https://issues.redhat.com/browse/RHEL-60561

Signed-off-by: Jan Rodák <[email protected]>
@Honny1 force-pushed the change-healthcheck-config-via-podman-update branch from 7ec1665 to a124942 (November 19, 2024 19:09)
@Honny1 (Member, Author) commented Nov 22, 2024

@Luap99 I reworked the code to reduce code duplication, which removed the use of interface{}.

@Honny1 requested a review from @Luap99 (November 22, 2024 10:50)
@Luap99 (Member) left a comment

LGTM

openshift-ci bot (Contributor) commented Nov 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Honny1, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rhatdan (Member) commented Nov 22, 2024

@mheon PTAL

@mheon (Member) commented Nov 22, 2024

/lgtm

@openshift-ci bot added the lgtm label (Nov 22, 2024)
@openshift-merge-bot merged commit d85ac93 into containers:main (Nov 22, 2024)
78 checks passed