WIP: Test GOCACHE/GOMODCACHE caching for unit tests #25951

Draft · mtrmac wants to merge 2 commits into main from cache-experiment
Conversation

@mtrmac (Collaborator) commented on Apr 22, 2025

containers/image#2837 suggests this might be an important speedup. Let’s start with unit tests first (scoped to each entry of the test matrix individually), as a proof of concept.
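
For reference, a minimal sketch of what such a Cirrus cache block could look like (the task name, folder path, and test entry point are illustrative, not the exact .cirrus.yml change in this PR):

    unit_test_task:
      gocache_cache:
        folder: /tmp/gocache        # assumed location; must match GOCACHE below
        fingerprint_script:
          - go version              # rebuild the cache when the toolchain changes
      env:
        GOCACHE: /tmp/gocache
      test_script: make localunit   # assumed unit-test entry point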

Does this PR introduce a user-facing change?

None

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and release-note-none labels on Apr 22, 2025
openshift-ci bot (Contributor) commented on Apr 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtrmac

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 22, 2025
Comment on lines +576 to +581
- echo "$DISTRO_NV"
- echo "$VM_IMAGE_NAME"
- echo "$CTR_FQIN"
- echo "$CI_DESIRED_DATABASE"
- echo "$CI_DESIRED_STORAGE"
- echo "$PRIV_NAME"
Member

Just using VM_IMAGE_NAME sounds best; I don't see why the others would matter here at all.

@mtrmac (Collaborator, Author)

Good point that this needs reexamining — this should not capture runtime options like things we write into storage.conf. We might want to cache builds for different Go versions, or different build tags, individually, to maximize cache coverage.
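
For example (illustrative only, with BUILDTAGS standing in for however the build tags actually reach the script), a fingerprint restricted to build inputs might be:

    fingerprint_script:
      - go version          # toolchain; its output also includes GOOS/GOARCH
      - echo "$BUILDTAGS"   # assumed variable carrying the build tags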

- echo "$CI_DESIRED_DATABASE"
- echo "$CI_DESIRED_STORAGE"
- echo "$PRIV_NAME"
- cat go.sum
Member

If we use go.sum, then we have a ton of PRs invalidating the cache all the time. That is likely a good thing to start with, as we don't want the cache to keep growing forever, but Podman has times where we merge more than one Renovate PR a day, so I am not sure how much we gain.

@mtrmac (Collaborator, Author)

I don’t have a good intuition for the choice between having an accurate cache (in some cases not needing to build anything, at the cost of frequently building without a cache) vs. having a more stable cache (almost always having some cache benefit, but almost always building something).

At minimum, I think we should definitely aim to have successive small changes to a single PR reuse the cache, so that subsequent force-pushes are tested quickly.

And there might be more that we can do; e.g. it might be a big enough win to do caching even restricted to a single PR: compile and run unit tests on one element of the matrix first, cache the outcome, then run the rest of the matrix afterwards using that cache (and the same for integration tests); see the sketch below. (Also, we might want to cache build outputs but explicitly discard test-outcome caches, to fully exercise the test matrix.)
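
A rough sketch of that single-PR staging idea, with hypothetical task names (Cirrus’ depends_on refers to tasks by name; the seed task uploads the cache on a miss, and the rest of the matrix then downloads it):

    unit_test_seed_task:          # one matrix element; populates the cache
      gocache_cache:
        folder: /tmp/gocache
        fingerprint_script:
          - go version
      test_script: make localunit

    unit_test_task:               # the rest of the matrix; starts from the uploaded cache
      depends_on:
        - unit_test_seed
      ...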

@mtrmac (Collaborator, Author)

we don't want the cache to keep growing forever

Reading the docs:

  • If using Cirrus’ hosted tests, we pay for CPU time but nothing for storage. There seems to be some limit to the size of an individual cache item, but it’s unclear what the total cache size limit or lifetime is.
  • If using custom runners or private cloud accounts, the cache is stored as cloud storage in the private account; and it seems that the storage is (by default?) configured to expire cache items after 90 days (see “Default Logs Retentions Period” in https://cirrus-ci.org/guide/supported-computing-services/ ). So, the cache lifetime might be limited, but the total storage size would indeed depend on how frequently the fingerprint changes.

Member

The caches are stored in our GCE/AWS accounts as S3-style object storage, AFAICT. So we do pay for that, but storage is generally cheaper than CPU time, so caching should most likely always be a win.

One thing we run into with Netavark (where we already make use of quite large caches) is that if the cache gets too large, you lose all the speed benefit because the download/upload of the cache takes much longer. So a good fingerprint is important.

Member

The default lifetime is set to 90 days, I believe; the same for all our CI logs.

Contributor

Note that

  • the modules cache ($GOPATH/pkg/mod) is useful even when some dependencies changed;
  • by default the cache is re-uploaded (i.e. updated) if there are any changes.

With that, we can probably be fine using just the go version for the go_unit_mod_cache fingerprint, which means it will depend on

  • a version of Go (which might be necessary since a newer version might use a newer layout for go/pkg/mod);
  • GOOS/GOARCH.

The only downside is that, since the cache is updated/re-uploaded, it will end up accumulating older versions of packages that we no longer need. This accumulation only resets once the Go version is updated.

As for the build cache ($GOCACHE), I am not quite sure if go version is sufficient for the fingerprint_script, but maybe it is.
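
Spelled out (the folder path is an assumption; the cache name is from the proposal above), that might read:

    go_unit_mod_cache:
      folder: ${GOPATH}/pkg/mod
      fingerprint_script:
        - go version   # its output covers both the toolchain version and GOOS/GOARCH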

Member

go version may not change for a long time: at least around a month on main, but on our old RHEL branches basically never. Sure, such branches should not see many changes, so maybe the cache would not actually grow, but that would worry me. I am not sure how much the cache would grow in that time; maybe we could add something like date +%U to the fingerprint, to keep a cache for at most one week?
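
Something like (illustrative; the year is added so that week 01 does not collide across years):

    fingerprint_script:
      - go version
      - date +%Y-%U   # year plus week-of-year: the cache rotates weekly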

@mtrmac (Collaborator, Author)

  • by default the cache is re-uploaded (i.e. updated) if there are any changes.

That’s not what I see: reupload_on_changes defaults to false if there are fingerprints defined, and the logs do say that it is not looking for changes.

Also, conceptually, I think we don’t want state. In principle, the tests should be isolated and reproducible; introducing long-term state adds risk. (OTOH, admittedly, we do trust the Go runtime to compile/run things correctly anyway; and, also, the whole point of the cache is to reintroduce some state.)

From that point of view, I think it is cleaner to have reupload_on_changes = false and to make the cache a ~pure function of the fingerprint inputs, even if that invalidates the cache a bit more frequently, rather than having a “monthly cache” that grows and is reset just based on the calendar.

But, after all, this is a thing that can be measured and decided based on numbers.

And, ultimately, it’s much more important to do some version of the optimization and have it in, than to get this cache design perfect, so please don’t take me all that seriously.
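
Concretely (names illustrative), that design reads as:

    gocache_cache:
      folder: /tmp/gocache
      reupload_on_changes: false   # already the default once fingerprint_script is set; spelled out for clarity
      fingerprint_script:
        - go version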

@mtrmac mtrmac force-pushed the cache-experiment branch from 9a1363e to 154fde8 on April 23, 2025 16:12
... to see the effectiveness of the other cache

Signed-off-by: Miloslav Trmač <[email protected]>
@mtrmac (Collaborator, Author) commented on Apr 23, 2025

This looks promising:
https://cirrus-ci.com/task/5111693690273792 vs. https://cirrus-ci.com/task/6390113489387520 decreases unit test time from 7:07 to 2:25, and that seems to be (?) without caching test outcomes, and without caching module downloads.

@Luap99 (Member) commented on Apr 23, 2025

That looks good. That said, unit tests are not on the critical path (i.e. they don't hold up overall test time, since there are many slower tasks). Still worth doing, of course, but personally I would be more interested in the build task caching numbers.

without caching module downloads

There should be no downloads here, as it should use the vendored files. The only task that downloads modules is the regular build task, which does the make vendor call to check that the tree is clean afterwards. So we would only need to cache modules for that task; see the sketch below.
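
A sketch of scoping the module cache to just that task (the task and folder names are assumptions; ${GOPATH}/pkg/mod is the default GOMODCACHE location):

    build_task:
      gomodcache_cache:
        folder: ${GOPATH}/pkg/mod   # assumed default GOMODCACHE location
        fingerprint_script:
          - go version
      # ... the existing build and "make vendor" steps, unchanged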

@mtrmac (Collaborator, Author) commented on Apr 23, 2025

Ah, thanks, the pre-vendored dependencies might indeed be an explanation.
