WIP: Test GOCACHE/GOMODCACHE caching for unit tests #25951

Draft · mtrmac wants to merge 2 commits into main from cache-experiment
Conversation

@mtrmac (Collaborator) commented on Apr 22, 2025

containers/image#2837 suggests this might be an important speedup. Let’s start with unit tests first (scoped to each entry of the test matrix individually), as a proof of concept.
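
For reference, a minimal sketch of what such a Cirrus cache block could look like (the task name, folder path, and test entry point are illustrative, not the exact .cirrus.yml change in this PR):

    unit_test_task:
      gocache_cache:
        folder: /tmp/gocache        # assumed location; must match GOCACHE below
        fingerprint_script:
          - go version              # rebuild the cache when the toolchain changes
      env:
        GOCACHE: /tmp/gocache
      test_script: make localunit   # assumed unit-test entry point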

Does this PR introduce a user-facing change?

None

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and release-note-none labels on Apr 22, 2025
openshift-ci bot (Contributor) commented on Apr 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtrmac

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 22, 2025
Comment on lines +576 to +581
- echo "$DISTRO_NV"
- echo "$VM_IMAGE_NAME"
- echo "$CTR_FQIN"
- echo "$CI_DESIRED_DATABASE"
- echo "$CI_DESIRED_STORAGE"
- echo "$PRIV_NAME"
Member

Just using VM_IMAGE_NAME sounds best; I don't see why the others would matter here at all.

@mtrmac (Collaborator, Author)

Good point that this needs reexamining — this should not capture runtime options like things we write into storage.conf. We might want to cache builds for different Go versions, or different build tags, individually, to maximize cache coverage.
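
For example (illustrative only, with BUILDTAGS standing in for however the build tags actually reach the script), a fingerprint restricted to build inputs might be:

    fingerprint_script:
      - go version          # toolchain; its output also includes GOOS/GOARCH
      - echo "$BUILDTAGS"   # assumed variable carrying the build tags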

- echo "$CI_DESIRED_DATABASE"
- echo "$CI_DESIRED_STORAGE"
- echo "$PRIV_NAME"
- cat go.sum
Member

If we use go.sum, then we have a ton of PRs invalidating the cache all the time. That is likely a good thing to start with, as we don't want the cache to keep growing forever, but Podman has times where we merge more than one Renovate PR a day, so I am not sure how much we gain.

@mtrmac (Collaborator, Author)

I don’t have a good intuition for the choice between having an accurate cache (in some cases not needing to build anything, at the cost of frequently building without a cache) vs. having a more stable cache (almost always having some cache benefit, but almost always building something).

At minimum, I think we should definitely aim to have successive small changes to a single PR reuse the cache, so that subsequent force-pushes are tested quickly.

And there might be more that we can do; e.g. it might be a big enough win to do caching even restricted to a single PR: compile and run unit tests on one element of the matrix first, cache the outcome, then run the rest of the matrix afterwards using that cache (and the same for integration tests); see the sketch below. (Also, we might want to cache build outputs but explicitly discard test-outcome caches, to fully exercise the test matrix.)
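
A rough sketch of that single-PR staging idea, with hypothetical task names (Cirrus’ depends_on refers to tasks by name; the seed task uploads the cache on a miss, and the rest of the matrix then downloads it):

    unit_test_seed_task:          # one matrix element; populates the cache
      gocache_cache:
        folder: /tmp/gocache
        fingerprint_script:
          - go version
      test_script: make localunit

    unit_test_task:               # the rest of the matrix; starts from the uploaded cache
      depends_on:
        - unit_test_seed
      ...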

@mtrmac (Collaborator, Author)

we don't want the cache to keep growing forever

Reading the docs:

  • If using Cirrus’ hosted tests, we pay for CPU time but nothing for storage. There seems to be some limit to the size of an individual cache item, but it’s unclear what the total cache size limit or lifetime is.
  • If using custom runners or private cloud accounts, the cache is stored as cloud storage in the private account; and it seems that the storage is (by default?) configured to expire cache items after 90 days (see “Default Logs Retentions Period” in https://cirrus-ci.org/guide/supported-computing-services/ ). So, the cache lifetime might be limited, but the total storage size would indeed depend on how frequently the fingerprint changes.

Member

The caches are stored in our GCE/AWS accounts as S3-style object storage, AFAICT. So we do pay for that, but storage is generally cheaper than CPU time, so caching should most likely always be a win.

One thing we run into with Netavark (where we already make use of quite large caches) is that if the cache gets too large, you lose all the speed benefit because the download/upload of the cache takes much longer. So a good fingerprint is important.

Member

The default lifetime is set to 90 days, I believe; the same for all our CI logs.

Contributor

Note that

  • the modules cache ($GOPATH/pkg/mod) is useful even when some dependencies changed;
  • by default the cache is re-uploaded (i.e. updated) if there are any changes.

With that, we can probably be fine using just the go version for the go_unit_mod_cache fingerprint, which means it will depend on

  • a version of Go (which might be necessary since a newer version might use a newer layout for go/pkg/mod);
  • GOOS/GOARCH.

The only downside is that, since the cache is updated/re-uploaded, it will end up accumulating older versions of packages that we no longer need. This accumulation only resets once the Go version is updated.

As for the build cache ($GOCACHE), I am not quite sure if go version is sufficient for the fingerprint_script, but maybe it is.
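
Spelled out (the folder path is an assumption; the cache name is from the proposal above), that might read:

    go_unit_mod_cache:
      folder: ${GOPATH}/pkg/mod
      fingerprint_script:
        - go version   # its output covers both the toolchain version and GOOS/GOARCH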

Member

go version may not change for a long time: at least around a month on main, but on our old RHEL branches basically never. Sure, such branches should not see many changes, so maybe the cache would not actually grow, but that would worry me. I am not sure how much the cache would grow in that time; maybe we could add something like date +%U to the fingerprint, to keep a cache for at most one week?
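
Something like (illustrative; the year is added so that week 01 does not collide across years):

    fingerprint_script:
      - go version
      - date +%Y-%U   # year plus week-of-year: the cache rotates weekly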

@mtrmac (Collaborator, Author)

  • by default the cache is re-uploaded (i.e. updated) if there are any changes.

That’s not what I see: reupload_on_changes defaults to false if there are fingerprints defined, and the logs do say that it is not looking for changes.

Also, conceptually, I think we don’t want state. In principle, the tests should be isolated and reproducible; introducing long-term state adds risk. (OTOH, admittedly, we do trust the Go runtime to compile/run things correctly anyway; and, also, the whole point of the cache is to reintroduce some state.)

From that point of view, I think it is cleaner to have reupload_on_changes = false and to make the cache a ~pure function of the fingerprint inputs, even if that invalidates the cache a bit more frequently, rather than having a “monthly cache” that grows and is reset just based on the calendar.

But, after all, this is a thing that can be measured and decided based on numbers.

And, ultimately, it’s much more important to do some version of the optimization and have it in, than to get this cache design perfect, so please don’t take me all that seriously.
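
Concretely (names illustrative), that design reads as:

    gocache_cache:
      folder: /tmp/gocache
      reupload_on_changes: false   # already the default once fingerprint_script is set; spelled out for clarity
      fingerprint_script:
        - go version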

@mtrmac mtrmac force-pushed the cache-experiment branch from 9a1363e to 154fde8 on April 23, 2025 16:12
... to see the effectiveness of the other cache

Signed-off-by: Miloslav Trmač <[email protected]>
@mtrmac (Collaborator, Author) commented on Apr 23, 2025

This looks promising:
https://cirrus-ci.com/task/5111693690273792 vs. https://cirrus-ci.com/task/6390113489387520 decreases unit test time from 7:07 to 2:25, and that seems to be (?) without caching test outcomes, and without caching module downloads.

@Luap99 (Member) commented on Apr 23, 2025

That looks good. That said, unit tests are not on the critical path (i.e. they don't hold up overall test time, since there are many slower tasks). Still worth doing, of course, but personally I would be more interested in the build task caching numbers.

without caching module downloads

There should be no downloads here, as it should use the vendored files. The only task that downloads modules is the regular build task, which does the make vendor call to check that the tree is clean afterwards. So we would only need to cache modules for that task; see the sketch below.
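
A sketch of scoping the module cache to just that task (the task and folder names are assumptions; ${GOPATH}/pkg/mod is the default GOMODCACHE location):

    build_task:
      gomodcache_cache:
        folder: ${GOPATH}/pkg/mod   # assumed default GOMODCACHE location
        fingerprint_script:
          - go version
      # ... the existing build and "make vendor" steps, unchanged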

@mtrmac (Collaborator, Author) commented on Apr 23, 2025

Ah, thanks, the pre-vendored dependencies might indeed be an explanation.
