WIP: Test GOCACHE/GOMODCACHE caching for unit tests #25951

Draft: wants to merge 2 commits into base: main

.cirrus.yml: 13 additions & 0 deletions

@@ -571,6 +571,19 @@ unit_test_task:
     env:
         TEST_FLAVOR: unit
     clone_script: *get_gosrc
+    go_unit_build_cache:
+        fingerprint_script: &go_fingerprint_script # FIXME: Almost certainly insufficient
+            - echo "$DISTRO_NV"
+            - echo "$VM_IMAGE_NAME"
+            - echo "$CTR_FQIN"
+            - echo "$CI_DESIRED_DATABASE"
+            - echo "$CI_DESIRED_STORAGE"
+            - echo "$PRIV_NAME"
Comment on lines +576 to +581
Member

Just using VM_IMAGE_NAME sounds best; I don't see why the others would matter here at all.
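
For illustration, a minimal sketch of that narrower fingerprint, keeping the task structure from the diff above (whether VM_IMAGE_NAME alone is really sufficient is exactly what is under discussion):

    go_unit_build_cache:
        fingerprint_script: &go_fingerprint_script
            - echo "$VM_IMAGE_NAME"
        folder: $GOCACHE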

Collaborator Author

Good point that this needs reexamining — this should not capture runtime options like things we write into storage.conf. We might want to cache builds for different Go versions, or different build tags, individually, to maximize cache coverage.
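
A sketch of what per-Go-version, per-build-tags caching could look like; EXTRA_BUILD_TAGS is a hypothetical variable standing in for however this CI setup would pass build tags:

    fingerprint_script: &go_fingerprint_script
        - echo "$VM_IMAGE_NAME"
        - go version               # distinct caches per Go toolchain
        - echo "$EXTRA_BUILD_TAGS" # hypothetical: build tags, if the CI exposes them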

+            - cat go.sum
Member

If we use go.sum, then we have a ton of PRs invalidating the cache all the time. It is likely a good thing to start with, since we don't want the cache to keep growing forever; but podman has times where we merge more than one renovate PR a day, so I am not sure how much we would gain.

Collaborator Author

I don’t have a good intuition for the trade-off between an accurate cache, where in some cases nothing needs to be built at all, at the cost of frequently building with no cache; and a more stable cache, where we almost always get some cache benefit, but also almost always have to build something.

At minimum, I think we should definitely aim to have successive small changes to a single PR reuse the cache, so that subsequent force-pushes are tested quickly.

And there might be more that we can do. E.g. it might be a big enough win to cache even when restricted to a single PR: compile and run unit tests on one element of the matrix first, cache the outcome, then run the rest of the matrix using the cache; same for integration tests. (Also, we might want to cache build outputs but explicitly discard test outcome caches, to fully exercise the test matrix.) A sketch of that per-PR idea follows.
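
Cirrus exposes the PR number as CIRRUS_PR, so scoping the cache to a single PR could look roughly like this (a sketch, not a tested configuration):

    go_unit_build_cache:
        fingerprint_script:
            - echo "$CIRRUS_PR"   # set only for PR builds; scopes the cache to one PR
            - go version
        folder: $GOCACHE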

Collaborator Author

> we don't want the cache to keep growing forever

Reading the docs:

  • If using Cirrus’ hosted tests, we pay for CPU time but nothing for storage. There seems to be some limit to the size of an individual cache item, but it’s unclear what the total cache size limit or lifetime is.
  • If using custom runners or private cloud accounts, the cache is stored as cloud storage in the private account; and it seems that the storage is (by default?) configured to expire cache items after 90 days (see “Default Logs Retentions Period” in https://cirrus-ci.org/guide/supported-computing-services/ ). So, the cache lifetime might be limited, but the total storage size would indeed depend on how frequently the fingerprint changes.

Member

The caches are stored in our GCE/AWS accounts as S3 storage AFAICT. So we do pay for that, but storage is generally cheaper than CPU time, so that should most likely always be better.

One thing we run into with netavark (where we already make use of quite large caches) is that if the cache gets too large, you lose all the speed benefit, because the download/upload of the cache takes much longer. So a good fingerprint is important.
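
One way to watch for that, as a sketch: print the cache sizes from within the task, so growth is visible in the logs (Cirrus runs any key ending in _script as an instruction):

    cache_size_script:
        - du -sh "$GOCACHE" "$GOPATH/pkg/mod" || true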

Member

The default lifetime is set to 90 days I believe, the same as for all our CI logs.

Contributor

Note that

  • the modules cache ($GOPATH/pkg/mod) is useful even when some dependencies changed;
  • by default the cache is re-uploaded (i.e. updated) if there are any changes.

With that, we can probably be fine using just the Go version for the go_unit_mod_cache fingerprint, which means it will depend on:

  • the Go version (which might be necessary, since a newer version might use a newer layout for go/pkg/mod);
  • GOOS/GOARCH.

The only downside is that, since the cache is updated/re-uploaded, it will end up accumulating older versions of packages that we no longer need. This accumulation is only cleared once the Go version is updated.

As for the build cache ($GOCACHE), I am not quite sure whether the Go version is sufficient for the fingerprint_script, but maybe it is.
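
Under that reasoning, a minimal sketch; note that the output of go version (e.g. “go version go1.22.3 linux/amd64”) already includes both the toolchain version and GOOS/GOARCH:

    go_unit_mod_cache:
        fingerprint_script:
            - go version   # covers toolchain version and GOOS/GOARCH in one line
        folder: $GOPATH/pkg/mod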

Member

go version may not change for a long time: at least around a month on main, and on our old RHEL branches basically never. Sure, such branches should not see many changes, so maybe the cache would not actually grow, but that would worry me. I am not sure how much the cache would grow in that time; maybe we could add something like date +%U to keep each cache for at most one week?
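
A sketch of that weekly rotation; date +%U prints the week of the year (00–53), so the fingerprint, and with it the cache, changes at most once a week:

    fingerprint_script:
        - go version
        - date +%U   # week number; rotates the cache weekly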

Collaborator Author

> by default the cache is re-uploaded (i.e. updated) if there are any changes.

That’s not what I see: reupload_on_changes defaults to false if there are fingerprints defined, and logs do say that they are not looking for changes.

Also, conceptually, I think we don’t want state. In principle, the tests should be isolated and reproducible; introducing long-term state adds risk. (OTOH, admittedly, we do trust the Go runtime to compile/run things correctly anyway; and, also, the whole point of the cache is to reintroduce some state.)

From that point of view, I think it is cleaner to have reupload_on_changes = false and to make the cache a ~pure function of the fingerprint inputs, even if that invalidates the cache a bit more frequently, than to have a “monthly cache” that grows and is reset based just on the calendar.
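
Spelling that out explicitly, as a sketch (per the above, reupload_on_changes: false should already be the default once a fingerprint_script is set):

    go_unit_build_cache:
        folder: $GOCACHE
        fingerprint_script:
            - go version
            - echo "$VM_IMAGE_NAME"
        reupload_on_changes: false   # make the no-state behavior explicit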

But, after all, this is a thing that can be measured and decided based on numbers.

And, ultimately, it’s much more important to do some version of the optimization and have it in, than to get this cache design perfect, so please don’t take me all that seriously.

+        folder: $GOCACHE
+    go_unit_mod_cache:
+        fingerprint_script: *go_fingerprint_script
+        folder: $GOPATH/pkg/mod # FIXME: This doesn’t contain any files
     setup_script: *setup
     main_script: *main
     always: *logs_artifacts