WIP: Test GOCACHE/GOMODCACHE caching for unit tests #25951
base: main
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtrmac. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
- echo "$DISTRO_NV" | ||
- echo "$VM_IMAGE_NAME" | ||
- echo "$CTR_FQIN" | ||
- echo "$CI_DESIRED_DATABASE" | ||
- echo "$CI_DESIRED_STORAGE" | ||
- echo "$PRIV_NAME" |
Just using VM_IMAGE_NAME sounds best; I don't see why the others would matter here at all.
Good point that this needs reexamining: this should not capture runtime options like the things we write into storage.conf. We might want to cache builds for different Go versions, or different build tags, individually, to maximize cache coverage.
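For illustration, a minimal sketch of a fingerprint along those lines in `.cirrus.yml`; the cache name and the `BUILDTAGS` variable are assumptions for the sketch, not something this PR defines:

```yaml
# Hypothetical cache stanza: only build-relevant inputs feed the fingerprint.
go_build_cache:
  folder: $GOCACHE            # assuming $GOCACHE points at the Go build cache directory
  fingerprint_script:
    - echo "$VM_IMAGE_NAME"   # toolchain/OS baked into the VM image
    - go version              # Go version plus GOOS/GOARCH
    - echo "$BUILDTAGS"       # hypothetical variable holding the build tags used by this task
```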
- echo "$CI_DESIRED_DATABASE" | ||
- echo "$CI_DESIRED_STORAGE" | ||
- echo "$PRIV_NAME" | ||
- cat go.sum |
If we use go.sum then we have a ton of PRs invalidating the cache all the time. It is likely a good thing to start with, as we don't want the cache to keep growing forever, but podman has times where we merge more than one renovate PR a day, so I am not sure how much we gain.
I don’t have a good intuition for the choice between an accurate cache (not needing to build anything in some cases, at the cost of frequently building without a cache) vs. a more stable cache (almost always having some cache benefit, but almost always building something).
At minimum, I think we should definitely aim to have successive small changes to a single PR reuse the cache, so that subsequent force-pushes are tested quickly.
And there might be more that we can do. For example, it might be a big enough win to do caching even when restricted to a single PR: compile and run unit tests on one element of the matrix first, cache the outcome, then run the rest of the matrix using that cache; same for integration tests. (Also, we might want to cache build outputs but explicitly discard test outcome caches, to fully exercise the test matrix.)
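As a rough sketch of that per-PR idea (using Cirrus CI's standard `CIRRUS_PR`/`CIRRUS_BRANCH` variables; the cache name is made up):

```yaml
# Hypothetical: cache keyed to a single pull request, so successive force-pushes
# to the same PR reuse it, while different PRs never share entries.
pr_scoped_build_cache:
  folder: $GOCACHE
  fingerprint_script:
    - echo "PR ${CIRRUS_PR:-branch-$CIRRUS_BRANCH}"   # fall back to the branch name on non-PR builds
    - go version
```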
> we don't want the cache to keep growing forever
Reading the docs:
- If using Cirrus’ hosted tests, we pay for CPU time but nothing for storage. There seems to be some limit to the size of an individual cache item, but it’s unclear what the total cache size limit or lifetime is.
- If using custom runners or private cloud accounts, the cache is stored as cloud storage in the private account; and it seems that the storage is (by default?) configured to expire cache items after 90 days (see “Default Logs Retentions Period” in https://cirrus-ci.org/guide/supported-computing-services/ ). So, the cache lifetime might be limited, but the total storage size would indeed depend on how frequently the fingerprint changes.
The caches are stored in our GCE/AWS accounts as S3 storage AFAICT. So we do pay for that, but storage is generally cheaper than CPU time, so that should most likely always be better.
One thing we run into with netavark (where we already make use of quite large caches) is that if the cache gets too large you lose all the speed benefit, because the download/upload of the cache takes much longer. So a good fingerprint is important.
The default lifetime is set to 90 days I believe, same for all our CI logs.
Note that:

- the modules cache (`$GOPATH/pkg/mod`) is useful even when some dependencies have changed;
- by default the cache is re-uploaded (i.e. updated) if there are any changes.

With that, we can probably be fine using just `go version` for the go_unit_mod_cache fingerprint, which means it will depend on:

- the version of Go (which might be necessary, since a newer version might use a newer layout for go/pkg/mod);
- GOOS/GOARCH.

The only downside is that, since the cache is updated/re-uploaded, it will end up accumulating older versions of packages that we no longer need. This vicious cycle will break once the Go version is updated.

As for the build cache (`$GOCACHE`), I am not quite sure whether `go version` is sufficient for the fingerprint_script, but maybe it is.
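For the modules cache part, a minimal sketch of that suggestion (assuming the go_unit_mod_cache name mentioned above, and that the output of `go version`, which already includes GOOS/GOARCH, is all we key on):

```yaml
go_unit_mod_cache:
  folder: $GOPATH/pkg/mod
  fingerprint_script:
    - go version   # e.g. "go version go1.23.1 linux/amd64": covers both the Go version and GOOS/GOARCH
```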
`go version` may not change for a long time, i.e. at least around a month on main, and on our old RHEL branches basically never. Sure, such branches should not see many changes, so maybe the cache would not actually grow, but that would worry me. Not sure how much the cache would grow in that time; maybe we could add something like `date +%U` to keep at most one week?
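For instance, something along these lines (the `%Y` is added here only to avoid week-number collisions across years; purely a sketch):

```yaml
fingerprint_script:
  - go version
  - date +%Y-%U   # week of the year; rolls the cache key over weekly so stale entries age out
```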
> - by default the cache is re-uploaded (i.e. updated) if there are any changes.

That’s not what I see: `reupload_on_changes` defaults to `false` if there are fingerprints defined, and the logs do say that they are not looking for changes.
Also, conceptually, I think we don’t want state. In principle, the tests should be isolated and reproducible; introducing long-term state adds risk. (OTOH, admittedly, we do trust the Go runtime to compile/run things correctly anyway; and, also, the whole point of the cache is to reintroduce some state.)
From that point of view, I think it is cleaner to have `reupload_on_changes = false` and to make the cache a ~pure function of the fingerprint inputs, even if that invalidates the cache a bit more frequently, than to have a “monthly cache” that grows and is reset just based on a calendar.
But, after all, this is a thing that can be measured and decided based on numbers.
And, ultimately, it’s much more important to do some version of the optimization and have it in, than to get this cache design perfect, so please don’t take me all that seriously.
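For concreteness, a hedged sketch of the “pure function of the fingerprint” variant described above; the fingerprint inputs here are only illustrative, not a recommendation:

```yaml
gocache_cache:
  folder: $GOCACHE
  reupload_on_changes: false   # upload only when the cache was missing; changes on later runs are discarded
  fingerprint_script:
    - go version
    - cat go.sum               # any dependency change produces a fresh cache instead of growing an old one
```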
... to see the effectiveness of the other cache
Signed-off-by: Miloslav Trmač <[email protected]>
This looks promising:
That looks good. That said, unit tests are not on the critical path (i.e. they don't hold up overall test time, since there are many slower tasks). Still worth doing of course, but I would be more interested in the build task caching numbers personally.
There should be no downloads here, as it should use the vendored files. The only task that downloads the modules is the regular build task that does the …
Ah, thanks, the pre-vendored dependencies might indeed be an explanation.
containers/image#2837 suggests this might be an important speedup. Let’s start with unit tests first (scoped to each entry of the test matrix individually), as a proof of concept.
Does this PR introduce a user-facing change?