-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci(disks): truncate disk names and improve disk name search in PRs #4966
Conversation
Previous behavior: Some tests were not able to create a disk image as the following error was being raised in GCP: Invalid value for field 'resource.name'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)' Expected behavior: Disk image names in GCP have a 61 characters max length constraint, and disks names are being created with a length between 71 and 84 chars. Solution: This refactors the disk naming and the way we find the latest cached disk image. Instead of searching the commit SHA of the current "running" commit, the first search is focused on finding a disk from the actual PR, as matching the actual commit will rarely be a match unless re-running a job, and even in this scenario we can search in the PR and the latest disk will be returned from a successful run Here's a disk name example with the longest name from a PR after this changes are applied: `zebrad-cache-4962-merge-v25-mainnet-checkpoint-update` (53 chars long). If a disk is manually triggered using `workflow_dispatch` the branch name must be <= 18 characters so we truncate the branch name at disk creation and while searching for an available disk name for the same branch.
Unfortunately, the branch names in this PR will cause a regression to bug #4842, because the disk names are not unique. To be unique, the disk names need to contain the commit and timestamp. We also don't want to prioritise previous commits from the same PR, because we want to be testing with the output of the current commit. (We've had bugs where the cached state created by previous commits worked, but the current commit was failing. While we can't avoid them entirely due to performance reasons, I don't want to make them more likely.) |
I don't think we're prioritizing "previous commits" with this approach, we're just prioritizing the latest/actual commit from the same PR/branch, as we're filtering images by creation date.
To workaround #4842 we can do as with templates, delete it if exists and create a new one. But as this error just happens when re-running on the same commit, the previous cached state and the actual cached state should be the same, we're just making the jobs run longer as we're not considering a potentially healthy image and rebuilding it again (with the same code). As far as I'm concern, we don't expect Zebra to generate different cached states with the exact same codebase. |
That's not quite how it works, because searching for the branch name will choose either:
Option 1 has caused bugs to be merged to main before, because the previous commit created a working state, but the current commit (which is what gets merged) didn't. We can't avoid this situation entirely without sacrificing CI performance, because that would mean always building from a main branch cached state, which is slow. But if we only prioritise the current commit when searching for a cache state, merging previous buggy states to main should be rare enough that it doesn't matter. That's why I just search for the commit in PR #4962. Searching for the commit also ignores the different GITHUB_REFs of manually triggered jobs, pushed branches, and PR merge branches. So it helps with that too.
I don't want to delete the cached state, because that creates a race condition - there are a few minutes where the old cached state has been deleted, but the new image hasn't been created yet. This means any jobs that start during that time will have different behaviour. (They'll definitely be slower, and they might behave differently or fail.) There's also a complex interaction with #4965 that I don't really want to have to think through, I'd prefer to only delete old cached states in one place. Instead, I made the image names unique in PR #4962 by adding a timestamp. That way there's no gap with a missing image.
Future jobs will actually run shorter, because:
Since the chain is always adding new blocks, the same code creates different cached states at different times - the later ones have a higher block height. But there's shouldn't be any other differences (unless there's a bug in the PR). |
Ok, now I'm understanding the edge cases better. Thank you for taking the time for explaining it. I'd like to add this same answer somewhere in a document, maybe the same one you created for continuous integration |
That sounds good, maybe we want to have the details next to the code, so we don't accidentally break things. I'll think about it later this week, and do an update to the workflow files. Once I've done that, I'll have a better idea of how I want to summarise it in the developer docs. |
Previous behavior:
Some tests were not able to create a disk image as the following error was being raised
in GCP: Invalid value for field 'resource.name'. Must be a match of regex '(?:a-z?)'
Expected behavior:
Disk image names in GCP have a 61 characters max length constraint, and disks names
are being created with a length between 71 and 84 chars.
Solution:
This refactors the disk naming and the way we find the latest cached disk image.
Instead of searching the commit SHA of the current "running" commit, the first
search is focused on finding a disk from the actual PR, as matching the actual
commit will rarely be a match unless re-running a job, and even in this scenario
we can search in the PR and the latest disk will be returned from a successful run
Here's a disk name example with the longest name from a PR after this changes are
applied:
zebrad-cache-4962-merge-v25-mainnet-checkpoint-update
(53 chars long).If a disk is manually triggered using
workflow_dispatch
the branch name mustbe <= 18 characters so we truncate the branch name at disk creation and while
searching for an available disk name for the same branch.
Fixes #4295
Review
This PR should be reviewed by @teor2345
Reviewer Checklist
Follow Up Work