ci(disks): truncate disk names and improve disk name search in PRs #4966

gustavovalverde · 2022-08-27T22:44:06Z

Previous behavior:

Some tests were not able to create a disk image as the following error was being raised
in GCP: Invalid value for field 'resource.name'. Must be a match of regex '(?:a-z?)'

Expected behavior:

Disk image names in GCP have a 61 characters max length constraint, and disks names
are being created with a length between 71 and 84 chars.

Solution:

This refactors the disk naming and the way we find the latest cached disk image.
Instead of searching the commit SHA of the current "running" commit, the first
search is focused on finding a disk from the actual PR, as matching the actual
commit will rarely be a match unless re-running a job, and even in this scenario
we can search in the PR and the latest disk will be returned from a successful run

Here's a disk name example with the longest name from a PR after this changes are
applied: zebrad-cache-4962-merge-v25-mainnet-checkpoint-update (53 chars long).
If a disk is manually triggered using workflow_dispatch the branch name must
be <= 18 characters so we truncate the branch name at disk creation and while
searching for an available disk name for the same branch.

Fixes #4295

Review

This PR should be reviewed by @teor2345

Reviewer Checklist

Tests for Expected Behaviour
Tests for Errors

Follow Up Work

Previous behavior: Some tests were not able to create a disk image as the following error was being raised in GCP: Invalid value for field 'resource.name'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)' Expected behavior: Disk image names in GCP have a 61 characters max length constraint, and disks names are being created with a length between 71 and 84 chars. Solution: This refactors the disk naming and the way we find the latest cached disk image. Instead of searching the commit SHA of the current "running" commit, the first search is focused on finding a disk from the actual PR, as matching the actual commit will rarely be a match unless re-running a job, and even in this scenario we can search in the PR and the latest disk will be returned from a successful run Here's a disk name example with the longest name from a PR after this changes are applied: `zebrad-cache-4962-merge-v25-mainnet-checkpoint-update` (53 chars long). If a disk is manually triggered using `workflow_dispatch` the branch name must be <= 18 characters so we truncate the branch name at disk creation and while searching for an available disk name for the same branch.

teor2345 · 2022-08-28T00:46:56Z

This refactors the disk naming and the way we find the latest cached disk image. Instead of searching the commit SHA of the current "running" commit, the first search is focused on finding a disk from the actual PR, as matching the actual commit will rarely be a match unless re-running a job, and even in this scenario we can search in the PR and the latest disk will be returned from a successful run

Unfortunately, the branch names in this PR will cause a regression to bug #4842, because the disk names are not unique. To be unique, the disk names need to contain the commit and timestamp.

We also don't want to prioritise previous commits from the same PR, because we want to be testing with the output of the current commit. (We've had bugs where the cached state created by previous commits worked, but the current commit was failing. While we can't avoid them entirely due to performance reasons, I don't want to make them more likely.)

gustavovalverde · 2022-08-28T04:40:00Z

We also don't want to prioritise previous commits from the same PR

I don't think we're prioritizing "previous commits" with this approach, we're just prioritizing the latest/actual commit from the same PR/branch, as we're filtering images by creation date.

Unfortunately, the branch names in this PR will cause a regression to bug #4842

To workaround #4842 we can do as with templates, delete it if exists and create a new one. But as this error just happens when re-running on the same commit, the previous cached state and the actual cached state should be the same, we're just making the jobs run longer as we're not considering a potentially healthy image and rebuilding it again (with the same code). As far as I'm concern, we don't expect Zebra to generate different cached states with the exact same codebase.

teor2345 · 2022-08-28T07:43:00Z

We also don't want to prioritise previous commits from the same PR

I don't think we're prioritizing "previous commits" with this approach, we're just prioritizing the latest/actual commit from the same PR/branch, as we're filtering images by creation date.

That's not quite how it works, because searching for the branch name will choose either:

an updated state from the previous commit on the same branch, or
an updated state from the current commit on the same branch

Option 1 has caused bugs to be merged to main before, because the previous commit created a working state, but the current commit (which is what gets merged) didn't.

We can't avoid this situation entirely without sacrificing CI performance, because that would mean always building from a main branch cached state, which is slow.

But if we only prioritise the current commit when searching for a cache state, merging previous buggy states to main should be rare enough that it doesn't matter. That's why I just search for the commit in PR #4962.

Searching for the commit also ignores the different GITHUB_REFs of manually triggered jobs, pushed branches, and PR merge branches. So it helps with that too.

Unfortunately, the branch names in this PR will cause a regression to bug #4842

To workaround #4842 we can do as with templates, delete it if exists and create a new one.

I don't want to delete the cached state, because that creates a race condition - there are a few minutes where the old cached state has been deleted, but the new image hasn't been created yet. This means any jobs that start during that time will have different behaviour. (They'll definitely be slower, and they might behave differently or fail.)

There's also a complex interaction with #4965 that I don't really want to have to think through, I'd prefer to only delete old cached states in one place.

Instead, I made the image names unique in PR #4962 by adding a timestamp. That way there's no gap with a missing image.

But as this error just happens when re-running on the same commit, the previous cached state and the actual cached state should be the same, we're just making the jobs run longer

Future jobs will actually run shorter, because:

they'll use the most recent state for this commit, which is likely to be close to the tip
the next job will use the updated state from this repeated run, which will be even closer to the tip

as we're not considering a potentially healthy image and rebuilding it again (with the same code). As far as I'm concern, we don't expect Zebra to generate different cached states with the exact same codebase.

Since the chain is always adding new blocks, the same code creates different cached states at different times - the later ones have a higher block height.

But there's shouldn't be any other differences (unless there's a bug in the PR).

gustavovalverde · 2022-08-28T11:47:35Z

Ok, now I'm understanding the edge cases better. Thank you for taking the time for explaining it. I'd like to add this same answer somewhere in a document, maybe the same one you created for continuous integration

teor2345 · 2022-08-28T11:54:31Z

Ok, now I'm understanding the edge cases better. Thank you for taking the time for explaining it. I'd like to add this same answer somewhere in a document, maybe the same one you created for continuous integration

That sounds good, maybe we want to have the details next to the code, so we don't accidentally break things. I'll think about it later this week, and do an update to the workflow files. Once I've done that, I'll have a better idea of how I want to summarise it in the developer docs.

teor2345 · 2022-08-28T22:48:32Z

These edge cases are documented in PRs #4972 and #4973.

gustavovalverde added C-bug Category: This is a bug A-infrastructure Area: Infrastructure changes A-devops Area: Pipelines, CI/CD and Dockerfiles P-Critical 🚑 I-integration-fail Continuous integration fails, including build and test failures labels Aug 27, 2022

gustavovalverde self-assigned this Aug 27, 2022

gustavovalverde requested a review from a team as a code owner August 27, 2022 22:44

gustavovalverde requested review from teor2345 and removed request for a team August 27, 2022 22:44

gustavovalverde mentioned this pull request Aug 27, 2022

fix(ci): Expand cached state disks before running tests #4962

Merged

1 task

gustavovalverde closed this Aug 28, 2022

gustavovalverde deleted the fix-ci-name-lenght-issues branch November 1, 2022 12:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci(disks): truncate disk names and improve disk name search in PRs #4966

ci(disks): truncate disk names and improve disk name search in PRs #4966

gustavovalverde commented Aug 27, 2022

teor2345 commented Aug 28, 2022

gustavovalverde commented Aug 28, 2022 •

edited

Loading

teor2345 commented Aug 28, 2022

gustavovalverde commented Aug 28, 2022

teor2345 commented Aug 28, 2022 •

edited

Loading

teor2345 commented Aug 28, 2022

ci(disks): truncate disk names and improve disk name search in PRs #4966

ci(disks): truncate disk names and improve disk name search in PRs #4966

Conversation

gustavovalverde commented Aug 27, 2022

Previous behavior:

Expected behavior:

Solution:

Review

Reviewer Checklist

Follow Up Work

teor2345 commented Aug 28, 2022

gustavovalverde commented Aug 28, 2022 • edited Loading

teor2345 commented Aug 28, 2022

gustavovalverde commented Aug 28, 2022

teor2345 commented Aug 28, 2022 • edited Loading

teor2345 commented Aug 28, 2022

gustavovalverde commented Aug 28, 2022 •

edited

Loading

teor2345 commented Aug 28, 2022 •

edited

Loading