Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a staging workflow to reduce costs #2031

Closed
wants to merge 11 commits into from

Conversation

basil
Copy link
Member

@basil basil commented Apr 29, 2023

This PR proposes turning master into a staging area as a cost reduction strategy.

The idea is as follows: for PRs to master (including all Dependabot PRs), we only run prep.sh (in the future, we could add a very small amount of Launchable testing to increase confidence) and mark the build as passing, unless a skip-subset label is present. For master builds, we only run prep.sh as well. This effectively turns master into a staging area where untested/broken code may be present.

To ensure quality releases, we would stop doing releases from the master branch and start doing them from a new release branch. On some cadence (e.g., Wednesday, Friday, and Sunday), a GitHub action runs to merge the master branch into the release branch. There should never be any conflicts here, since the only commits to the release branch will be coming from this job, which is doing a one-way sync from master to release. Unlike PRs to master, PRs to the release branch do run the full test suite, so we can catch any problems and fix them throughout the week. Non-PR builds of release fail just like current non-PR builds of master to avoid duplicated work.

So how would releases happen in this system? I am not too sure. The maven-cd workflow currently in use does not support doing releases from non-default branches and relies on https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#check_run which "will only trigger a workflow run if the workflow file is on the default branch." In our case we want to only do releases from the release branch, since that is the only branch that has been tested. For now this PR triggers a build of the release branch on Monday in preparation for a Tuesday CD release of some sort but does not actually implement the necessary changes in .github/workflows/cd.yaml. I would be looking for someone else to work on this.

The idea is that once this is adopted, the vast majority of builds would be "subset" builds, with "subset" being currently defined as just prep.sh but eventually including some tests from Launchable to improve our confidence. The only "full" builds would be on Wednesday, Friday, and Sunday when PRs are created against the release branch, as well as any non-PR builds of the release branch right before an actual release.

This would drastically reduce costs while still retaining stability. Maintainers would have to merge the automatically created PRs on Wednesday, Friday, and Sunday, and bisect/revert any offending changes if necessary. (Those changes would be fixed & resubmitted with the skip-subset label to properly test them before re-integration.) This seems like a small price to pay for a more sustainable operational deployment.


This is just a starting point, so please propose alternative ideas and start implementing them. It is much easier to talk about a proposal when looking at code (even if the code is still incomplete). For example, an alternative proposal would be to run the error step for any Dependabot PR build (making them unmergable) and write a more complicated GitHub Action workflow to periodically (say, Wednesday, Friday, and Sunday) gather these into a batch and offer a combined PR for full testing. That would be harder to implement, and while it would solve the CD problems described above, it would introduce new problems such as merge conflicts. It also would not decrease costs for non-Dependabot, non-combined PRs. So for these reasons, I didn't try this out, but I would not be upset if someone else made a competing PR with an alternative strategy for discussion.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some clarifications on GHA:

The maven-cd workflow currently in use does not support doing releases from non-default branches

Not in response to the check_run trigger, i.e., automatically when sufficient labels are present. You can however trigger releases explicitly (workflow_dispatch) from other another branch, provided that the head of the branch in fact has a passing check.

For frequently released components this would be a hassle, since you have to merge the stuff you want, then wait for Jenkins to finish, then come back and approve a release. For bom getting a release every week or two it could be tolerated. We already sometimes choose to trigger a manual release, just to ship accumulated dependencies PRs that would not individually trigger a release according to interesting-category-action.

I am not sure that the release-drafter action as currently configured would work to populate release notes according to PRs merged to master, and making this reliable could be hard given that some of the most recent such PRs might not be included in the current release.

In this scheme we could consider automerging DB PRs especially given that there would be an extra point of human intervention (the masterrelease merge) before plugin bumps appear in releases.

.github/workflows/cd.yaml Outdated Show resolved Hide resolved
.github/workflows/cd.yaml Outdated Show resolved Hide resolved
@basil basil marked this pull request as ready for review May 1, 2023 18:02
@basil
Copy link
Member Author

basil commented May 1, 2023

Looking for an Apache-style vote on the idea from the following people who regularly contribute non-operational changes to this repository (some of whom are also members of @jenkinsci/bom-developers):

Note that current costs are not sustainable and we have a mandate to decrease them, so if you are about to cast a negative vote then please be prepared to discuss (and possibly implement, if the scope is large enough) alternatives. If you intend to abstain, then please let me know explicitly if possible so I am not wondering whether or not you got the ping.

@timja
Copy link
Member

timja commented May 2, 2023

Is the release branch going to be a PR into master?

i.e. where and how will it be visible that the release branch is failing. and how will people be notified that a merge has caused a broken master build. I'm not that happy with a (likely relatively high, without investigating too much) chance of a potential failing release build forcing the releaser to go and investigate at the point of release rather than a the point of pull request

Example: I need to release bom to smooth dependency updates elsewhere but the build is failing preventing me releasing it.


Would another option be to use merge queues?

Something like don't build expensive CI by default.
but on merge queue do build it.

dependabot PRs are automatically added to merge queue, and the PRs are scheduled so should all come in around the same time and could set a min time of 30mins / 1 hour maybe.

image

@jglick jglick requested review from dduportal and a team May 2, 2023 12:16
@jglick jglick added the chore Reduces future maintenance label May 2, 2023
@jglick
Copy link
Member

jglick commented May 2, 2023

where and how will it be visible that the release branch is failing. and how will people be notified that a merge has caused a broken master build

My understanding from the PR description is that we would notice when the Wed/Fri/Sun synch builds fail. Someone would need to be paying attention to those, and manually bisect any problems. This would mean cases where a plugin or core update actually broke tests in some other plugin. The more common case is just that a plugin depends on something newer than the oldest LTS line or there is some other POM issue, which should be caught in the initial PR during the prep stage.

dependabot PRs are automatically added to merge queue, and the PRs are scheduled so should all come in around the same time

That might work. It does seem to be true that DB tends to file a batch of several PRs all at once. If it proves practical, it would be more usable than the current proposal, though it would not control costs as aggressively (would like @dduportal to offer guidance). I think we would still keep #1993 in that case: once the merge queue is good, there is no reason to rebuild master unless you are actually proposing a release.

I have been trying to follow https://github.com/orgs/community/discussions/46757, where a lot of people attempt to distinguish PR head builds from merge queue builds this way, but docs seem unclear. My attempts at using the merge queue for similar situations were not successful: while it does queue up and merge PRs, it retests each one individually, preventing any cost reduction. That may have been due to ignorance of some option.

@basil
Copy link
Member Author

basil commented May 2, 2023

where and how will it be visible that the release branch is failing.

Automation will file a pull request three times a week to merge the master branch into the release branch, which (other than actually performing a release) is the only time the full test suite will run. If the previously filed pull request against release was not merged yet, the automation will simply update it with the latest master changes.

how will people be notified that a merge has caused a broken master build.

Note that we will always be running prep.sh so the only possible cause of a broken master build under this proposal is a failing test. But this would only be noticed when syncing the master branch into the release branch. In this proposal, automation will do this and file a PR against the release branch three times a week, running the full test suite. I don't think this will be that much more work for maintainers, since they already have to dig into and investigate test failures already. This will make that process a bit more annoying by adding some increased bisection work, but such is the compromise necessary to achieve a reasonable budget.

I'm not that happy

Well we have to reduce costs somehow. By definition that is going to create some developer unhappiness. But I would argue that running the full test suite on every pull request was never an operationally sustainable design to begin with.

Would another option be to use merge queues?

I do not believe so. Merge queues test each PR individually as mentioned above.

@timja
Copy link
Member

timja commented May 2, 2023

Ok thanks for that info +1

@jglick
Copy link
Member

jglick commented May 2, 2023

+0 assuming release-drafter can be made to work somehow. The advantage compared to #2034 is that trunk failures would wind up producing a GitHub notification, though in the context of a PR that cannot be edited; if automated notification is important, it would suffice to use GHA to file an issue resulting from @nightly failures, or send a notification from Jenkinsfile for example to a designated email list. Still TBD whether there will be many such failures anyway.

@MarkEWaite
Copy link
Contributor

+1 from me

@basil
Copy link
Member Author

basil commented May 2, 2023

if automated notification is important

I think regular scheduled test runs and automated notification is critical to provide a positive incentive to regularly perform maintainership duties. Without this, maintainership duties will fall to whoever ends up performing a BOM release or doing some unrelated core/PCT work that happens to require a full test run, which will negatively incentivize that valuable work.

If #2034 implemented regular scheduled test runs and automated notification, it could outcompete this PR because of its simpler single-branch model. I would welcome this, as I believe working solutions should be evaluated based on technical merit.

As #2034 currently stands, it is a less complete version of what is being proposed here that creates negative incentives for performing BOM releases or core/PCT work that happens to require a full test run, so it would not meet my criteria for consideration as a working alternative solution.

jglick added a commit to jglick/bom that referenced this pull request May 3, 2023
MarkEWaite pushed a commit that referenced this pull request May 3, 2023
* Skip PCT by default on PRs

* Run full tests weekly

* Empty commit should suffice

* `git branch -D` might be necessary

Co-authored-by: Joseph Petersen <[email protected]>

* Copying DB automerge workflow from #2031

* Revert #2032 for `prep` phase, retaining node pool for PCT #2034 (comment)

---------

Co-authored-by: Joseph Petersen <[email protected]>
@basil basil closed this May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Reduces future maintenance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants