Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CI permissions, auto-merge maintainability, and clarity #1668

Merged
merged 13 commits into from
Nov 12, 2024

Conversation

EliahKagan
Copy link
Member

@EliahKagan EliahKagan commented Nov 11, 2024

This makes several changes to GitHub Actions workflows, which would create merge conflicts if submitted separately, and which even aside from that seemed, to varying degrees, like they would make more sense to do together than separately. But the commits are divided in such a way that breaking this up into a sequence of a few separate PRs would be easy to do, if preferred. There is also an associated recommendation for a change in the repository's settings.

The changes in this PR are as follows:

Use read-only github.token everywhere feasible

This PR sets permissions: explicitly in each workflow and, where different jobs don't all need the same permissions, in individual jobs as well. The permissions are only what each job needs and, in particular, write permissions are given only to the jobs that really require it (the jobs that create or modify releases in the release workflow).

This change is not needed to safeguard against threats from malicious pull requests, and has no effect on such risks. Its primary significance to security is that it slightly decreases the risk from supply-chain attacks. Many GitHub Actions jobs run code from numerous transitive dependencies and other sources. (I say "slightly" because it does not affect the risk when running commands locally, when creating releases, or when using releases.) Secondarily, this may also decrease risks from unintentional vulnerabilities or other bugs in dependencies or in workflows themselves.

The reason this is unrelated to threats from malicious PRs is that, when PRs that originate from forks trigger the pull_request event, jobs in workflows triggered by that event already automatically get a read-only github.token. (This repository currently does not define any workflows that trigger on the pull_request_target or workflow_run events, which would have additional security implications.)

Specific details of this change are given in f41a58c.

I recommend also setting the default permissions of GitHub tokens in GitHub Actions jobs to read-only, by going to Settings → Actions → General and, under Workflow permissions, changing it from "Read and write permissions" to "Read repository contents and package permissions". That way, if a job is ever introduced with no explicit permissions in its definition or the containing workflow, the token will be read-only rather than read-write. That can be done at any time. Doing so without making any other changes will break the release workflow until the workflow is modified to specify the necessary write permissions (a change that this PR includes).

Style workflows more consistently and slightly more spaciously

This applies the prevailing style everywhere, except that it puts blank lines between the top-level nodes of a job. (This does not change the spacing between steps, but rather between, for example, steps itself and other things at the same level.)

It also makes some other improvements, such as replacing the last actions/checkout@master with actions/checkout@v4, quoting env values when the absence of quotes makes them boolean or numeric in YAML (since they are still turned into strings to be placed in the environment, and the string representations are not always obvious), and removing an unnecessary env key of CI, which is guaranteed to be set to true by the runner.

Some of these improvements are directly motivated by the goal of preventing the addition of explicit permissions from making workflows and job definitions harder to read. Those are in 7eeeee2. In view of those, it seemed like a good idea to include the others, which otherwise would conflict with this PR or might conflict with other subsequent PRs, and which are conceptually along the same lines, but because they are not needed to allow permissions to be added without confusion, they are done in a separate commit, 44ff412.

Bump cargo-diet version and express the logic more cleanly

87670a6 bumps cargo-diet to the latest stable version and makes some improvements to the steps, including splitting it into separate installation and running steps and making the running step continue-on-error, instead of writing || true after the command that (through a just rule) runs cargo diet.

Provide richer log output for tests-pass

20794ff adapts a comment in tests-pass to be a step name, so everything is clear in logs even if the reader is unfamiliar with the workflow.

Detect new jobs that may be intended to block auto-merge

This only affects the ci.yml workflow, where the tests-pass job stands in for all the jobs that are meant to block PR auto-merge, by being set up as a required job and depending on the others (#1551). It intentionally does not depend on all jobs in the workflow: of course it does not depend on itself, and it also does not depend on test-fixtures-windows (#1657, #1663). Another change in this PR, described below, also causes it no longer to depend on wasm.

That existing situation unfortunately makes it easy to add a job that it doesn't depend on, by simply forgetting to add it or forgetting to make a decision on whether the new job should block PR auto-merge.

The solution is to maintain two lists of jobs that are checked to ensure are disjoint and form, as their union, all jobs defined in ci.yml. A job to check this is added in 4e672ef, 7b7a819, and eba50ae.

Make the wasm jobs fail the workflow but not block auto-merge

Prior to the changes in this PR, the WebAssembly jobs would report a successful conclusion even if they failed, by using the continue-on-error key at the job level. This was needed to keep them from blocking auto-merge, but only because wasm was listed as a dependency of tests-pass. This changes it so that it will report a failing conclusion (and thus fail the workflow) but not prevent PR auto-merge from working.

Since these jobs are expected to pass--if they fail then it indicates a regression in WASM compatibility, and they rarely fail--this seems like a more suitable approach. See 253b716 for full details.

More broadly, although I do not propose adopting any rule about it, I think we should be reluctant to use continue-on-error at the job level in ci.yml, ever since branch protection rules and required checks for auto-merge have been set up and (since #1551) made partially configurable in the workflow file:

  • The granularity of having jobs that unambiguously fail but do not block automatic merging seems like it should usually capture the situation where the failure represents something important but that nonetheless would not be out of the ordinary to dismiss.
  • Other than to represent something that is temporarily broken but intended to be fixed immediately, it would be strange to have a job triggered by push or pull_request whose failure doesn't represent something important, since such a job should usually either not exist or not run automatically.
  • This reasoning only applies to continue-on-error at the job level and not, for example, in a specific step of a job that has at least one step not marked continue-on-error.

But I do not know of any way for one job to depend on specific matrix jobs while not depending on others. This is inapplicable to the wasm job definition because, although it uses a matrix strategy, the current intent is that neither job generated from it block PRs. But it does apply to the cargo deny job definition, whose advisories job is marked continue-on-error, while its bans licenses sources job is not.

This is a simple job definition, which could be broken up into two non-matrix job definitions with minimal code duplication. It seems to me that this should be done, so that a new advisory causes the job to have a failing conclusion and fail the workflow even as it does not block PRs. I had originally planned to include that in this PR, but I decided not to do so, since the question of whether to do shoud probably be considered separately. There are some interesting subtleties, such as in the effect on when commands like @dependabot merge will go ahead with a merge (relevant because Dependabot security updates are created for all ecosystems, even though github-actions is the only ecosystem that gets Dependabot version updates).

That advisories job is currently failing--both here and if it is re-run on main--due to RUSTSEC-2024-0384 (sebcrozet/instant#52), announcing that instant is unmaintained and web-time should be used instead. I had initially hoped to fix that here, but it is not feasible to do so: we use instant only as a transitive dependency through other crates that are not part of this repository and whose dependencies should be updated to use web-time. That situation is unaffected by any changes in this PR.

Update other GitHub-specific YAML configuration

e45ea25 removes the old Dependabot configuration file for the cargo ecosystem, and adds a comment to the current Dependabot configuration file to explain why that ecosystem is not included. The explanation only links to #144, but accounts for the full current situation (see #143, #144, #1254 (comment), and #1357).

4564a64 removes the empty pull request template file, as discussed in #1667.

(These changes are done here rather than in a separate PR because I was already editing dependabot.yml for style, and also because, as detailed above, I had originally planned to include other changes that might affect Dependbot indirectly, but that are deferred so they can be reviewed separately.)

EliahKagan and others added 12 commits November 10, 2024 16:11
In GitHub Actions workflow files, this separates top-level keys
within a job with blank lines, which was sometimes but not usually
already done, since this seems to improve readability. Besides
that, this also applies the prevailing style more consistently.

This style tweak is in preparation for adding explicit
`permissions` keys (so that doing so won't decrease readability).
This adds `permissions` keys at workflow or job level in a number
of places, usually specifying minimal required permissions.

When a `permissions` key contains at least one subkey setting some
permission (in any way), all permissions it does not have subkeys
for are treated as if they were present with a value of `none`.
This relies on that rather than listing all unneeded permissions
everywhere with `none`.

So most `permissions` added here have only `contents: read`, and:

- The only place where `none` is specified explicitly is in the
  CIFuzz workflow, where no permissions (of those controllable
  through `permissions`) are needed.

- The only place any `write` permissions are specified are
  `contents: write` in the jobs of the release workflow that need
  it. All jobs involved in preparing a release currently have at
  least one step that requires this. But `contents: read` is still
  applied at the workflow level, because the `installation` job
  (which tests installing) does not need any `write` permissions.

Note that some jobs that don't have any write permissions of the
kind that is controlled under a `permissions` key do still perform
writes to data accessible outside of their run: caching (including
creating new caches), and uploading artifacts, still works.

Relevant documentation:

- https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#permissions
- https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idpermissions
- https://docs.github.com/en/rest/authentication/permissions-required-for-github-apps
This makes more stylistic adjustments to YAML workflows files.

(Previous recent adjustments were, for the most part, only those
needed to ensure clarity would be preserved while adding
`permissions:`. This commit does some others that weren't needed
for that.)

The changes made here are:

- Use `actions/checkout@v4`, not `actions/checkout@master`, in the
  one place where `@master` was still used for it. This is not
  strictly speaking a purely stylistic change, since it is
  eventually expected to make a difference, by holding onto major
  version 4 until the version is updated (which Depedabot
  automates).

- Don't explicitly set the `CI` environment variable in the `test`
  job. This variable is guaranteed to be set automatically with the
  value of `true` by the GitHub Actions runner. Furthermore, no
  environment variables to indicate CI are explicitly set for other
  jobs, even though others rely on their presence as well due to
  the use of `is_ci` through `gix-testtools`.

- Specify all environment variables' values as strings. Most
  values in YAML are strings without explicit quoting. This does
  not add quotation marks in those cases. But some values are
  parsed as some other type, such as integer, unless quoted. That
  makes it less obvious what the value will actually be in the
  environment, since it will be implicitly converted to a string,
  which does not always consist of the same sequence of characters
  as the original expression. This effect is most prominent for
  booleans (e.g. unquoted `yes` and `true` in YAML have the same
  meaning) but not entirely limited to them. In addition, such
  expressions may also lead to confusion if it is misread to mean
  that they will retain any kind of type information. So this
  quotes such values (except when other changes here remove them).

- Minor quoting style adjustments, for YAML strings are quoted.
  Omit quotes when clearly not needed, and use single-quotes by
  default when quotes are needed.

- Use 2-space indent inside scripts in script steps for consistency
  with other scripts. Most existing multi-line script steps, as
  well as all shell scripts, use 2-space indents already.

- Remove `\` in a script step where it was not needed for line
  continuation because it followed a `|`.

- In the `wasm` job's script steps, put `name:` and, when present,
  `if:`, ahead of `run:`, so it's clear what each step is for.
  (This was already done where applicable in other jobs.)

- In the `wasm` job's script steps, split single-line `run:` values
  with `&&` into separate commands, one per line, where doing so
  has the same effect due to the `-e` option the CI runner
  automatically passes to `bash` shells unless `shell:` is
  overridden with a value that contains `{0}`.

- In the `wasm` job's script steps,  split single-line `run:`
  values with `;` into separate lines, including running loops over
  multiple lines, for readability.

- In the `wasm` job, extract `${{ matrix.target }}` to an
  environment variable. This seems to make the steps slightly more
  readable, since `wasm` steps make heavy use of it.

- Extremely minor adjustment to array style, for consistency.

- In commands, use `--` where guaranteed to be supported if any
  non-option argument begins with parameter expansion or a glob.
  This was already almost, but not quite, already done. Since the
  possible values can be inferred from nearby code and do not
  begin with `-`, the reason is to clearly communicate that they
  are non-option arguments, rather than to fix any actual bug.
  (This continues to be avoided where is not guaranteed correct.)

- Use the `-e` option before a pattern argument to `grep` formed
  through parameter expansion (same rationale as `--` above).

- Use `-E` rather than `-r` with `sed`, since `-E` is standardized.
In the past (GitoxideLabs#143), Dependabot version updates had been used to
keep `cargo` dependencies up to date. This was removed in favor of
doing manual updates based on automatic reports from `cargo deny`
and the old `dependabot.yml` was kept but renamed to disable it and
point people to GitoxideLabs#144 to learn about the change. Since then,
Dependabot security updates, which are distinct from Dependabot
version updates, were enabled (see GitoxideLabs#1254), and later, Dependabot
version updates were reintroduced for GitHub Actions only (GitoxideLabs#1357).
At that point, there were two Dependabot-related YAML files: the
old disabled one, and the new one for GHA.

This removes the old one, explaining the situation in a comment in
the new one, including a link to GitoxideLabs#144.

While doing so, this also adjusts the YAML code style there, to
bring it in line with the style of most other YAML files in the
repository.
This gives `name:` keys to the `test-pass` steps, turning the
first step's comment into its name. This way, the output can be as
clear as the workflow file itself.
This intentionally fails right now, by omitting `tests-pass` itself
as a job that `tests-pass` should not depend on, in order to ensure
that it is able to fail. Once this is observed (and any other bugs
fixed), this omission should be corrected, and then it should pass.
- Show the job lists for better debugging.
- Reword the `diff` step name to be clearer (albeit less precise).
Changes:

- Bump the version of `cargo-diet` used in the `lint` job from
  1.2.4 to 1.2.7.

- Use the version tag not just as an operand to `--tag`, but also
  as the ref from which the installation script itself is obtained.
  (The rationale is that the effect of skew here would probably be
  unintuitive, and also that we had been specifying `master` but
  the default branch of the `cargo-diet` repository is now `main`.)

- Instead of using `|| true` on the step that runs
  `just check-size` (which needs cargo-diet), split installing
  and running into separate steps, and mark the running step
  `continue-on-error`.
The WebAssembly CI job definition had an unconditional job-level
`continue-on-error` set to `true`, so that it would always report a
successful conclusion, even on failure. This made sense before PR
auto-merge was set up. But these jobs do not typically fail, and
their failures should be more apparent, even if they should not
yet block PRs from being auto-merged.

This commit changes the `wasm` jobs so they are able to fail and
cause the workflow as a whole to have a failing status, while still
not blocking PR auto-merge. `continue-on-error` is removed from the
job definition, but it is also no longer made a dependency of the
required `tests-pass` check.
@Byron Byron force-pushed the run-ci/gha-permissions branch from 98f2c3d to 5173e9a Compare November 12, 2024 07:30
@EliahKagan
Copy link
Member Author

I didn't expect that updating dependencies in Cargo.lock (5173e9a) would be able to fix the cargo deny advisories error about depending on the unmaintained instant crate yet, but it looks like it has. Thanks!

I had totally missed that this could already be fixed easily, because I didn't realize that fastrand had completely stopped depending on instant since 2.0.0 due to smol-rs/fastrand@eb78d89 (smol-rs/fastrand#60), and we had been depending on an earlier version than that. Sorry about my inaccurate claim that the dependency on instant was not possible to remove in this PR!

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this massive improvement! I can't wait to see it merged and adjust the auto-merge configuration to listen to tests-pass instead of a long list of individual jobs.

Screenshot 2024-11-12 at 08 07 09

Thanks a lot for the hint the additional reduction of risk. I assume that this repository setting is only doable if permissions are set correctly on a PR level, so I am glad this is possible now.

This is a simple job definition, which could be broken up into two non-matrix job definitions with minimal code duplication. It seems to me that this should be done, so that a new advisory causes the job to have a failing conclusion and fail the workflow even as it does not block PRs. I had originally planned to include that in this PR, but I decided not to do so, since the question of whether to do shoud probably be considered separately.

I agree, let's merge this and have the WASM-dematrixification (if that's a word ;)) in a separate PR. There has been some interest in continuing the development of WASM-support lately so improving CI in that regard will certainly be helpful to them as well.

I didn't expect that updating dependencies in Cargo.lock (5173e9a) would be able to fix the cargo deny advisories error about depending on the unmaintained instant crate yet, but it looks like it has. Thanks!

No worries at all - at this time I didn't even read up to your assessment of the situation yet and just responded with my standard reflex in these situations 😁. Luckily, it worked.

.github/workflows/ci.yml Show resolved Hide resolved
@Byron Byron enabled auto-merge November 12, 2024 07:49
@EliahKagan
Copy link
Member Author

EliahKagan commented Nov 12, 2024

Thanks so much for this massive improvement! I can't wait to see it merged and adjust the auto-merge configuration to listen to tests-pass instead of a long list of individual jobs.

You give me too much credit, because the improvements here are not needed for that, though I believe they do make it easier to keep it working properly. tests-pass was introduced in #1551 and has been appropriate to mark as the solely required check in ci.yml since then. I am also pretty sure that you have already made it so, since none of the other checks show as required in PRs since around that time.

Edit: Yes, this was reported as done in #1551 (comment).

#1551 did not make it possible for tests-pass to substitute for listing checks from other workflows as required, but nothing in this PR extends it in that way, either. As far as I know, there is no change to this repository's required checks (or any other branch protection related settings) that should be done following the merge of this PR.

@EliahKagan
Copy link
Member Author

EliahKagan commented Nov 12, 2024

CI failed here on 5173e9a with a very strange error. It can be examined by expanding the annotations shown at the top of that page. The error reports that the GHA runner machine was out of disk space. Although it's possible for a new dependency to have a bug that causes massively increased disk space usage, I think that's probably not the cause, because the error does not happen for that same commit on my fork. My guess is that it might go away if the workflow is re-run, which I recommend trying.

Edit: Also, the error is preceded by a warning about low disk space, and the warning seems to have been generated prior to any toolchains or dependencies being installed, so the changes in 5173e9a are unlikely to have triggered it.

Further edit: Unfortunately that appears not to be the case. The error does happen again in 38edb2c (#1670), though again it happens only in the pull_request run here, and not in the push run for the same commit in my fork. For this reason, I am concerned that it might be triggered by the changes here. It's appealing to think this might be due to a transient problem with the GitHub infrastructure, but there are no known recent incidents. Maybe it has something to do with caching in these workflows.

It also passes in my fork on a fork-internal pull_request trigger, both on a PR equivalent to this one (EliahKagan#3), and on a PR that is just 5173e9a cherry-picked (EliahKagan#4). I don't know why this is failing on the upstream repository but passing in my fork. However, I think it is actually failing later than it seems: I think the test step runs, and gets quite far through, but then when the disk is out of space, it is canceled and that step's log is not preserved. I've opened #1671 to see if a new PR here, with an empty commit and then with 5173e9a cherry-picked (comparable to EliahKagan#4), will fail here.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Nov 12, 2024
The problem seems to happen only on the upstream repository, or
maybe only on PRs from forks. This initially makes no changes, to
verify that the problem does not simply happen regardless of
content.

See GitoxideLabs#1668, and in particular
GitoxideLabs#1668 (comment),
as well as #3, and #4, for details.
@Byron
Copy link
Member

Byron commented Nov 12, 2024

You give me too much credit, because the improvements here are not needed for that, though I believe they do make it easier to keep it working properly. tests-pass was introduced in #1551 and has been appropriate to mark as the solely required check in ci.yml since then. I am also pretty sure that you have already made it so, since none of the other checks show as required in PRs since around that time.

Edit: Yes, this was reported as done in #1551 (comment).

#1551 did not make it possible for tests-pass to substitute for listing checks from other workflows as required, but nothing in this PR extends it in that way, either. As far as I know, there is no change to this repository's required checks (or any other branch protection related settings) that should be done following the merge of this PR.

Oh, I see, thanks for the clarification. I was on an entirely different page there apparently.

No space left on device

For posterity, here is the way the runner itself crashes:

System.IO.IOException: No space left on device : '/home/runner/runners/2.320.0/_diag/Worker_20241112-075012-utc.log'
at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan1 buffer, Int64 fileOffset) at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite() at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder) at System.Diagnostics.TextWriterTraceListener.Flush() at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id) at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message) at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message) at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut) at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args) System.IO.IOException: No space left on device : '/home/runner/runners/2.320.0/_diag/Worker_20241112-075012-utc.log' at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan1 buffer, Int64 fileOffset)
at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
at System.Diagnostics.TextWriterTraceListener.Flush()
at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
at GitHub.Runner.Common.Tracing.Error(Exception exception)
at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/runners/2.320.0/_diag/Worker_20241112-075012-utc.log'
at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
at System.Diagnostics.TextWriterTraceListener.Flush()
at System.Diagnostics.TraceSource.Flush()
at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
at GitHub.Runner.Common.TraceManager.Dispose()
at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
at GitHub.Runner.Common.HostContext.Dispose()
at GitHub.Runner.Worker.Program.Main(String[] args)

So it looks like writing a trace to a log fails during normal operation, and that even before anything happens.

Screenshot 2024-11-12 at 11 21 13

The cache is pretty big, but no reason to fail this soon.

On the current try, it gets to running the tests at least and I will keep watching it - maybe it works now and all this was no more than a bad (and transient) dream.

@Byron
Copy link
Member

Byron commented Nov 12, 2024

It really does fail when trying to compile:

   Compiling gix-protocol v0.46.0 (/home/runner/work/gitoxide/gitoxide/gix-protocol)
   Compiling gix-status v0.14.0 (/home/runner/work/gitoxide/gitoxide/gix-status)
   Compiling gix-merge v0.0.0 (/home/runner/work/gitoxide/gitoxide/gix-merge)
   Compiling gix-worktree-state v0.14.0 (/home/runner/work/gitoxide/gitoxide/gix-worktree-state)
   Compiling gix-negotiate v0.16.0 (/home/runner/work/gitoxide/gitoxide/gix-negotiate)
   Compiling gix-mailmap v0.25.0 (/home/runner/work/gitoxide/gitoxide/gix-mailmap)
   Compiling tracing-forest v0.1.6
   Compiling gix v0.67.0 (/home/runner/work/gitoxide/gitoxide/gix)
   Compiling gix-fsck v0.7.0 (/home/runner/work/gitoxide/gitoxide/gix-fsck)
   Compiling env_logger v0.10.2
   Compiling gitoxide-core v0.42.0 (/home/runner/work/gitoxide/gitoxide/gitoxide-core)
rustc-LLVM ERROR: IO failure on output stream: No space left on device
rustc-LLVM ERROR: IO failure on output stream: No space left on device
error: could not compile `gitoxide-core` (lib)
warning: build failed, waiting for other jobs to finish...
error: could not compile `gix` (lib)
error: Recipe `journey-tests-async` failed on line 220 with exit code 101
Error: Process completed with exit code 101.

And then the log gets truncated so when looking at it again, it seems like not much happened. Let me try something….

That way, the runner hopefully doesn't run out of disk space while
it is compiling the various targets used in a journey test.
@EliahKagan
Copy link
Member Author

Great catch--I had at some point wondered why nothing ever called clear-target and I should've thought to ask!

The #1671 test PR here passes initially, and then fails after cherry-picking 5173e9a from here. So the increased space usage is triggered by upgrading the crates, somehow. Maybe this is due to dependencies being heavier, but my guess is that it is instead due to there being more stuff that has to be generated rather than obtained from the GitHub Actions cache. Alternatively, maybe the problem is that it retrieves data from the cache and saves them, and the size of those data, some of which can't be used, combined with the files that are generated, is too big.

Assuming 132696d fixes this, I'll close #1671.

@Byron Byron merged commit d47263b into GitoxideLabs:main Nov 12, 2024
18 checks passed
@EliahKagan
Copy link
Member Author

EliahKagan commented Nov 12, 2024

🎉 That worked! 🎉

@EliahKagan EliahKagan deleted the run-ci/gha-permissions branch November 12, 2024 10:56
@Byron
Copy link
Member

Byron commented Nov 12, 2024

The #1671 test PR here passes initially, and then fails after cherry-picking 5173e9a from here. So the increased space usage is triggered by upgrading the crates, somehow. Maybe this is due to dependencies being heavier, but my guess is that it is instead due to there being more stuff that has to be generated rather than obtained from the GitHub Actions cache. Alternatively, maybe the problem is that it retrieves data from the cache and saves them, and the size of those data, some of which can't be used, combined with the files that are generated, is too big.

Yes, I think it's that!

Now the cache is basically useless, yet it consumes a lot of space. Alternatively, I could have cleared it, probably lovering the footprint significantly.

Now I deleted all caches, 35GB of 10GB available, way above the limit, and think that it might now work with the extra clearing which can cut down the build time a little, presumably.

Also, I think caches will become more inefficient over time as also Cargo doesn't delete old caches, so keeps various versions of the same crate around, each time it changes. At least…so I think, judging by the continuously growing crashes on my disk.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Nov 12, 2024
These didn't make it into GitoxideLabs#1668.

Besides comments, the changes are for consistency with the
prevailing style, usually by omitting redundant YAML quoting.

Removal of the outer double quotes in the `if` in `tests-pass` is a
case of this, and produces an equivalent node in parsing (i.e. its
equivalence does not depend on anything about GHA itself). But just
to be sure, I did run

    yq '.jobs.tests-pass.steps[0].if' .github/workflows/ci.yml

before and after the change, to ensure the output was the same.

The other change here that deserves comment is the removal of `--`
as an argument to a `diff` command. When any path argument is
formed from paramter expansion or from a glob with a leading `*` or
other globbing character, `--` helps express that the following
arguments are not options. For `git diff`, a `--` expresses that
the following arguments are neither options nor refs, but paths, so
all `git diff` commands with paths in the CI workflows use `--`
even if no shell expansions are involved.

(In practice this means `--` is often useful for `diff` with paths
and, based on this habit, I had inadvertently written a `--` where
neither of the above scenarios applied. But that had actually
decreased stylistic consistency because we are not using `--`
elsewhere that the meaning of all arguments after it is unambiguous
even without examining any surrounding context.)
@EliahKagan
Copy link
Member Author

EliahKagan commented Nov 12, 2024

Now the cache is basically useless, yet it consumes a lot of space.

Why is it useless? Is this because, since 132696d, the data that make it worthwhile are now deleted in the ci-test rule when it calls clear-target before calling journey-tests? If so, this would apply only to the test job, and not to any other jobs, right?

Alternatively, I could have cleared it, probably lovering the footprint significantly.

Maybe that is why I was unable to reproduce the problem internally to my fork, even when testing with both push and pull_request triggers: the GitHub Actions cache in my fork was less than 9 GiB large when I checked it a day or two before opening this PR, and furthermore I had also cleared it, to ensure that creating new cache entries did not require an actions: write or other write permission.

Also, I think caches will become more inefficient over time as also Cargo doesn't delete old caches, so keeps various versions of the same crate around, each time it changes. At least…so I think, judging by the continuously growing crashes on my disk.

If the files where this happens are in the target directory, then I think at least in principle the exclusion of files whose modification time is older than 2 weeks should prevent it from growing in this way. That assumes, of course, that the original mtimes metadata are set on files retrieved from the GHA cache. I have not checked to see if that is the case.

Another scheduled workflow could be added that clears all rust-cache-related caches every week or two. I can add that if you want to see if it makes a difference. But maybe it would be better to investigate why mtime checking is not already solving this problem.

I wonder if the reason the cache was so large here in this upstream repository is that, separately from any recent changes, different branches (and different PRs) have their own caches to keep them from poisoning one another's caches, which would sometimes be a correctness problem and, for PRs from forks, a security problem. If I understand the documentation about this correctly, caches from main can be used by PRs, but when PRs save caches, those caches are not available in runs triggered by pushes to main, nor accessible in other PRs.

It is possible for different jobs to share caches, and maybe there are some cases where that would make sense here. In particular, the test-fixtures-windows job seems like it needn't cache independently from the test-fast windows-latest job. So long as the jobs with write permissions in the release workflow do not use caches from other jobs (currently they don't use caches at all, which seems fine to me, as they don't need to run especially fast), I think there should be no significant reduction in security from sharing caches between jobs. Whether that would make anything faster or make the total size of all caches smaller is another question; I am not sure. In addition, there is the problem that the rust-cache action seems like it might not be designed to easily facilitate sharing across separate jobs (Swatinem/rust-cache#205).

@Byron
Copy link
Member

Byron commented Nov 13, 2024

Why is it useless? Is this because, since 132696d, the data that make it worthwhile are now deleted in the ci-test rule when it calls clear-target before calling journey-tests? If so, this would apply only to the test job, and not to any other jobs, right?

Thanks for picking this up. I was blunt with words, and think "inefficient" would have been better. It's probably never useless, it is just reduced in value to the point where it might cost more to maintain the cache than the time it saves.

If the files where this happens are in the target directory, then I think at least in principle the exclusion of files whose modification time is older than 2 weeks should prevent it from growing in this way. That assumes, of course, that the original mtimes metadata are set on files retrieved from the GHA cache. I have not checked to see if that is the case.

While looking up the docs I did realize that I knew nothing at all about how it works - it's much more adapted to the task than I thought. Here is an excerpt:

In particular, the workspace crates themselves are not cached since doing so is Swatinem/rust-cache#37 (comment). For this reason, this action automatically sets CARGO_INCREMENTAL=0 to disable incremental compilation, so that the Rust compiler doesn't waste time creating the additional artifacts required for incremental builds.

If that environment variable actually carries over, then the worst offender of huge target directories would be no more. However, it will still hash its output, even if regenerated from scratch, so multiple versions of the same crate will still be created.
If the 2-week rule holds, then at least there is a limit on how many of those duplicates it would pick up. This also means that dependencies that don't change a lot and remain stable - the case for the majority of the dependency graph - they might fall out of the cache even though they are still very useful.

Whether that would make anything faster or make the total size of all caches smaller is another question; I am not sure. In addition, there is the problem that the rust-cache action seems like it might not be designed to easily facilitate sharing across separate jobs (Swatinem/rust-cache#205).

Making Windows jobs faster would of course be great, but I wouldn't push it if rust-cache isn't made for that.
Regarding the total size of caches, here it was 35GB of 10GB, so I think they are generous in enforcing such limits and I personally wouldn't care about cache-size reduction related to the overall amount stored by GitHub. I would care, however, about the size of the individual cache, but don't think there is any influence. A world where the cache only contains files that are actually used is probably never going to happen.

@EliahKagan
Copy link
Member Author

EliahKagan commented Nov 13, 2024

I certainly have no objection to characterizing caching as useless when its goal is to decrease overall load or running time but instead it increases it. I would also call that useless. (The software used to do the caching can of course still be useful in many, even most, other cases, where it does provide savings.) So it didn't seem to me that you were too blunt at all. Instead, I wanted to understand the source of the inefficiency, as well as figure out if there are any current uses of the rust-cache action in this project's workflows that should be removed.

As it currently stands (with 132696d), the benefit of caching build artifacts in the CI test job seems fairly low, because we:

  1. Retrieve cached artifacts, which are those that support whatever build steps were run between the last cargo clean and writing the cache.
  2. Run check doc unit-tests.
  3. Run clear-target, which runs cargo clean.
  4. Run ci-journey-tests.
  5. Cache artifacts from builds done as part of ci-journey-tests, since others were cleaned.

This is to say that we cache artifacts for ci-journey-tests, but ci-journey-tests never sees those artifacts. Instead, check, doc, and unit-tests get those artifacts, and they may be able to use some of them, since we are talking about dependencies, but it is probably not nearly as good as if they got the artifacts produced from those steps. Then both the artifacts from ci-journey-tests that were retrieved from the cache, and other artifacts from check doc unit-tests, are removed, and ci-journey-tests has to start off with a fresh build.

If the foregoing analysis is correct, then we should try out at least one of these changes in the hope that it will improve running time:

  • Not use the GitHub Actions cache for the CI test job, or
  • Split the test job into two jobs, at the breaking point where we are running ci-journey-tests, by splitting out the running of journey tests to a new job. Then it would no longer be necessary to delete artifacts during the job (i.e., we would omit clear-target).

The second approach is elegant in that it splits the jobs where they are effectively almost being split by the running of cargo clean. At first glance it seems like a perfect solution. But I am actually skeptical: it might turn out that, with what seems to be an increased size of build artifacts for dependencies ever since Cargo.lock was updated, this would--precisely because it would manage to cache everything we want to cache--cause the cache to grow too big.

But I think it's worth trying, so I'll open a PR for it. The running time of the CI test job has increased to about 20 minutes, and maybe this can speed things up. Even if not, it could help with that for reasons unrelated to caching, in that it would allow for more parallelism across runners.

If that environment variable actually carries over

Fortunately, it does.

However, it will still hash its output, even if regenerated from scratch, so multiple versions of the same crate will still be created.

Is this due to the effect of builds being nondeterministic, or something else?

Making Windows jobs faster would of course be great, but I wouldn't push it if rust-cache isn't made for that.

In that case, I will hold off, for now, on attempting to cause different jobs to share caches.

Regarding the total size of caches, here it was 35GB of 10GB, so I think they are generous in enforcing such limits and I personally wouldn't care about cache-size reduction related to the overall amount stored by GitHub.

They are generous in the sense that new cache entries are created, and in that old entries are sometimes dropped gradually rather than immediately. But my understanding of the documentation is that, while it's fine for the cache to grow bigger than 10 GB from time to time, this could significantly worsen performance when it happens frequently:

If you exceed the limit, GitHub will save the new cache but will begin evicting caches until the total size is less than the repository limit. The cache eviction process may cause cache thrashing, where caches are created and deleted at a high frequency. To reduce this, you can review the caches for a repository and take corrective steps, such as removing caching from specific workflows. For more information, see "Managing caches."

@Byron
Copy link
Member

Byron commented Nov 14, 2024

This is to say that we cache artifacts for ci-journey-tests, but ci-journey-tests never sees those artifacts. Instead, check, doc, and unit-tests get those artifacts, and they may be able to use some of them, since we are talking about dependencies, but it is probably not nearly as good as if they got the artifacts produced from those steps. Then both the artifacts from ci-journey-tests that were retrieved from the cache, and other artifacts from check doc unit-tests, are removed, and ci-journey-tests has to start off with a fresh build.

I think on top of that is the issue of having so many feature toggles, which make reuse of workspace dependencies (i.e. gix* crates) unlikely between unit tests and journey tests. Ultimately, I'd think that the only effective portion of the cache is the one for external dependencies.

But I think it's worth trying, so I'll open a PR for it. The running time of the CI test job has increased to about 20 minutes, and maybe this can speed things up. Even if not, it could help with that for reasons unrelated to caching, in that it would allow for more parallelism across runners.

That's exciting, let's try that!

However, it will still hash its output, even if regenerated from scratch, so multiple versions of the same crate will still be created.

Is this due to the effect of builds being nondeterministic, or something else?

It's deterministic, at least that's a goal, but what I am suggesting is that each code change will cause a new file to be created. The build output changes, and so does the hashed portion of the filename of the build artefact.

But I'd take my 'ideas' with a grain of salt as it's really just that, an idea on how this works based on using it. There is certainly more to it, and I have seen cache-trashing happening when switching between compiler versions in the same project, or when changing environment variables, which is something I'd expect not to be present if that hashing-story was entirely true.

hey are generous in the sense that new cache entries are created, and in that old entries are sometimes dropped gradually rather than immediately. But my understanding of the documentation is that, while it's fine for the cache to grow bigger than 10 GB from time to time, this could significantly worsen performance when it happens frequently:

Even that didn't happen here. Or if it happened, it happened too slowly. After all it seemed easy to generated caches larger than 10GB. After deleting the cache a couple of days ago entirely, it's now approaching 10GB already and I didn't work on gitoxide at all. I will keep watching.

@EliahKagan
Copy link
Member Author

EliahKagan commented Nov 14, 2024

That's exciting, let's try that!

I hope to open a PR soon. In the mean time, I've opened the question #1673 to inquire about something that may help keep things clear in the changes in that forthcoming PR, but that I believe need not block that PR.

Edit: PR #1674

EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Nov 14, 2024
In the `justfile`, this renames the old `ci-test` recipe to
`ci-test-full`, and has `ci-test` no longer clean the target
directory nor run journey tests.

The CI `test` job thus remains the same, but it does moderately
less work. A new CI job, `test-journey`, is introduced to run the
journey tests (still via `ci-test-journey` recipe).

This change is intended to allow greater parallelism, and possibly
make caches work better. The CI `test` job has sometimes been a few
minutes slower than before, ever since 5173e9a (GitoxideLabs#1668). See
comments in GitoxideLabs#1668 for some discussion on this change.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Nov 14, 2024
In the `justfile`, this renames the old `ci-test` recipe to
`ci-test-full`, and has `ci-test` no longer clean the target
directory nor run journey tests.

The CI `test` job thus remains the same, but it does moderately
less work. A new CI job, `test-journey`, is introduced to run the
journey tests (still via `ci-test-journey` recipe).

This change is intended to allow greater parallelism, and possibly
make caches work better. The CI `test` job has sometimes been a few
minutes slower than before, ever since 5173e9a (GitoxideLabs#1668). See
comments in GitoxideLabs#1668 for some discussion on this change.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Nov 14, 2024
In the `justfile`, this renames the old `ci-test` recipe to
`ci-test-full`, and has `ci-test` no longer clean the target
directory nor run journey tests.

The CI `test` job thus remains the same, but it does moderately
less work. A new CI job, `test-journey`, is introduced to run the
journey tests (still via `ci-test-journey` recipe).

This change is intended to allow greater parallelism, and possibly
make caches work better. The CI `test` job has sometimes been a few
minutes slower than before, ever since 5173e9a (GitoxideLabs#1668). See
comments in GitoxideLabs#1668 for some discussion on this change.
@EliahKagan
Copy link
Member Author

EliahKagan commented Nov 14, 2024

Another setting shown in #1668 (review) that can probably be tightened is:

Allow GitHub Actions to create and approve pull requests

Currently we have no workflows that create or review pull requests. Furthermore, I believe this does not need to be enabled to allow Dependabot to open security updates and version updates, even if the setting to run Dependabot on GitHub Actions runners is enabled, so long as one is not doing something like implementing automatic merging of passing Dependabot pull requests, which we are not (and which I think we are unlikely to want here).

I don't think it's really a security issue, at least with workflows that trigger on the events these do, to have this enabled. But it can always be easily reenabled if GHA workflows that create or approve PRs are ever added.

@Byron
Copy link
Member

Byron commented Nov 14, 2024

Thanks for the hint - it's disabled now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants