Tracking Issue for high failure rates on Windows MSVC CI with filesystem errors #127883
Comments
Do you have a link to a failed pipeline run?
You can also copy any of the
Another: #128982 (comment)
Affects my PR #129019 so I thought I'd take a gander at the PML shared here. There were two One of them happened before any [1]:
If the Handles tool doesn't reveal anything, a live kernel dump would be the last resort, taken via a PS helper if no GUI is installed, or via Task Manager (right-click PID 4, then "create kernel dump"). Goes w/o saying, the kernel dump should be shared if and only if the machine doesn't carry any sensitive information.
The CI machines probably have sensitive authorization tokens somewhere in their memory, unfortunately. Though T-infra is pretty diligent about the principle of least privilege, they still need to move some data around to some fairly specific buckets.
We have tried using
#130569 (comment) failure has a case where:
in addition to a different failure
so there's definitely something stepping over something
`cc` was previously pinned because it dropped support for Visual Studio 12 (2013), and we wanted to decouple that from the rest of the automated updates. As noted in [2], there is no longer anything indicating we support VS2013, so it should be okay to unpin it. `cc` 1.1.22 contains a fix that may help improve the high MSVC CI failure rate [3], so we also have motivation to update to that point. [1]: rust-lang#129307 [2]: rust-lang#129307 (comment) [3]: rust-lang#127883
`cc` was previously pinned because version 1.1.106 dropped support for Visual Studio 12 (2013), and we wanted to decouple that from the rest of the automated updates. As noted in [2], there is no longer anything indicating we support VS2013, so it should be okay to unpin it. `cc` 1.1.22 contains a fix that may help improve the high MSVC CI failure rate [3], so we also have motivation to update to that point. [1]: rust-lang#129307 [2]: rust-lang#129307 (comment) [3]: rust-lang#127883
Unpin `cc` and upgrade to the latest version `cc` was previously pinned because 1.1.106 dropped support for Visual Studio 12 (2013), and we wanted to decouple that from the rest of the automated updates. As noted in [2], there is no longer anything indicating we support VS2013, so it should be okay to unpin it. `cc` 1.1.22 contains a fix that may help improve the high MSVC CI failure rate [3], so we also have motivation to update to that point. [1]: rust-lang#129307 [2]: rust-lang#129307 (comment) [3]: rust-lang#127883 try-job: x86_64-msvc-ext
Small update: some Cargo.lock updates were hitting a 100% failure rate and failing to merge due to this sort of failure on MSVC jobs. The problem was identified to be a change to Anyway, Chris fixed that issue at rust-lang/cc-rs#1215 and that was released in 1.1.22. We just finished bumping Edit: nope, not solved :( #131133 (comment)
Similar to cerbero, we run meson commands inside a powershell script that will examine the output for spurious errors and re-run that particular command. https://gitlab.freedesktop.org/slomo/gstreamer/-/jobs/65265526 https://gitlab.freedesktop.org/slomo/gstreamer/-/jobs/65265524 https://gitlab.freedesktop.org/nirbheek/gstreamer/-/jobs/65331410 https://gitlab.freedesktop.org/jcowgill/gstreamer/-/jobs/65489856 rust-lang/rust#127883 (comment) Co-Authored-by: L. E. Segovia <[email protected]> Part-of: <https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/7680>
Starting around 2024-06-27, the rust-lang CI began encountering a very high failure rate on the MSVC Windows builders (~15% of all builds?). These builders are hitting various filesystem errors, such as "Access is denied", "used by another process", "cannot open file", etc.
If you run into them, please tag the PR with the CI-spurious-fail-msvc label (CI spurious failure: target env msvc) so we can try to find clues about what could be causing the high failure rate. You can also click on the label to see the list of PRs that were/are affected by this class of failures.
From what I can tell, this doesn't seem to be affecting the GNU builders.
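For anyone triaging logs by hand, here is a minimal sketch of how a CI log could be scanned for the messages quoted above. The function names and the marker list are hypothetical, not part of any rust-lang tooling, and the three substrings are just the examples from this issue, so the list is certainly incomplete.

```rust
/// Substrings taken from the error messages quoted in this issue.
/// Assumption: this list is illustrative only and far from exhaustive.
const SPURIOUS_MARKERS: [&str; 3] = [
    "access is denied",
    "used by another process",
    "cannot open file",
];

/// Returns true if a single log line contains one of the markers
/// (compared case-insensitively).
pub fn is_spurious_fs_error(line: &str) -> bool {
    let lower = line.to_lowercase();
    SPURIOUS_MARKERS.iter().any(|marker| lower.contains(marker))
}

/// Returns the 1-based line numbers in `log` that look like one of the
/// spurious filesystem failures.
pub fn spurious_lines(log: &str) -> Vec<usize> {
    log.lines()
        .enumerate()
        .filter(|(_, line)| is_spurious_fs_error(line))
        .map(|(index, _)| index + 1)
        .collect()
}
```

A plain substring match like this will also flag genuine permission problems, so a hit is only a hint that the PR may deserve the label, not proof that the failure was spurious.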
Discussions
Zulip discussion: https://rust-lang.zulipchat.com/#narrow/stream/242791-t-infra/topic/Spurious.20CI.20errors.20on.20x86_64-msvc-ext
Past Investigations
Starting 2024-07-02, #127152 added a mitigation measure in bootstrap to cover the most common culprit of bootstrap attempting to delete an executable. However, there are several other programs that are still having problems, such as rustc itself, the msvc linker, and cargo.
I have tried using `handle.exe` and the RestartManager API to try to detect if there is another process with an open handle on a file, but without success.
I have tried rolling back the source tree to 2024-06-25 (before we started seeing the problem in CI), and it still reproduces.
The last Windows image release before this started was https://github.com/actions/runner-images/releases/tag/win22%2F20240618.1. From what I can tell, this rolled out about 5 days earlier, during which there weren't any failures, but it is difficult to tell if that could be related.
The last stage0 bump was 2024-06-11, several weeks before it started.
actions/runner-images#4086 is a similar issue we've had in the past, though we don't know what the fix was.
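As an illustration of the general retry-style mitigation discussed in this thread, here is a rough sketch of retrying a filesystem operation when it fails with an error that looks transient. This is not the code from #127152; `retry_transient`, `is_transient`, and the choice of which errors count as transient are assumptions made for this example.

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Heuristic (an assumption of this sketch): treat "permission denied" as
/// transient everywhere, and on Windows also treat os error 32
/// (ERROR_SHARING_VIOLATION, "used by another process") as transient.
fn is_transient(err: &io::Error) -> bool {
    err.kind() == io::ErrorKind::PermissionDenied
        || (cfg!(windows) && err.raw_os_error() == Some(32))
}

/// Runs `op` up to `max_attempts` times, sleeping `delay` between attempts,
/// but only retries when the error looks transient. Any other error, or the
/// final failure, is returned to the caller unchanged.
pub fn retry_transient<T>(
    max_attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut attempt = 1;
    loop {
        match op() {
            Err(err) if attempt < max_attempts && is_transient(&err) => {
                attempt += 1;
                thread::sleep(delay);
            }
            other => return other,
        }
    }
}
```

A caller might wrap a delete as `retry_transient(5, Duration::from_millis(100), || std::fs::remove_file(&path))`. Retrying only hides the symptom; it does not explain which process is holding the file.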
Example
Example errors seen:
C:\a\rust\rust\build\x86_64-pc-windows-msvc\stage1-tools\x86_64-pc-windows-msvc\release\miri.exe to C:\a\rust\rust\build\x86_64-pc-windows-msvc\stage1-tools-bin\miri.exe
C:\a\rust\rust\build\x86_64-pc-windows-msvc\stage0-rustc\x86_64-pc-windows-msvc\release\rustc-main.exe to C:\a\rust\rust\build\x86_64-pc-windows-msvc\stage1\bin\rustc.exe
C:\a\rust\rust\build\x86_64-pc-windows-msvc\stage2-tools\x86_64-pc-windows-msvc\release\cargo.exe
Example log entries
Example CI log