-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
network zlib stream error confusion #8517
network zlib stream error confusion #8517
Comments
This looks reminiscent of #8258 which was fixed in libgit2/libgit2#5536. Rust 1.45 did not include that PR, but 1.46 (beta) and nightly both have the fix. Do you know if these errors are showing up using Cargo from 1.45? (bootstrap not updated?) |
I'm pretty sure it is the version posted above
|
Bah. It probably wouldn't hurt to throw this onto the list of spurious errors, but I agree it'd be good to understand. AFAIK zlib inflate errors indicate a corrupt data (e.g. not zlib) getting fed to the zlib inflater. That previous issue was due to a bug in libgit2 (valid zlib stream, invalidly fed), and this may also be that. This'll be really hard to track down though without a reliable reproduction. Skimming libgit2's commit history I don't see much related to this since that original fix, so I don't think we can get lucky by "simply update the git repo". |
This may be a bad band-aid for now, but the goal is to help address rust-lang#8517 where this has been showing up in the wild quite a lot.
Flag git zlib errors as spurious errors This may be a bad band-aid for now, but the goal is to help address #8517 where this has been showing up in the wild quite a lot.
Another instance: https://dev.azure.com/rust-lang/rust/_build/results?buildId=35201&view=logs&j=339a807e-05eb-50a3-f69a-0d7a3a5b2cef&t=1f87e16f-7721-57ff-dca8-75b52e5f4eac I wonder if there is any significance that these cases happen on Windows i686 (i686-mingw-1 and i686-msvc-1). |
Hm the fact that it was retried there and it didn't help anything makes me think that it's not a spurious failure... |
Hm, looking at other instances: rust-lang/rust#74245 (comment) rust-lang/rust#74245 (comment) rust-lang/rust#74245 (comment) rust-lang/rust#74611 (comment) rust-lang/rust#74495 (comment) rust-lang/rust#74520 (comment) rust-lang/rust#74219 (comment) Raising priority since this seems important. |
It cannot be spurious, it affects probably all i686 platforms. |
Notes from my investigation: This seems to have gone away for now, though I would anticipate it happening again since it has been coming and going. My most likely hypothesis is that there is some caching layer between azure's network and github's git servers that is misbehaving. AFAIK, all of the rust-lang/rust jobs start with an empty registry directory (based on my testing with GHA and Azure, they both start with an empty directory, and rust-lang/rust's CI jobs are not configured to cache it). So I don't think it is related to any partial caching, or incremental git updates. In rust-lang/rust#74689 I attempted to use
Interestingly, that failed on several different hosts (dist-i686, dist-x86_64-linux, dist-x86_64-linux-alt). I'm still really curious why previous failures all seemed to be on the i686 jobs. During the outage, I was able to easily reproduce on github actions. It didn't really matter which version of Cargo I was using: Unfortunately the errors stopped before I could perform more testing. So I don't think it is related to any recent libgit2 changes (and since it also failed with Will revisit this if it ever starts acting up again and I'm around to catch it. Unfortunately there isn't really a good way to be alerted when it does. 😦 |
Updates from 2020-07-30 investigation: Saw errors on GitHub Actions. Debug logging showed it consistently failing with i686 to the IP 140.82.114.4. Was able to consistently repro locally against that IP as well from about 10am till 11:50am (PDT). Other IPs that were not failing include 140.82.112.3 140.82.112.4 140.82.113.3 140.82.113.4. Could not repro with git command-line or 64-bit cargo. With trace debugging, saw that it consistently failed after downloading around 75MB of data. The last few lines always looked like this:
which leads me to believe it finished downloading the entire pack file, since that is what normal behavior looks like (a bunch of 16384 lines, then a final partial line (9673) and then the "left intact" line). Counting the number of bytes against a successful download shows pretty much the same amount of data downloaded. Ran with a debugger to investigate the part in libgit2 where it fails. This is where it fails. The local variables I was able to capture:
I am suspicious of this line which does not check if the return value is NULL. Although, So, it looks like it was asked to unpack 35 bytes. It looks like it succeeded, but for some reason still set A hypothesis is that there is some edge case in the zstream inflate routine that is getting triggered, which the 32-bit build is sensitive to. It's not clear what. For next time I wish I had:
Another thing I'm uncertain about: The "mwindow" is a window into the pack file. I would suspect at the end of the compressed stream in the pack file, this window includes the 20-byte SHA1 checksum at the end of the pack file. Is it possible that zlib was confused by that extra 20-bytes somehow? |
I found I had an extra copy of the problematic index. I have posted a repro here: https://github.com/ehuss/zlib-repro I haven't really figured out much so far. The git_mwindow stuff is kinda wonky, and the window sizes on 32-bit are a lot different from 64-bit. In 64-bit, the window is fairly consistent (the entire size of the pack file which is 76,488,803 bytes). But in 32-bit, the sizes change a lot, I can't figure out what it is doing. |
I've filed an upstream issue at libgit2/libgit2#5590 |
Update git2. Closes rust-lang#8517 Closes rust-lang#8588 Closes rust-lang#8352 Closes rust-lang#4777 Closes rust-lang#8746 I only added a test for one of these. I can add for the others if you want.
[beta] backport libgit2 This is a backport of #8778 to beta. The reason is that #8517 has started showing up again on rust-lang/rust CI, and I think it is important to fix that. This is risky, because this is a large update with a lot of untested changes. However, I think the risk is worth it to fix #8517. This also includes #8772 to get Cargo's CI to pass. I think the risk for that is low (the [changes](toml-rs/toml-rs@0.5.6...0.5.7) are small). However, I'd be fine with just modifying the tests to pass if you want to drop it.
…acrum [beta] Update cargo 1 commits in 75615f8e69f748d7ef0df7bc0b064a9b1f5c78b2..65cbdd2dc0b7e877577474b98b7d071308d0bb6f 2020-09-29 18:42:19 +0000 to 2020-10-14 16:10:15 +0000 - [beta] backport libgit2 (rust-lang/cargo#8780) This is primarily to bring in a fix for rust-lang/cargo#8517 which is starting to show up on rust-lang/rust's CI again.
…crum Bump bootstrap compiler Mainly to bring in rust-lang#77953 to fix rust-lang/cargo#8517.
Bump bootstrap compiler Mainly to bring in rust-lang#77953 to fix rust-lang/cargo#8517.
Reopening since this unfortunately wasn't completely fixed. |
@alexcrichton You mentioned maybe using Another option would be to pre-fetch the registry using a 64-bit Cargo. Either way, it is not a trivial change to make. |
Update git2 Hopefully this will finally fix #8517 (and hopefully won't cause much breakage). Major changes: - Requires libgit2 1.1: rust-lang/git2-rs#632 - Update libgit2: rust-lang/git2-rs#642 (changes: libgit2/libgit2@7f4fa17...530d37b)
Problem
On rust-lang/rust CI, I've been noticing this error happening a lot lately:
I think the CI has been having network problems in general, but this error in particular seems a little confusing to me. I don't know what would cause this error. I think it might be helpful to understand this error case a little better.
Here is the corresponding error in libgit2.
Should this error class be added to the spurious error list? What can possibly cause this to happen?
Notes
cargo 1.46.0-beta (4f74d9b2a 2020-07-08)
The text was updated successfully, but these errors were encountered: