Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coreos-installer test segfaults on s390x-unknown-linux-gnu #77382

Closed
cuviper opened this issue Oct 1, 2020 · 16 comments · Fixed by #77985
Closed

coreos-installer test segfaults on s390x-unknown-linux-gnu #77382

cuviper opened this issue Oct 1, 2020 · 16 comments · Fixed by #77985
Assignees
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness O-linux Operating system: Linux O-SystemZ Target: SystemZ processors (s390x) P-critical Critical priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@cuviper
Copy link
Member

cuviper commented Oct 1, 2020

xref: https://bugzilla.redhat.com/show_bug.cgi?id=1883457

In the Fedora build of coreos-installer 0.7.0, one of the tests crashed with SIGSEGV on s390x, while it passed on every other architecture. The reporter in Red Hat bugzilla found that it happens equally with system LLVM 11 or 10. The crate's Cargo.toml also has lto = true, but they didn't see any improvement after turning that off.

The distro builds run tests in release mode (to avoid recompiling from the main build) with flags in .cargo/config:

[build]
rustc = "/usr/bin/rustc"
rustdoc = "/usr/bin/rustdoc"
rustflags = ["-Copt-level=3", "-Cdebuginfo=2", "-Clink-arg=-Wl,-z,relro,-z,now", "-Ccodegen-units=1", "--cap-lints=warn", ]

The tests pass if we don't set -Ccodegen-units=1, which suggests to me that some relevant (mis)optimization is only triggered with the full unit.

The tests are also fine with Fedora's rust-1.45.2, only crashing on rust-1.46.0. I also reproduced the crash on rustup's stable 1.46.0, and I used cargo bisect-rustc to narrow down to nightly-2020-07-05 passing, while nightly-2020-07-06 crashes. That's 0cd7ff7...2753fab, which is just #73879 (cc @ecstatic-morse). I don't know if codegen-units would affect that, or if it's just triggering something new in LLVM.

Here's the reporter's backtrace:

#0  0x000002aa0c3616c4 in core::ptr::drop_in_place () at /builddir/build/BUILD/rustc-1.46.0-src/src/libcore/ptr/mod.rs:184
#1  core::ptr::drop_in_place () at /builddir/build/BUILD/rustc-1.46.0-src/src/libcore/ptr/mod.rs:184
#2  core::ptr::drop_in_place () at /builddir/build/BUILD/rustc-1.46.0-src/src/libcore/ptr/mod.rs:184
#3  core::ptr::drop_in_place () at /builddir/build/BUILD/rustc-1.46.0-src/src/libcore/ptr/mod.rs:184
#4  core::ptr::drop_in_place () at /builddir/build/BUILD/rustc-1.46.0-src/src/libcore/ptr/mod.rs:184
#5  <libcoreinst::source::FileLocation as libcoreinst::source::ImageLocation>::sources (self=0x3ffcc3fc4f8) at src/source.rs:135
#6  libcoreinst::download::tests::test_write_image_limit () at src/download.rs:489
  1. https://github.com/coreos/coreos-installer/blob/79b59763bd17afba9ac6b13dfdfef401e2d9bde9/src/source.rs#L135
  2. https://github.com/coreos/coreos-installer/blob/79b59763bd17afba9ac6b13dfdfef401e2d9bde9/src/download.rs#L489
@cuviper cuviper added O-linux Operating system: Linux A-codegen Area: Code generation T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-bug Category: This is a bug. O-SystemZ Target: SystemZ processors (s390x) regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Oct 1, 2020
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Oct 1, 2020
@ecstatic-morse
Copy link
Contributor

ecstatic-morse commented Oct 1, 2020

cc #74551, another miscompilation on a tier 2 architecture that appeared after a similar optimization. If the root cause is the same, we should have a better chance of diagnosing it now, since it occurs in a crate's test case instead of while compiling Firefox. cc @pnkfelix as well.

@camelid camelid added A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. labels Oct 1, 2020
@apiraino
Copy link
Contributor

apiraino commented Oct 1, 2020

Assigning P-critical as discussed as part of the Prioritization Working Group procedure and removing I-prioritize.

@apiraino apiraino added P-critical Critical priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Oct 1, 2020
ecstatic-morse added a commit to ecstatic-morse/rust that referenced this issue Oct 1, 2020
Its purpose is to assist in debugging rust-lang#77382 and rust-lang#74551.
@cuviper
Copy link
Member Author

cuviper commented Oct 2, 2020

Is -Zmir-opt-level=0 supposed to avoid #73879's changes? That was mentioned in zulip #t-compiler/meetings, but I still get the crash with that option on nightly-2020-07-06.

@ecstatic-morse
Copy link
Contributor

ecstatic-morse commented Oct 2, 2020

No. That person was mistaken. Drop elaboration runs unconditionally as part of "post_borrowck_cleanup".

@cuviper
Copy link
Member Author

cuviper commented Oct 2, 2020

Updating for cross-reference: I tried -Zprecise-enum-drop-elaboration=no in #77423 (comment), but it didn't help.

@ecstatic-morse
Copy link
Contributor

ecstatic-morse commented Oct 2, 2020

As a sanity check, was there any difference in the generated MIR for <libcoreinst::source::FileLocation as libcoreinst::source::ImageLocation>::sources? I tested the flag locally and it worked, but we should make sure. If there's no trivial error with the flag, I'm not sure what's going on.

I guess we could with the more precise version of MaybeInitializedPlaces but the less precise version of MaybeUninitializedPlaces (that was the status quo before #73879 ). I would be surprised if we had a miscompilation when using the precise version of both (master) or the imprecise version of both (master with -Zprecise-enum-drop-elaboration or a release prior to #68528) but not with one precise and one imprecise (a release between #68528 and #73879).

@cuviper
Copy link
Member Author

cuviper commented Oct 2, 2020

Here are the test's target/release/deps/libcoreinst* after --emit=mir,llvm-ir -Csave-temps:

toolchain download
✔️ nightly-2020-07-05 target-2020-07-05.tar.gz
nightly-2020-07-06 target-2020-07-06.tar.gz
nightly-2020-10-01 target-2020-10-01.tar.gz
#77423 -Zprecise-enum-drop-elaboration=no target-pr77423.tar.gz

They were compiled on Fedora Rawhide (34), in case anyone wants to find libraries to actually run the test executables.

@cuviper
Copy link
Member Author

cuviper commented Oct 3, 2020

Here's std::io::error::Repr for reference:

enum Repr {
    Os(i32),
    Simple(ErrorKind),
    Custom(Box<Custom>),
}

From <libcoreinst::source::FileLocation as libcoreinst::source::ImageLocation>::sources:

            Err(err) => {
                eprintln!("Couldn't read signature file: {}", err);
                None
            }

The error we get during the test case is Repr::Os(2), ENOENT, "No such file or directory", which prints fine.

With nightly-2020-07-05, the assembly after printing is:

+3102>       brasl   %r14,0x2aa0034d340 <std::io::stdio::_eprint>
+3108>       cli     504(%r15),2
+3112>       mvghi   704(%r15),0
+3118>       jl      0x2aa000cd5b8 <libcoreinst::download::tests::test_write_image_limit+3480>
+3122>       lg      %r13,512(%r15)
+3128>       lg      %r1,8(%r13)
+3134>       lg      %r2,0(%r13)

I believe that cli 2 is comparing the Repr tag, where only Custom needs dropping. When I step through, that jl branch is taken to skip ahead. Otherwise, you can see that it would load 512(%r15) (offset 8 from before) for the Box into %r13, then proceed to dereference that pointer.

With nightly-2020-07-06, the assembly after printing is:

+4514>       brasl   %r14,0x2aa00349d40 <std::io::stdio::_eprint>
+4520>       cli     552(%r15),2
+4524>       llilh   %r1,16
+4528>       agr     %r1,%r15
+4532>       mvghi   768(%r1),0
+4538>       jl      0x2aa000cc2ce <libcoreinst::download::tests::test_write_image_limit+4718>
+4542>       lg      %r13,560(%r15)
+4548>       lg      %r1,8(%r13)
+4554>       lg      %r2,0(%r13)

Looks similar with cli 2, but I don't really speak SystemZ to guess what llilh and agr are doing. But in the debugger I see that this jl branch is not taken, and then it loads a garbage pointer into %r13 (always seems to be 0x21). The load of 8(%r13) is where we SIGSEGV.

@cuviper
Copy link
Member Author

cuviper commented Oct 3, 2020

Found the instruction reference...

  • LLILH is Load Logical Immediate (low high)
  • AGR is Add (64), Condition code set

So it seems to be clobbering the CLI condition.

I'm guessing the revised enum drops just exposed an existing codegen bug.

@ecstatic-morse
Copy link
Contributor

ecstatic-morse commented Oct 3, 2020

Nice!

Since I didn't want you to have all the fun, here is the LLVM IR for the relevant section. The first basic block is the comparison with the enum discriminant of io::Error and the second is the first piece of drop glue. The structure looks to be the exact same, but there are additional !noalias annotations in the one with the miscompilation. I think these are correct in a vacuum, but I don't have enough LLVM knowledge to say for sure. I certainly don't have enough to hypothesize why they caused the instruction scheduler to forget all about the need to preserve flags/condition codes.

2020-07-05:

bb60.i:                                           ; preds = %bb59.i
  call void @llvm.lifetime.end.p0i8(i64 48, i8* nonnull %456), !dbg !99333, !noalias !98802
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %383), !dbg !99333, !noalias !98802
  %464 = bitcast %"core::option::Option<alloc::vec::Vec<u8>>"* %signature.i to {}**, !dbg !99342
  store {}* null, {}** %464, align 8, !dbg !99342, !noalias !98802
  %465 = load i8, i8* %453, align 8, !dbg !99343, !range !15343, !noalias !98802
  %switch.i.i254.i = icmp ult i8 %465, 2, !dbg !99343
  br i1 %switch.i.i254.i, label %bb61.i, label %bb2.i.i.i235, !dbg !99343

bb2.i.i.i235:                                     ; preds = %bb60.i
  %466 = getelementptr inbounds %"std::io::error::Error", %"std::io::error::Error"* %err4.i, i64 0, i32 1, i32 2, i64 7, !dbg !99343
  %467 = bitcast i8* %466 to %"std::io::error::Custom"**, !dbg !99343
  %468 = load %"std::io::error::Custom"*, %"std::io::error::Custom"** %467, align 8, !dbg !99346, !noalias !98802, !nonnull !6
  %469 = bitcast %"std::io::error::Custom"* %468 to {}**, !dbg !99348
  %470 = load {}*, {}** %469, align 8, !dbg !99348, !nonnull !6
  %471 = getelementptr inbounds %"std::io::error::Custom", %"std::io::error::Custom"* %468, i64 0, i32 1, i32 1, !dbg !99348
  %472 = bitcast [3 x i64]** %471 to void ({}*)***, !dbg !99348
  %473 = load void ({}*)**, void ({}*)*** %472, align 8, !dbg !99348, !nonnull !6
  %474 = load void ({}*)*, void ({}*)** %473, align 8, !dbg !99348, !invariant.load !6, !nonnull !6
  invoke void %474({}* nonnull %470)
          to label %bb3.i.i.i.i.i255.i unwind label %cleanup.i.i.i.i.i.i238, !dbg !99348

2020-07-06:

bb63.i:                                           ; preds = %bb62.i
  call void @llvm.lifetime.end.p0i8(i64 48, i8* nonnull %413), !dbg !100148, !noalias !99592
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %335), !dbg !100148, !noalias !99592
  %421 = bitcast %"core::option::Option<alloc::vec::Vec<u8>>"* %signature.i to {}**, !dbg !100157
  store {}* null, {}** %421, align 8, !dbg !100157, !noalias !99592
  %422 = load i8, i8* %410, align 8, !dbg !100158, !range !19089, !noalias !99592
  %switch.i.i283.i = icmp ult i8 %422, 2, !dbg !100158
  br i1 %switch.i.i283.i, label %bb64.i, label %bb2.i.i.i, !dbg !100158

bb2.i.i.i:                                        ; preds = %bb63.i
  %423 = getelementptr inbounds %"std::io::error::Error", %"std::io::error::Error"* %err4.i, i64 0, i32 1, i32 2, i64 7, !dbg !100158
  %424 = bitcast i8* %423 to %"std::io::error::Custom"**, !dbg !100158
  %425 = load %"std::io::error::Custom"*, %"std::io::error::Custom"** %424, align 8, !dbg !100161, !noalias !99592, !nonnull !6
  %426 = bitcast %"std::io::error::Custom"* %425 to {}**, !dbg !100163
  %427 = load {}*, {}** %426, align 8, !dbg !100163, !noalias !99730, !nonnull !6
  %428 = getelementptr inbounds %"std::io::error::Custom", %"std::io::error::Custom"* %425, i64 0, i32 1, i32 1, !dbg !100163
  %429 = bitcast [3 x i64]** %428 to void ({}*)***, !dbg !100163
  %430 = load void ({}*)**, void ({}*)*** %429, align 8, !dbg !100163, !noalias !99730, !nonnull !6
  %431 = load void ({}*)*, void ({}*)** %430, align 8, !dbg !100163, !invariant.load !6, !noalias !99730, !nonnull !6
  invoke void %431({}* nonnull %427)
          to label %bb3.i.i.i.i.i285.i unwind label %cleanup.i.i.i.i.i287.i, !dbg !100163, !noalias !99730

@cuviper
Copy link
Member Author

cuviper commented Oct 3, 2020

there are additional !noalias annotations in the one with the miscompilation.

Box gets special treatment -- PointerKind::UniqueOwned -- did something about your patch affect that visibility?

I'm guessing, maybe before it was dropped indirectly through &mut (Shared pending #54878), but with the more precise drops it may be dropped by value? (UniqueOwned does get noalias.) Or if it used a raw *mut before (like drop_in_place), that doesn't get any PointerKind annotations. Of course the Drop trait uses &mut, but I don't know what it looks like when an implicit MIR drop is elaborated.

@ecstatic-morse
Copy link
Contributor

ecstatic-morse commented Oct 3, 2020

In the MIR for the "precise" version, we move the error variant into a local, then drop the local.

_63 = move ((_41 as Err).0)
...
drop(_63)

In the MIR for the imprecise version, we check the drop flag for the Err variant, then drop that projection.

drop(((_41 as Err).0))

As you alluded to, perhaps there's some difference in how we were emitting LLVM IR for drops of locals vs drops of projections? I'm not an expert on codegen. However, given that this miscompilation is present on the latest master even without "precise" drop elaboration, we must have eliminated this difference at some point. I'll check the LLVM IR for the latest nightly to be sure. Indeed, the noalias declarations persist even with precise drop elaboration disabled on a recent master:

#77423 -Zprecise-enum-drop-elaboration=no
bb63.i:                                           ; preds = %bb62.i
  call void @llvm.lifetime.end.p0i8(i64 48, i8* nonnull %460), !dbg !104817, !noalias !104241
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %353), !dbg !104817, !noalias !104241
  %468 = bitcast %"std::option::Option<std::vec::Vec<u8>>"* %signature.i to {}**, !dbg !104826
  store {}* null, {}** %468, align 8, !dbg !104826, !noalias !104241
  %469 = load i8, i8* %457, align 8, !dbg !104827, !range !3262, !noalias !104241
  %switch.i.i.i = icmp ult i8 %469, 2, !dbg !104827
  br i1 %switch.i.i.i, label %bb107.i, label %bb2.i.i283.i, !dbg !104827

bb2.i.i283.i:                                     ; preds = %bb63.i
  %470 = getelementptr inbounds %"std::io::Error", %"std::io::Error"* %err4.i, i64 0, i32 1, i32 2, i64 7, !dbg !104827
  %471 = bitcast i8* %470 to %"std::io::error::Custom"**, !dbg !104827
  %472 = load %"std::io::error::Custom"*, %"std::io::error::Custom"** %471, align 8, !dbg !104669, !noalias !104241, !nonnull !6
  %473 = bitcast %"std::io::error::Custom"* %472 to {}**, !dbg !104663
  %474 = load {}*, {}** %473, align 8, !dbg !104663, !noalias !104430, !nonnull !6
  %475 = getelementptr inbounds %"std::io::error::Custom", %"std::io::error::Custom"* %472, i64 0, i32 1, i32 1, !dbg !104663
  %476 = bitcast [3 x i64]** %475 to void ({}*)***, !dbg !104663
  %477 = load void ({}*)**, void ({}*)*** %476, align 8, !dbg !104663, !noalias !104430, !nonnull !6
  %478 = load void ({}*)*, void ({}*)** %477, align 8, !dbg !104663, !invariant.load !6, !noalias !104430, !nonnull !6
  invoke void %478({}* nonnull %474)
          to label %bb3.i.i.i.i.i284.i unwind label %bb44.i, !dbg !104663, !noalias !104430

One thing occurs to me now that didn't last night: These lines seem to be checking the vtable for the boxed trait object in io::Error::Custom:

 %429 = bitcast [3 x i64]** %428 to void ({}*)***, !dbg !100163
 %430 = load void ({}*)**, void ({}*)*** %429, align 8, !dbg !100163, !noalias !99730, !nonnull !6
 %431 = load void ({}*)*, void ({}*)** %430, align 8, !dbg !100163, !invariant.load !6, !noalias !99730, !nonnull !6

However, trait object vtables can definitely alias globally. I don't see how this kind of error would cause the miscompilation we're seeing, but as I said I don't have much LLVM expertise. Regardless, it seems like something we should fix?

@cuviper
Copy link
Member Author

cuviper commented Oct 3, 2020

However, trait object vtables can definitely alias globally.

The actual vtable should be shared read-only, which is fine for LLVM noalias.

@cuviper
Copy link
Member Author

cuviper commented Oct 5, 2020

Let's see what LLVM devs say: https://bugs.llvm.org/show_bug.cgi?id=47736

@Aaron1011
Copy link
Member

It looks like an LLVM patch was accepted: https://reviews.llvm.org/D89034

@cuviper
Copy link
Member Author

cuviper commented Oct 14, 2020

I'll backport that patch after the LLVM 11.0.0 final rebase lands -- #77948.
(I thought about doing that together, but figured it would be better to separate concerns.)

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this issue Oct 16, 2020
…nikic

llvm: backport SystemZ fix for AGR clobbers

Fixes rust-lang#77382.
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this issue Oct 17, 2020
…nikic

llvm: backport SystemZ fix for AGR clobbers

Fixes rust-lang#77382.
@bors bors closed this as completed in b9c45d1 Oct 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness O-linux Operating system: Linux O-SystemZ Target: SystemZ processors (s390x) P-critical Critical priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants