Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wasmtime: cache_accounts_for_opt_level failing on main #4600

Closed
fitzgen opened this issue Aug 3, 2022 · 3 comments · Fixed by #4609
Closed

Wasmtime: cache_accounts_for_opt_level failing on main #4600

fitzgen opened this issue Aug 3, 2022 · 3 comments · Fixed by #4609
Labels
bug Incorrect behavior in the current implementation that needs fixing

Comments

@fitzgen
Copy link
Member

fitzgen commented Aug 3, 2022

$ cargo test -p 'wasmtime' engine::tests::cache_accounts_for_opt_level
   Compiling wasmtime-environ v0.40.0 (/home/nick/wasmtime/crates/environ)
   Compiling wasmtime-runtime v0.40.0 (/home/nick/wasmtime/crates/runtime)
   Compiling wasmtime-cranelift v0.40.0 (/home/nick/wasmtime/crates/cranelift)
   Compiling wasmtime-jit v0.40.0 (/home/nick/wasmtime/crates/jit)
   Compiling wasmtime v0.40.0 (/home/nick/wasmtime/crates/wasmtime)
   Compiling wiggle v0.40.0 (/home/nick/wasmtime/crates/wiggle)
   Compiling wasi-common v0.40.0 (/home/nick/wasmtime/crates/wasi-common)
   Compiling wasi-cap-std-sync v0.40.0 (/home/nick/wasmtime/crates/wasi-common/cap-std-sync)
   Compiling wasmtime-wasi v0.40.0 (/home/nick/wasmtime/crates/wasi)
    Finished test [unoptimized + debuginfo] target(s) in 33.10s
     Running unittests src/lib.rs (target/debug/deps/wasmtime-c7705c65c4e36c46)

running 1 test
test engine::tests::cache_accounts_for_opt_level ... FAILED

failures:

---- engine::tests::cache_accounts_for_opt_level stdout ----
thread 'engine::tests::cache_accounts_for_opt_level' panicked at 'called `Result::unwrap()` on an `Err` value: ()', crates/jit/src/debug.rs:147:58
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    engine::tests::cache_accounts_for_opt_level

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 10 filtered out; finished in 0.03s

error: test failed, to rerun pass '-p wasmtime --lib'
@fitzgen fitzgen added the bug Incorrect behavior in the current implementation that needs fixing label Aug 3, 2022
@fitzgen
Copy link
Member Author

fitzgen commented Aug 3, 2022

This is my HEAD for main:

commit 301be7438e424a1bcaa5354ebe6d1b73b7f880ac (HEAD -> main, bytecodealliance/main)
Author: Trevor Elliott <[email protected]>
Date:   Wed Aug 3 13:28:52 2022 -0700

    x64: Begin migrating branch instructions to ISLE (#4587)
    
    https://github.com/bytecodealliance/wasmtime/pull/4587

@akirilov-arm
Copy link
Contributor

I enabled this test on AArch64 in #3606, so after seeing this issue I thought that I had caused a regression, but I just checked, and the test passed when I ran it via ci/run-tests.sh; after that I changed one of the test assertions to use a bogus value to verify that the test was actually running. Curiously, it fails when executed directly, as specified here.

@alexcrichton
Copy link
Member

Oh that explains why my bisection run on an AArch64 machine points to #3606 as the cause! A bisection on an x86_64 machine points to #4571 and while that is entirely unrelated I think I see the problem so I'll post a fix soon.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Aug 4, 2022
This commit fixes bytecodealliance#4600 in a somewhat roundabout fashion. Currently the
`main` branch of Wasmtime exhibits unusual behavior:

* If `./ci/run-tests.sh` is run then the `cache_accounts_for_opt_level`
  test does not fail.
* If `cargo test -p wasmtime --lib` is run, however, then the test
  fails.

This test is indeed being run as part of `./ci/run-tests.sh` and it's
also passing in CI. The exact failure is that part of the debuginfo
support we have takes an existing ELF image, copies it, and then appends
some information to inform profilers/gdb about the image. This code is
all quite old at this point and not 100% optimal, but that's at least
where we're at.

The problem is that the appended `ProgramHeader64` is not aligned
correctly during `cargo test -p wasmtime --lib`, which is the panic that
happens causing the test to fail. The reason, however, that this test
passes with `./ci/run-tests.sh` is that the alignment of
`ProgramHeader64` is 1 instead of 8. The reason for that is that the
`object` crate has an `unaligned` feature which forcibly unaligns all
primitives to 1 byte instead of their natural alignment. During `cargo
test -p wasmtime --lib` this feature is not enabled but during
`./ci/run-tests.sh` this feature is enabled. The feature is currently
enabled through inclusion of the `backtrace` crate which only happens
for some tests in some crates.

The alignment issue explains why the test fails on a single crate test
but fails on the whole workspace tests. The next issue I investigated
was if this test ever passed. It turns out that on v0.39.0 this test
passed, and the regression to main was introduced during bytecodealliance#4571. That
PR, however, has nothing to do with any of this! The reason that this
showed up as causing a "regression" however is because it changed
cranelift settings which changed the size of serialized metadata at the
end of a Wasmtime cache object.

Wasmtime compiled artifacts are ELF images with Wasmtime-specific
metadata appended after them. This appended metadata was making its way
all the way through to the gdbjit image itself which mean that while the
end of the ELF file itself was properly aligned the space after the
Wasmtime metadata was not aligned. This metadata changes in size over
time as Cranelift settings change which explains why bytecodealliance#4571 was the
"source" of the regression.

The fix in this commit is to discard the extra Wasmtime metadata when
creating an `MmapVec` representing the underlying ELF image. This is
already supported with `MmapVec::drain` so it was relatively easy to
insert that. This means that the gdbjit image starts with just the ELF
file itself which is always aligned at the end, which gets the test
passing with/without the `unaligned` feature in the `object` crate.
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Aug 5, 2022
This commit fixes bytecodealliance#4600 in a somewhat roundabout fashion. Currently the
`main` branch of Wasmtime exhibits unusual behavior:

* If `./ci/run-tests.sh` is run then the `cache_accounts_for_opt_level`
  test does not fail.
* If `cargo test -p wasmtime --lib` is run, however, then the test
  fails.

This test is indeed being run as part of `./ci/run-tests.sh` and it's
also passing in CI. The exact failure is that part of the debuginfo
support we have takes an existing ELF image, copies it, and then appends
some information to inform profilers/gdb about the image. This code is
all quite old at this point and not 100% optimal, but that's at least
where we're at.

The problem is that the appended `ProgramHeader64` is not aligned
correctly during `cargo test -p wasmtime --lib`, which is the panic that
happens causing the test to fail. The reason, however, that this test
passes with `./ci/run-tests.sh` is that the alignment of
`ProgramHeader64` is 1 instead of 8. The reason for that is that the
`object` crate has an `unaligned` feature which forcibly unaligns all
primitives to 1 byte instead of their natural alignment. During `cargo
test -p wasmtime --lib` this feature is not enabled but during
`./ci/run-tests.sh` this feature is enabled. The feature is currently
enabled through inclusion of the `backtrace` crate which only happens
for some tests in some crates.

The alignment issue explains why the test fails on a single crate test
but fails on the whole workspace tests. The next issue I investigated
was if this test ever passed. It turns out that on v0.39.0 this test
passed, and the regression to main was introduced during bytecodealliance#4571. That
PR, however, has nothing to do with any of this! The reason that this
showed up as causing a "regression" however is because it changed
cranelift settings which changed the size of serialized metadata at the
end of a Wasmtime cache object.

Wasmtime compiled artifacts are ELF images with Wasmtime-specific
metadata appended after them. This appended metadata was making its way
all the way through to the gdbjit image itself which mean that while the
end of the ELF file itself was properly aligned the space after the
Wasmtime metadata was not aligned. This metadata changes in size over
time as Cranelift settings change which explains why bytecodealliance#4571 was the
"source" of the regression.

The fix in this commit is to discard the extra Wasmtime metadata when
creating an `MmapVec` representing the underlying ELF image. This is
already supported with `MmapVec::drain` so it was relatively easy to
insert that. This means that the gdbjit image starts with just the ELF
file itself which is always aligned at the end, which gets the test
passing with/without the `unaligned` feature in the `object` crate.
alexcrichton added a commit that referenced this issue Aug 10, 2022
This commit fixes #4600 in a somewhat roundabout fashion. Currently the
`main` branch of Wasmtime exhibits unusual behavior:

* If `./ci/run-tests.sh` is run then the `cache_accounts_for_opt_level`
  test does not fail.
* If `cargo test -p wasmtime --lib` is run, however, then the test
  fails.

This test is indeed being run as part of `./ci/run-tests.sh` and it's
also passing in CI. The exact failure is that part of the debuginfo
support we have takes an existing ELF image, copies it, and then appends
some information to inform profilers/gdb about the image. This code is
all quite old at this point and not 100% optimal, but that's at least
where we're at.

The problem is that the appended `ProgramHeader64` is not aligned
correctly during `cargo test -p wasmtime --lib`, which is the panic that
happens causing the test to fail. The reason, however, that this test
passes with `./ci/run-tests.sh` is that the alignment of
`ProgramHeader64` is 1 instead of 8. The reason for that is that the
`object` crate has an `unaligned` feature which forcibly unaligns all
primitives to 1 byte instead of their natural alignment. During `cargo
test -p wasmtime --lib` this feature is not enabled but during
`./ci/run-tests.sh` this feature is enabled. The feature is currently
enabled through inclusion of the `backtrace` crate which only happens
for some tests in some crates.

The alignment issue explains why the test fails on a single crate test
but fails on the whole workspace tests. The next issue I investigated
was if this test ever passed. It turns out that on v0.39.0 this test
passed, and the regression to main was introduced during #4571. That
PR, however, has nothing to do with any of this! The reason that this
showed up as causing a "regression" however is because it changed
cranelift settings which changed the size of serialized metadata at the
end of a Wasmtime cache object.

Wasmtime compiled artifacts are ELF images with Wasmtime-specific
metadata appended after them. This appended metadata was making its way
all the way through to the gdbjit image itself which mean that while the
end of the ELF file itself was properly aligned the space after the
Wasmtime metadata was not aligned. This metadata changes in size over
time as Cranelift settings change which explains why #4571 was the
"source" of the regression.

The fix in this commit is to discard the extra Wasmtime metadata when
creating an `MmapVec` representing the underlying ELF image. This is
already supported with `MmapVec::drain` so it was relatively easy to
insert that. This means that the gdbjit image starts with just the ELF
file itself which is always aligned at the end, which gets the test
passing with/without the `unaligned` feature in the `object` crate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior in the current implementation that needs fixing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants