Cranelift: Use a fixpoint loop to compute the best value for each eclass #7859

Merged
merged 7 commits into from
Feb 5, 2024

Conversation

fitzgen
Member

@fitzgen fitzgen commented Feb 2, 2024

Fixes #7857

Member

@elliottt elliottt left a comment

This looks great!

cranelift/codegen/src/egraph/cost.rs
cranelift/codegen/src/egraph/elaborate.rs
cranelift/codegen/src/egraph/elaborate.rs
@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Feb 2, 2024

for (value, def) in self.func.dfg.values_and_defs() {
    // If the cost of this value is finite, then we've already found
    // its final cost.
Member

Sorry for the drive-by from the sidelines, just a possible clarification request here though after thinking about cost updates during a long drive today:

It's not immediately obvious to me why this (once finite, then final) property is the case; I'm curious what reasoning y'all have gone through on this and/or what you've observed? I think a node's cost can continue to decrease as we discover more finite costs (consider a union node: min(20, infinity) == 20 in first pass, min(20, 10) == 10 in second pass; then another node that uses that as an arg). Or is there an argument we can make why this shouldn't happen in practice?
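The scenario in this question can be sketched in a few lines. This is only an illustration under stated assumptions: `INFINITY`, `union_cost`, and `user_cost` are hypothetical names, costs are plain `u32` values, and `u32::MAX` stands in for "not yet computed". None of this is Cranelift's actual API.

```rust
/// Hypothetical stand-in for an unknown cost: `u32::MAX` plays "infinity".
const INFINITY: u32 = u32::MAX;

/// Cost of a union node: the cheaper of its two alternatives.
/// Note that `min` silently drops an infinite (not yet computed) operand.
fn union_cost(a: u32, b: u32) -> u32 {
    a.min(b)
}

/// Cost of an ordinary node that uses one argument: its own cost plus the
/// argument's cost. Saturating addition keeps infinity at infinity.
fn user_cost(op_cost: u32, arg: u32) -> u32 {
    op_cost.saturating_add(arg)
}
```

In a first pass `union_cost(20, INFINITY)` is 20, so a user with its own cost 1 gets 21. Once the second operand is known to cost 10, the union drops to 10 and the user to 11, so the finite value seen in the first pass was not final.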

Member

This is a great point Chris! When @fitzgen and I were discussing the fixpoint change yesterday, we reasoned that it was okay to skip finite values because we were assuming two things:

  • We would change Cost addition to saturate to infinity() rather than to MAX_COST
  • As we can't produce cycles, a fixpoint would cause everything to eventually settle out to finite cost

As you pointed out, the flaw with this reasoning is that the handling of Union values will not behave this way, instead preferring finite values to infinite.

Since addition now saturates to infinity which will ensure that Result nodes don't appear finite until all their dependencies have been processed, what do you think about only computing the min if both arguments to a Union are finite? I think that change would make more concrete our use of the infinity() cost: it's a marker for where all the arguments have not yet been processed.
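A minimal sketch of the two ideas in this comment, assuming a `Cost` newtype over `u32` in which `u32::MAX` represents `infinity()`. The helper `union_cost_if_both_finite` is a hypothetical name for the suggestion made here. It is not code from the PR, and the thread below ends up relying on the full fixpoint instead.

```rust
use std::ops::Add;

/// Assumed representation: a cost is a `u32`, and `u32::MAX` means infinity.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Cost(u32);

impl Cost {
    fn infinity() -> Cost {
        Cost(u32::MAX)
    }

    fn is_finite(self) -> bool {
        self != Cost::infinity()
    }
}

impl Add for Cost {
    type Output = Cost;

    /// Addition saturates to infinity: adding anything to infinity, or
    /// overflowing, yields infinity rather than a large finite cost.
    fn add(self, other: Cost) -> Cost {
        Cost(self.0.saturating_add(other.0))
    }
}

/// Hypothetical variant of the union rule: only take the minimum once both
/// alternatives have been processed, otherwise stay infinite.
fn union_cost_if_both_finite(a: Cost, b: Cost) -> Cost {
    if a.is_finite() && b.is_finite() {
        a.min(b)
    } else {
        Cost::infinity()
    }
}
```

With this rule, infinity acts purely as a "not all arguments processed yet" marker: a union with one unprocessed operand stays infinite instead of reporting the cost of the other operand.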

Member Author

For a union(a, b) to be finite but not in its final form, one of a or b would have to be finite and the other infinite. But the only way we can still have an infinite cost for an operand value when computing the cost of the current value is if the operand value's index is larger than the current value's index. That cannot happen for union values, since they are only added to the DFG after their operands.

This is, however, a pretty subtle argument, so I'd be fine skipping this early-continue optimization. I'll land this PR without it, because that is pretty obviously correct, and if we want to experiment with different approaches to optimizing the loop from there, we can open follow up PRs.

Member

Good point Nick, sorry for muddying the waters there.

Member Author

Okay actually I was wrong, thanks Trevor for asking very pointed questions in private chat :-p

The union's operand values are always defined before the union, but if one of those operand values is a funky one where its operands are out of order, then the operand could still be infinite by the time we get to the union, and then the union's min would drop the infinite. That would be a finite cost that is potentially not in its final form, depending on the cost we still need to compute for the still-infinite operand.

So this "optimization" of early-continuing was not correct! Bullet dodged.

This ae-graphs code is all very subtle, and we should spend some time thinking about what we can do to make things more obviously correct, even if it is just adding additional debug asserts and comments. It shouldn't take 3.5 engineers who are all intimately familiar with Cranelift a full day to diagnose and fix this kind of bug and still introduce subtle flaws in the fix.

Member Author

And for clarity: since we are doing the "full" fixpoint now, even if we "drop" an operand's infinite cost via min in one iteration of the loop, we will consider that operand's value again on the next iteration of the fixpoint loop, and once the fixpoint is reached, we will have the correct costs for everything.
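The difference between the full fixpoint and the early-continue variant can be reproduced on a toy graph. The sketch below is not Cranelift's elaboration code: `Node`, `best_costs`, and `example_graph` are hypothetical, costs are plain `u32` values with `u32::MAX` as infinity, and node indices stand in for value numbers.

```rust
const INF: u32 = u32::MAX;

/// Toy value definitions; operands are indices into the node list.
enum Node {
    Op { cost: u32, args: Vec<usize> },
    Union(usize, usize),
}

/// Iterate until no cost changes. With `skip_finite` set, values whose cost
/// is already finite are skipped (the early-continue "optimization").
fn best_costs(nodes: &[Node], skip_finite: bool) -> Vec<u32> {
    let mut best = vec![INF; nodes.len()];
    let mut changed = true;
    while changed {
        changed = false;
        for (i, node) in nodes.iter().enumerate() {
            if skip_finite && best[i] != INF {
                continue;
            }
            let new = match node {
                // Own cost plus operand costs, saturating to infinity.
                Node::Op { cost, args } => args
                    .iter()
                    .fold(*cost, |acc, &a| acc.saturating_add(best[a])),
                // `min` drops an operand that is still infinite.
                Node::Union(a, b) => best[*a].min(best[*b]),
            };
            if new < best[i] {
                best[i] = new;
                changed = true;
            }
        }
    }
    best
}

/// Node 1 refers forward to node 3, so it is still infinite when the union
/// at node 2 is first visited.
fn example_graph() -> Vec<Node> {
    vec![
        Node::Op { cost: 20, args: vec![] },  // 0
        Node::Op { cost: 1, args: vec![3] },  // 1: out-of-order operand
        Node::Union(0, 1),                    // 2
        Node::Op { cost: 2, args: vec![] },   // 3
        Node::Op { cost: 1, args: vec![2] },  // 4: uses the union
    ]
}
```

On this graph the full fixpoint settles on costs `[20, 3, 3, 2, 4]`, while the early-continue variant freezes the union at 20 and its user at 21.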

Member

Thanks @fitzgen and @elliottt (and @alexcrichton) for taking this on and sorry for not realizing this subtle case originally!

A further optimization (which I can take on when I'm back) that occurred to me today: we could track whether we see any "forward references" (perhaps integrate this into the fixpoint loop itself, though it won't change between iterations), and exit the loop after one iteration if none exist. This is the common case, and it would avoid doing a second (no-changes) pass. This extra cost is totally fine for now IMHO (correctness first!).
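One way the forward-reference idea could look, again on a toy model and not on Cranelift's data structures: `Node`, `has_forward_refs`, and `best_costs` are hypothetical names, and the returned pass count exists only to make the effect visible.

```rust
const INF: u32 = u32::MAX;

/// Toy value definitions; operands are indices into the node list.
enum Node {
    Op { cost: u32, args: Vec<usize> },
    Union(usize, usize),
}

/// True if any node uses an operand that is not defined strictly before it.
fn has_forward_refs(nodes: &[Node]) -> bool {
    nodes.iter().enumerate().any(|(i, node)| match node {
        Node::Op { args, .. } => args.iter().any(|&a| a >= i),
        Node::Union(a, b) => *a >= i || *b >= i,
    })
}

/// Fixpoint loop that stops after one pass when there are no forward
/// references. Returns the costs and the number of passes performed.
fn best_costs(nodes: &[Node]) -> (Vec<u32>, usize) {
    let forward = has_forward_refs(nodes);
    let mut best = vec![INF; nodes.len()];
    let mut passes = 0;
    loop {
        passes += 1;
        let mut changed = false;
        for (i, node) in nodes.iter().enumerate() {
            let new = match node {
                Node::Op { cost, args } => args
                    .iter()
                    .fold(*cost, |acc, &a| acc.saturating_add(best[a])),
                Node::Union(a, b) => best[*a].min(best[*b]),
            };
            if new < best[i] {
                best[i] = new;
                changed = true;
            }
        }
        // Without forward references every operand was final when it was
        // read, so a single pass is enough.
        if !forward || !changed {
            return (best, passes);
        }
    }
}
```

For an in-order graph this returns after a single pass. A graph with a forward reference still needs extra passes, including the final pass that observes no changes.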

I agree the code is pretty subtle; to some degree I think that's inherent to the problem, and it's already pretty comment-dense in many (not all!) areas, but I can also try to add some more top-level documentation on invariants and the like when I'm back. I'd like to try to do some more semi-formal proofs too, similar to MachBuffer's comments, to convince us that we don't have any more issues lurking (and to help understanding).

Member Author

Agreed, and not trying to point fingers or anything, just trying to improve the situation for everyone. I think something like #7856 would help a lot too.

@fitzgen fitzgen enabled auto-merge February 2, 2024 16:01
@fitzgen fitzgen added this pull request to the merge queue Feb 2, 2024
@elliottt elliottt removed this pull request from the merge queue due to a manual request Feb 2, 2024
@fitzgen fitzgen added this pull request to the merge queue Feb 2, 2024
@fitzgen
Member Author

fitzgen commented Feb 2, 2024

(Re-adding to merge queue after misunderstanding regarding #7859 (comment))

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 2, 2024
@fitzgen fitzgen force-pushed the egraph-cost-fix-point branch from 6eced83 to 370fb43 Compare February 5, 2024 17:12
@fitzgen fitzgen enabled auto-merge February 5, 2024 17:13
@fitzgen fitzgen added this pull request to the merge queue Feb 5, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 5, 2024
@alexcrichton
Member

Given that the same riscv64 failure happened twice in a row my guess is that it's probably a deterministic failure rather than a spurious failure. That may mean that a preexisting riscv64 lowering rule is buggy and this is starting to expose that. I'll note though that I haven't attempted to reproduce locally yet.

@alexcrichton
Member

Ah yes I can reproduce locally:

---- wasi_http_hash_all_with_override stdout ----
thread 'wasi_http_hash_all_with_override' panicked at cranelift/codegen/src/egraph/elaborate.rs:296:17:
assertion failed: best[value].0.is_finite()

---- wasi_http_double_echo stdout ----
thread 'wasi_http_double_echo' panicked at cranelift/codegen/src/egraph/elaborate.rs:296:17:
assertion failed: best[value].0.is_finite()

---- wasi_http_hash_all stdout ----
thread 'wasi_http_hash_all' panicked at cranelift/codegen/src/egraph/elaborate.rs:296:17:
assertion failed: best[value].0.is_finite()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- wasi_http_echo stdout ----
thread 'wasi_http_echo' panicked at cranelift/codegen/src/egraph/elaborate.rs:296:17:
assertion failed: best[value].0.is_finite()

I think there was no output on CI because of rayon-rs/rayon#1066. That isn't actually a bug in rayon, just an unfortunate consequence.

@afonso360
Contributor

I ran the fuzzgen-icache fuzzer to try and find a small reproducible example for the riscv bug, but it found a similar error for s390x:

test compile
set opt_level=speed
target s390x

function u1:0() -> f32x4 system_v {
    const0 = 0x00000000000000000000000000000000

block0:
    v27 = vconst.f32x4 const0
    v57 = fma v27, v27, v27  ; v27 = const0, v27 = const0, v27 = const0
    v58 = vconst.i32x4 const0
    v60 = vconst.f32x4 const0
    v61 = bitcast.f32x4 v58  ; v58 = const0
    v28 = bitselect v61, v60, v57  ; v60 = const0
    v62 = fma v28, v28, v28
    v63 = fcmp ne v62, v62
    v65 = vconst.f32x4 const0
    v66 = bitcast.f32x4 v63
    v29 = bitselect v66, v65, v62  ; v65 = const0
    v67 = fma v29, v29, v29
    v68 = fcmp ne v67, v67
    v70 = vconst.f32x4 const0
    v71 = bitcast.f32x4 v68
    v30 = bitselect v71, v70, v67  ; v70 = const0
    v72 = fma v30, v30, v30
    v73 = fcmp ne v72, v72
    v75 = vconst.f32x4 const0
    v76 = bitcast.f32x4 v73
    v31 = bitselect v76, v75, v72  ; v75 = const0
    v77 = fma v31, v31, v31
    v78 = fcmp ne v77, v77
    v80 = vconst.f32x4 const0
    v81 = bitcast.f32x4 v78
    v32 = bitselect v81, v80, v77  ; v80 = const0
    v82 = fma v32, v32, v32
    v83 = fcmp ne v82, v82
    v85 = vconst.f32x4 const0
    v86 = bitcast.f32x4 v83
    v33 = bitselect v86, v85, v82  ; v85 = const0
    v87 = fma v33, v33, v33
    v88 = fcmp ne v87, v87
    v90 = vconst.f32x4 const0
    v91 = bitcast.f32x4 v88
    v34 = bitselect v91, v90, v87  ; v90 = const0
    return v34
}

I'm still going to try to find a smaller one before trying to figure out which rule is causing issues

@fitzgen
Member Author

fitzgen commented Feb 5, 2024

Thanks Afonso!

@afonso360
Contributor

Here's another case that it found, this one for AArch64.

Testcase
test compile
set opt_level=speed
target aarch64

function u1:0(f64x2, f64x2) -> f64x2, f64x2 tail {
    sig0 = (f64x2, f64x2) -> f64x2, f64x2 tail
    fn0 = colocated u2:0 sig0

block0(v0: f64x2, v1: f64x2):
    v2 = iconst.i8 0
    v3 = iconst.i16 0
    v4 = iconst.i32 0
    v5 = iconst.i64 0
    v6 = uextend.i128 v5  ; v5 = 0
    v7 = func_addr.i64 fn0
    return_call_indirect sig0, v7(v1, v1)

block1 cold:
    v62 = f64const 0.0
    v63 = splat.f64x2 v62  ; v62 = 0.0
    v9, v10 = call fn0(v63, v63)
    v11, v12 = call fn0(v10, v10)
    v13, v14 = call fn0(v12, v12)
    v15, v16 = call fn0(v14, v14)
    v17, v18 = call fn0(v16, v16)
    v19, v20 = call fn0(v18, v18)
    v21, v22 = call fn0(v20, v20)
    v23, v24 = call fn0(v22, v22)
    v25, v26 = call fn0(v24, v24)
    v27, v28 = call fn0(v26, v26)
    v29, v30 = call fn0(v28, v28)
    v31, v32 = call fn0(v30, v30)
    v33, v34 = call fn0(v32, v32)
    v35, v36 = call fn0(v34, v34)
    v37, v38 = call fn0(v36, v36)
    v39, v40 = call fn0(v38, v38)
    v41, v42 = call fn0(v40, v40)
    v43, v44 = call fn0(v42, v42)
    v45, v46 = call fn0(v44, v44)
    v47, v48 = call fn0(v46, v46)
    v49, v50 = call fn0(v48, v48)
    return v49, v49
}

This one is interesting to me because almost all of this is dead code, but if we minimize it, it no longer crashes 👀 . The trace log states the following:

 TRACE cranelift_codegen::context              > About to optimize with egraph phase:
function u1:0(f64x2, f64x2) -> f64x2, f64x2 tail {
    sig0 = (f64x2, f64x2) -> f64x2, f64x2 tail
    fn0 = colocated u2:0 sig0

block0(v0: f64x2, v1: f64x2):
    v7 = func_addr.i64 fn0
    return_call_indirect sig0, v7(v1, v1)
}

So it does optimize away the dead code internally, but then still tries to elaborate some of the previously eliminated instructions. That doesn't make sense to me, but I haven't kept up with the inner workings of the egraphs stuff.

I'm not familiar enough with egraphs to be able to debug this, but if you need any help reworking one of the lowering rules let me know!

Contributor

@jameysharp jameysharp left a comment

Looks good to me! Just one optional suggestion.

cranelift/codegen/src/egraph/cost.rs
@fitzgen fitzgen enabled auto-merge February 5, 2024 22:31
@fitzgen fitzgen added this pull request to the merge queue Feb 5, 2024
Merged via the queue into bytecodealliance:main with commit 5b2ae83 Feb 5, 2024
19 checks passed
@fitzgen fitzgen deleted the egraph-cost-fix-point branch February 5, 2024 23:22
fitzgen added a commit to fitzgen/wasmtime that referenced this pull request Feb 6, 2024
…ass (bytecodealliance#7859)

* Cranelift: Use a fixpoint loop to compute the best value for each eclass

Fixes bytecodealliance#7857

* Remove fixpoint loop early-continue optimization

* Add document describing optimization rule invariants

* Make select optimizations use subsume

* Remove invalid debug assert

* Remove now-unused methods

* Add commutative adds to cost tests
fitzgen added a commit that referenced this pull request Feb 6, 2024
…ass (#7859) (#7878)
elliottt pushed a commit to elliottt/wasmtime that referenced this pull request Feb 7, 2024
…ass (bytecodealliance#7859)
elliottt added a commit that referenced this pull request Feb 7, 2024
* Guard recursion in `will_simplify_with_ireduce` (#7882)

Add a test to expose issues with unbounded recursion through `iadd`
during egraph rewrites, and bound the recursion of
`will_simplify_with_ireduce`.

Fixes #7874

Co-authored-by: Nick Fitzgerald <[email protected]>

* Cranelift: Use a fixpoint loop to compute the best value for each eclass (#7859)

* Cranelift: Use a fixpoint loop to compute the best value for each eclass

Fixes #7857

* Remove fixpoint loop early-continue optimization

* Add document describing optimization rule invariants

* Make select optimizations use subsume

* Remove invalid debug assert

* Remove now-unused methods

* Add commutative adds to cost tests

* Add missing subsume uses in egraph rules (#7879)

* Fix a few egraph rules that needed `subsume`

There were a few rules that dropped value references from the LHS
without using subsume. I think they were probably benign as they
produced constant results, but this change is in the spirit of our
revised guidelines for egraph rules.

* Augment egraph rule guideline 2 to talk about constants

* Update release notes

---------

Co-authored-by: Nick Fitzgerald <[email protected]>
elliottt added a commit that referenced this pull request Feb 7, 2024
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 12, 2024
This commit is born out of a fuzz bug on x64 that was discovered recently.
Today, on `main`, and in the 17.0.1 release Wasmtime will panic when compiling
this wasm module for x64:

    (module
      (func (result v128)
        i32.const 0
        i32x4.splat
        f64x2.convert_low_i32x4_u))

panicking with:

    thread '<unnamed>' panicked at /home/alex/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cranelift-codegen-0.104.1/src/machinst/lower.rs:766:21:
    should be implemented in ISLE: inst = `v6 = fcvt_from_uint.f64x2 v13  ; v13 = const0`, type = `Some(types::F64X2)`
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Bisection points to the "cause" of this regression as bytecodealliance#7859 which
more-or-less means that this has always been an issue and that PR just
happened to expose the issue. What's happening here is that egraph
optimizations are turning the IR into a form that the x64 backend can't
codegen. Namely there's no general purpose lowering of i64x2 being
converted to f64x2. The Wasm frontend never produces this but the
optimizations internally end up producing this.

Notably here the result of this function is constant and what's
happening is that a convert-of-a-splat is happening. In lieu of adding
the full general lowering to x64 (which is perhaps overdue since this is
the second or third time this panic has been triggered) I've opted to
add constant propagation optimizations for int-to-float conversions.
These are all based on the Rust `as` operator which has the same
semantics as Cranelift. This is enough to fix the issue here for the
time being.
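As a rough illustration of what folding an int-to-float conversion with Rust's `as` operator looks like, here is a minimal sketch. The function names are hypothetical and operate on plain Rust values; they are not the ISLE rules or helper functions added by that commit.

```rust
/// Fold an unsigned 64-bit integer to f64, as `as` does it: values that
/// are not exactly representable round to the nearest f64.
fn fold_uint_to_f64(x: u64) -> f64 {
    x as f64
}

/// Fold a signed 32-bit integer to f64; every i32 is exactly representable.
fn fold_sint_to_f64(x: i32) -> f64 {
    x as f64
}

/// Lane-wise fold modelled on `f64x2.convert_low_i32x4_u`: the two low
/// unsigned 32-bit lanes become two f64 lanes.
fn fold_convert_low_u(lanes: [u32; 4]) -> [f64; 2] {
    [lanes[0] as f64, lanes[1] as f64]
}
```

For the module quoted above, the input is a splat of zero, so the folded result is simply two lanes of `0.0` and no conversion instruction needs to be lowered.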
github-merge-queue bot pushed a commit that referenced this pull request Feb 12, 2024
Successfully merging this pull request may close these issues.

Block order and value number affects whether we get valid CLIF after optimizations
6 participants