
Benchmarks #8

Open
Robbepop opened this issue Feb 20, 2023 · 54 comments

Comments

@Robbepop
Contributor

Hi @yamt ,

it is really cool that you put so many Wasm runtimes on your benchmarks for comparison!
I have a few questions though.
What hardware did you run the benchmarks on? Would be cool if you could write that down somewhere for reproducibility.
Also I saw that wasmi is included in the script but not showing in the README.md.
Did wasmi not work? If it works I'd be interested in the numbers on your machine. :)
Unfortunately running the benchmark script requires quite a setup.

@yamt
Owner

yamt commented Feb 21, 2023

Hi @yamt ,

it is really cool that you put so many Wasm runtimes on your benchmarks for comparison!

thank you.
but honestly speaking i feel i added too many runtimes.
the purpose of the benchmark was to give a rough idea about toywasm performance.
a few runtimes were enough.

I have a few questions though. What hardware did you run the benchmarks on? Would be cool if you could write that down somewhere for reproducibility. Also I saw that wasmi is included in the script but not showing in the README.md. Did wasmi not work? If it works I'd be interested in the numbers on your machine. :) Unfortunately running the benchmark script requires quite a setup.

i ran it on my macbook. (MBP 15-inch 2018)
the latest wasmi works. (at least it completes the benchmark)
it will be included when i happen to run it again.

@Robbepop
Contributor Author

but honestly speaking i feel i added too many runtimes.

Yeah maybe you did. I think this benchmark with so many different runtimes could even be extended into its own project/repo.

Btw. in order to highlight the strength of the approach you took in your toywasm you should definitely also provide benchmarks with other metrics, namely memory consumption and startup times. I think your Wasm runtime, which uses the raw Wasm bytecode with just a little instrumentation, should be fairly good in these two categories and may stand out. There certainly are use cases preferring those metrics over execution speed.

@yamt
Owner

yamt commented Feb 21, 2023

it will be included when i happen to run it again.

done. (actually i just pushed an unpushed result i found in my local repo.)

@yamt
Owner

yamt commented Feb 21, 2023

but honestly speaking i feel i added too many runtimes.

Yeah maybe you did. I think this benchmark with so many different runtimes could even be extended into its own project/repo.

maybe. i myself am not interested in maintaining such a thing right now though.

Btw. in order to highlight the strength of the approach you took in your toywasm you should definitely also provide benchmarks with other metrics, namely memory consumption and startup times. I think your Wasm runtime, which uses the raw Wasm bytecode with just a little instrumentation, should be fairly good in these two categories and may stand out. There certainly are use cases preferring those metrics over execution speed.

thank you for your insights. i agree.

@yamt
Owner

yamt commented May 7, 2023

Btw. in order to highlight the strength of the approach you took in your toywasm you should definitely also provide benchmarks with other metrics, namely memory consumption and startup times. I think your Wasm runtime, which uses the raw Wasm bytecode with just a little instrumentation, should be fairly good in these two categories and may stand out. There certainly are use cases preferring those metrics over execution speed.

thank you for your insights. i agree.

i added a benchmark about startup time and memory consumption.
https://github.com/yamt/toywasm/blob/master/benchmark/startup.md

@Robbepop
Contributor Author

Robbepop commented May 7, 2023

Thank you for those benchmarks!
They are really insightful. Seems like there is some room for improvement for wasmi.
Your toywasm looks very strong! :)

@Robbepop
Contributor Author

@yamt thanks to your memory consumption benchmarks I took a better look at wasmi's internal bytecode and was able to make it more space efficient while keeping the original performance or even slightly boosting it for version 0.30.0. This also improved translation time (startup time). Thanks again for that initial spark! :)

@yamt
Owner

yamt commented May 29, 2023

@yamt thanks to your memory consumption benchmarks I took a better look at wasmi's internal bytecode and was able to make it more space efficient while keeping the original performance or even slightly boosting it for version 0.30.0. This also improved translation time (startup time). Thanks again for that initial spark! :)

it's good to hear! thank you for letting me know.

noted for the next run of the benchmark.
(probably not too distant future, as i want to see how toywasm regressed by the recent simd addition.)

@Robbepop
Contributor Author

Robbepop commented May 29, 2023

noted for the next run of the benchmark.

looking forward :)

(probably not too distant future, as i want to see how toywasm regressed by the recent simd addition.)

Oh wow, that's super interesting news since toywasm is also an interpreter just like wasmi, and I always decided against implementing SIMD in wasmi since I felt it would just slow down the entire interpreter for not too many actual gains. However, proper benchmarks to verify or disprove this "feeling" are always the best! :)

Having a Wasm runtime that is up to date with the standardised proposals is obviously very nice.

@yamt
Owner

yamt commented May 29, 2023

noted for the next run of the benchmark.

looking forward :)

(probably not too distant future, as i want to see how toywasm regressed by the recent simd addition.)

Oh wow, that's super interesting news since toywasm is also an interpreter just like wasmi, and I always decided against implementing SIMD in wasmi since I felt it would just slow down the entire interpreter for not too many actual gains. However, proper benchmarks to verify or disprove this "feeling" are always the best! :)

Having a Wasm runtime that is up to date with the standardised proposals is obviously very nice.

i have a similar feeling. but i added it mainly for completeness.

having said that, these "do more work per instruction" style instructions can be rather friendly to
interpreters like toywasm because they can hide instruction-fetch-parse overhead.
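The amortization described here can be sketched with a toy switch-dispatch loop. All opcodes and structures below are invented for illustration and are not toywasm's actual interpreter; the point is only that each dispatch pays a fixed fetch/decode cost, so an opcode doing four lane additions spreads that cost over four times the work.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical opcodes for illustration; not toywasm's actual encoding. */
enum { OP_ADD_I32, OP_ADD_I32X4, OP_END };

/* A toy switch-dispatch loop over a stack of 32-bit cells. Each
 * iteration pays the same fetch/decode cost, so an opcode performing
 * four lane adds amortizes that cost over four times as much work. */
static void run(const uint8_t *pc, uint32_t *stack, size_t *sp)
{
    for (;;) {
        switch (*pc++) {        /* fetch + decode: fixed per-dispatch cost */
        case OP_ADD_I32:        /* one add per dispatch */
            stack[*sp - 2] += stack[*sp - 1];
            *sp -= 1;
            break;
        case OP_ADD_I32X4:      /* four lane adds per dispatch */
            for (int lane = 0; lane < 4; lane++)
                stack[*sp - 8 + lane] += stack[*sp - 4 + lane];
            *sp -= 4;
            break;
        case OP_END:
            return;
        }
    }
}
```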

@Robbepop
Contributor Author

having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because they can hide instruction-fetch-parse overhead.

At Parity we even found that the Wasm generated from Rust source is slightly smaller when enabling Wasm SIMD, which is obviously great since translation time can be significant for some practical use cases and it usually scales linearly with Wasm blob size.

I assume you used 64-bit cells for the value stack before the introduction of SIMD to toywasm. If that's the case, have you simply increased the cell size to 128 bits to fit the 128-bit vectors from SIMD, or do those 128-bit vectors now occupy 2 cells? The latter design is probably more complex but would probably result in fewer regressions of non-SIMD instruction execution and memory consumption overall.

@yamt
Owner

yamt commented May 30, 2023

having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because they can hide instruction-fetch-parse overhead.

At Parity we even found that the Wasm generated from Rust source is slightly smaller when enabling Wasm SIMD, which is obviously great since translation time can be significant for some practical use cases and it usually scales linearly with Wasm blob size.

i guess it uses llvm as a backend?
llvm seems to use simd instructions in some interesting ways.

I assume you used 64-bit cells for the value stack before the introduction of SIMD to toywasm. If that's the case, have you simply increased the cell size to 128 bits to fit the 128-bit vectors from SIMD, or do those 128-bit vectors now occupy 2 cells? The latter design is probably more complex but would probably result in fewer regressions of non-SIMD instruction execution and memory consumption overall.

in toywasm, the value stack cell size depends on build-time configurations.

before simd, there were two configurations:

  • 32-bit cell, i64 uses two cells. (default)
  • 64-bit cell, any value uses a single cell. (faster in many cases. i suppose wasmi UntypedValue works similarly to this.)

after simd, there are three:

  • 32-bit cell, i64 uses two cells, v128 uses four cells. (still default)
  • 64-bit cell (when simd is disabled)
  • 128-bit cell (when simd is enabled)
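The default 32-bit cell configuration described above can be illustrated with a small sketch (hypothetical names, not toywasm's actual code), showing how an i64 spans two cells and is reassembled on pop:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the default 32-bit cell configuration: an i32 occupies
 * one cell, an i64 two cells, a v128 four cells. Illustration only. */
typedef uint32_t cell_t;

static void push_i64(cell_t *stack, size_t *sp, uint64_t v)
{
    stack[(*sp)++] = (cell_t)v;          /* low half */
    stack[(*sp)++] = (cell_t)(v >> 32);  /* high half */
}

static uint64_t pop_i64(const cell_t *stack, size_t *sp)
{
    uint64_t hi = stack[--(*sp)];
    uint64_t lo = stack[--(*sp)];
    return (hi << 32) | lo;              /* reassemble the 64-bit value */
}
```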

@Robbepop
Contributor Author

Robbepop commented May 30, 2023

in toywasm, the value stack cell size depends on build-time configurations.

ah, that's very interesting and perfect for research about what cell size is the best for which use case. :) how much complexity did this add to the interpreter compared to having for example fixed 64-bit cell sizes?

i suppose wasmi UntypedValue works similarly to this.

yes, that's correct.

very interesting approach and looking forward to all the results that you are going to pull off of this. :)

in the past I have been using wasm-coremark to test basic performance of computations for wasmi in comparison with Wasmtime and Wasm3 using https://github.com/Robbepop/wasm-coremark-rs/tree/rf-update-vms-v2.

however, this is a rather artificial benchmark and probably less ideal than your ffmpeg and spidermonkey test cases. what I found out is that the runtime performance of all 3 Wasm runtimes was extremely dependent on the underlying hardware. For example, wasmi performs quite okay on Intel CPUs and super poorly on M1/M2, whereas both wasmi and Wasm3 perform pretty badly on AMD CPUs. And Wasmtime performs way better on AMD CPUs than on Intel or M1/M2.

Given these severe differences I think it is kinda important to tag your own benchmark results for reproducibility with the hardware (mainly CPU) and OS used.

@yamt
Owner

yamt commented May 30, 2023

in toywasm, the value stack cell size depends on build-time configurations.

ah, that's very interesting and perfect for research about what cell size is the best for which use case. :) how much complexity did this add to the interpreter compared to having for example fixed 64-bit cell sizes?

originally toywasm was using fixed 64-bit cells.
later i added TOYWASM_USE_SMALL_CELLS option to use small (32-bit) cells.
you can see the code blocks ifdef'ed on this macro to see how much complexity is involved.

besides that, i introduced TOYWASM_USE_RESULTTYPE_CELLIDX and TOYWASM_USE_LOCALTYPE_CELLIDX to speed up by-index value stack accesses like local.get.
(when using small cells, local.get imm somehow needs to be converted to the corresponding location of the cell(s).)
you can consider them as a part of TOYWASM_USE_SMALL_CELLS as well.

i suppose that it can be simpler for "translating" interpreters like wasmi because
you can embed much of this pre-calculated information into the translated internal opcodes themselves.
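The by-index access cost and the precomputed-table fix can be sketched as follows. The function names here are made up; this only illustrates the idea behind the CELLIDX options as described above: with variable-sized cells, the cell offset of local N is the sum of the cell counts of locals 0..N-1, so computing it on every local.get is O(N), while a table built once at translation time makes it O(1).

```c
#include <stdint.h>
#include <stddef.h>

/* ncells[] holds the cell count per local, e.g. i32 -> 1, i64 -> 2,
 * v128 -> 4 (in the 32-bit cell configuration). */

/* Without a table: walk all preceding locals on every access. */
static uint32_t naive_cell_offset(const uint8_t *ncells, uint32_t localidx)
{
    uint32_t off = 0;
    for (uint32_t i = 0; i < localidx; i++)
        off += ncells[i];       /* O(localidx) work per local.get */
    return off;
}

/* With a table: precompute all offsets once at translation time ... */
static void build_cellidx(const uint8_t *ncells, uint32_t nlocals,
                          uint32_t *cellidx)
{
    uint32_t off = 0;
    for (uint32_t i = 0; i < nlocals; i++) {
        cellidx[i] = off;
        off += ncells[i];
    }
}
/* ... so local.get becomes a single lookup: cellidx[localidx]. */
```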

i suppose wasmi UntypedValue works similarly to this.

yes, that's correct.

very interesting approach and looking forward to all the results that you are going to pull off of this. :)

in the past I have been using wasm-coremark to test basic performance of computations for wasmi in comparison with Wasmtime and Wasm3 using https://github.com/Robbepop/wasm-coremark-rs/tree/rf-update-vms-v2.

however, this is a rather artificial benchmark and probably less ideal than your ffmpeg and spidermonkey test cases. what I found out is that the runtime performance of all 3 Wasm runtimes was extremely dependent on the underlying hardware. For example, wasmi performs quite okay on Intel CPUs and super poorly on M1/M2, whereas both wasmi and Wasm3 perform pretty badly on AMD CPUs. And Wasmtime performs way better on AMD CPUs than on Intel or M1/M2.

Given these severe differences I think it is kinda important to tag your own benchmark results for reproducibility with the hardware (mainly CPU) and OS used.

interesting. i haven't thought about cpu differences much.
all my benchmarks are with:

ProductName:    macOS
ProductVersion: 12.6.5
BuildVersion:   21G531
MacBook Pro (15-inch, 2018)
2.2 GHz 6-Core Intel Core i7

yamt added a commit that referenced this issue May 30, 2023
yamt added a commit that referenced this issue May 30, 2023
@Robbepop
Contributor Author

What just crossed my mind about cell sizes and SIMD support is the following: maybe it is practical and efficient to have 2 different stacks, e.g. one stack with 64-bit cells and another stack with 128-bit cells. Both stacks are used simultaneously (push, pop) but exclusively for non-SIMD and SIMD instructions respectively. Due to the Wasm validation phase and type checks, it should probably be possible to support SIMD without touching the already existing stack, leaving the 128-bit cell stack completely unused when no SIMD instructions are executed (and thus not affecting non-SIMD code).

Maybe I am overlooking something here. Although if this was efficient I assume it might introduce less complexity than different cell sizes or having SIMD instructions use 2 cells instead of 1. I am way into speculation here. Implementation/time needed to confirm haha.

@yamt
Owner

yamt commented May 31, 2023

What just crossed my mind about cell sizes and SIMD support is the following: maybe it is practical and efficient to have 2 different stacks, e.g. one stack with 64-bit cells and another stack with 128-bit cells. Both stacks are used simultaneously (push, pop) but exclusively for non-SIMD and SIMD instructions respectively. Due to the Wasm validation phase and type checks, it should probably be possible to support SIMD without touching the already existing stack, leaving the 128-bit cell stack completely unused when no SIMD instructions are executed (and thus not affecting non-SIMD code).

Maybe I am overlooking something here. Although if this was efficient I assume it might introduce less complexity than different cell sizes or having SIMD instructions use 2 cells instead of 1. I am way into speculation here. Implementation/time needed to confirm haha.

it's an interesting idea.
random thoughts:

  • i guess you can even separate stacks for 32-bit and 64-bit.
  • 128-bit alignment for v128 values is a nice property for at least certain cpus.
  • function parameters/results might be a bit tricky to implement with this approach.
  • the height of each stack needs to be tracked for possible unwinding. (eg. br)
  • i'm not sure if it's less or more complex as a whole.
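The unwinding point in the list above can be sketched like this (all names hypothetical): each label records the height of both stacks, so a br can drop values from both in lock step.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the two-stack idea: scalar values on a 64-bit stack,
 * v128 values on a separate 128-bit stack. Illustration only. */
typedef struct { uint64_t lo, hi; } v128_t;

struct exec_stacks {
    uint64_t scalars[256];
    size_t nscalars;
    v128_t vectors[64];
    size_t nvectors;
};

struct label {
    size_t scalar_height;   /* saved heights for unwinding */
    size_t vector_height;
};

/* On entering a block, record the height of both stacks. */
static void push_label(const struct exec_stacks *s, struct label *l)
{
    l->scalar_height = s->nscalars;
    l->vector_height = s->nvectors;
}

/* On br (branch/unwind), both heights must be restored together. */
static void unwind_to(struct exec_stacks *s, const struct label *l)
{
    s->nscalars = l->scalar_height;
    s->nvectors = l->vector_height;
}
```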

@yamt
Owner

yamt commented May 31, 2023

noted for the next run of the benchmark.

looking forward :)

i reran the benchmarks:
https://github.com/yamt/toywasm/blob/master/benchmark/ffmpeg.md
https://github.com/yamt/toywasm/blob/master/benchmark/startup.md

wasmi has been improved a lot since the last time. (0.27.0)
good work!

@Robbepop
Contributor Author

Awesome work @yamt and thanks a ton for those benchmarks! 🚀

I am especially fond of the fact that there is nearly no difference between toywasm (SIMD) and toywasm (no SIMD) so maybe fixed 128-bit cells are the way to go and not at all super terrible? 🤔 Obviously they consume a bit more memory but even that difference isn't all too significant imo.

Looks like a very successful research conclusion to me for your SIMD implementation in toywasm! :)

Concerning wasmi performance: The optimizations I have implemented lately cannot explain this extreme difference, so I rather think that maybe the wasmi 0.27.0 version got released without proper optimizations enabled. Unfortunately there is a bug in Cargo (the build tool) that requires manual handling to make sure those optimizations are applied, and sometimes I forget about this when releasing. 🙈 But still the startup time improvement is quite nice. :)

@yamt
Owner

yamt commented May 31, 2023

Awesome work @yamt and thanks a ton for those benchmarks! 🚀

I am especially fond of the fact that there is nearly no difference between toywasm (SIMD) and toywasm (no SIMD) so maybe fixed 128-bit cells are the way to go and not at all super terrible? 🤔 Obviously they consume a bit more memory but even that difference isn't all too significant imo.

Looks like a very successful research conclusion to me for your SIMD implementation in toywasm! :)

i guess ffmpeg.wasm (or probably any C program) is linear-memory intensive rather than value-stack intensive.

Concerning wasmi performance: The optimizations I have implemented lately cannot explain this extreme difference, so I rather think that maybe the wasmi 0.27.0 version got released without proper optimizations enabled. Unfortunately there is a bug in Cargo (the build tool) that requires manual handling to make sure those optimizations are applied, and sometimes I forget about this when releasing. 🙈 But still the startup time improvement is quite nice. :)

hmm. wrt 0.27.0, it might be an error on my side.
i manually built both versions of wasmi locally as:
https://github.com/yamt/toywasm/blob/master/benchmark/notes.md#wasmi

@Robbepop
Contributor Author

Robbepop commented May 31, 2023

Ah, I thought you were simply installing wasmi via cargo install wasmi_cli.
The Cargo bug that not all optimizations are properly applied mostly affects certain binaries installed via cargo install. For wasmi version 0.30.0 I made sure the optimizations are applied when installing via cargo install. :)

wasmi depends heavily on the lto="fat" and codegen-units=1 optimization configs. Without them, wasmi's execution time easily increases by 100% in some cases (or even more in others).
I just checked the wasmi Cargo.toml and it seems that if you are building wasmi like this then these optimizations are almost certainly not applied. I should probably change the default --release build here, but I was not expecting people to build wasmi from source. My fault. The default is without those optimizations enabled since they significantly increase the build time of wasmi, so I usually only enable them for benchmarks or releases.

@yamt
Owner

yamt commented May 31, 2023

Ah, I thought you were simply installing wasmi via cargo install wasmi_cli. The Cargo bug that not all optimizations are properly applied mostly affects certain binaries installed via cargo install. For wasmi version 0.30.0 I made sure the optimizations are applied when installing via cargo install. :)

things like cargo install, go install, etc. scare me a bit. :-)

wasmi depends heavily on the lto="fat" and codegen-units=1 optimization configs. Without them, wasmi's execution time easily increases by 100% in some cases (or even more in others). I just checked the wasmi Cargo.toml and it seems that if you are building wasmi like this then these optimizations are almost certainly not applied. I should probably change the default --release build here, but I was not expecting people to build wasmi from source. My fault. The default is without those optimizations enabled since they significantly increase the build time of wasmi, so I usually only enable them for benchmarks or releases.

it reminded me that, while i wanted to use lto=full for toywasm, cmake insisted on using lto=thin.

# Note: -flto=full seems to produce considerably faster code
# than flto=thin for us. However, cmake INTERPROCEDURAL_OPTIMIZATION
# always use -flto=thin for clang.
# cf. https://gitlab.kitware.com/cmake/cmake/-/issues/16808
set_property(TARGET toywasm-lib-core PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)

in the meantime, i added a warning about this: 9ee47bf

@Robbepop
Contributor Author

Robbepop commented Jun 1, 2023

If you want to benchmark wasmi with full optimizations and build it from sources you can build it via:

cargo build --profile bench

So instead of --release you do --profile bench. :)
Made a pull request to your benchmark docs so it is documented: #39

Ever thought of writing a blog post with all your benchmarks about Wasm runtimes? :D Seems like you could pull off quite a bit of information there.

@yamt
Owner

yamt commented Jun 1, 2023

If you want to benchmark wasmi with full optimizations and build it from sources you can build it via:

cargo build --profile bench

So instead of --release you do --profile bench. :) Made a pull request to your benchmark docs so it is documented: #39

thank you. i commented in the PR.

Ever thought of writing a blog post with all your benchmarks about Wasm runtimes? :D Seems like you could pull off quite a bit of information there.

i have no interest in blogging right now.

@yamt
Owner

yamt commented Jul 23, 2023

If you want to benchmark wasmi with full optimizations and build it from sources you can build it via:

cargo build --profile bench

i reran with it and pushed the results.

i also updated the procedure for wasmer. (it was not clearing cache as i intended.)

@Robbepop
Contributor Author

Robbepop commented Jul 23, 2023

Hi @yamt , thanks a lot for updating me about this!

The new wasmi results look much more like what I would expect, being roughly twice as slow as Wasm3.

It is interesting that your toywasm has similar startup performance to WAMR classic interpreter but performs way better than it at runtime.
What are your plans forward with toywasm?

Btw.: I am currently working on a new engine for wasmi making it more similar to how Wasm3 and the WAMR fast-interpreter work internally. Looking forward to how it performs when it is done in a few weeks/months. :)

@yamt
Owner

yamt commented Jul 24, 2023

Hi @yamt , thanks a lot for updating me about this!

The new wasmi results look much more like what I would expect, being roughly twice as slow as Wasm3.

good.

It is interesting that your toywasm has similar startup performance to WAMR classic interpreter but performs way better than it at runtime. What are your plans forward with toywasm?

actually, it seems that whether toywasm or iwasm classic is faster depends on the specific app being run.
i haven't investigated further. probably i should, sooner or later.

Btw.: I am currently working on a new engine for wasmi making it more similar to how Wasm3 and the WAMR fast-interpreter work internally. Looking forward to how it performs when it is done in a few weeks/months. :)

interesting.
is it this PR? wasmi-labs/wasmi#729

@Robbepop
Contributor Author

Robbepop commented Jul 24, 2023

actually, it seems that whether toywasm or iwasm classic is faster depends on the specific app being run.

No runtime can be efficient for all the use cases - at least that's what I learned from working on wasmi. The only hope for a general purpose runtime is to fix all the potential weak spots so that it at least isn't terrible with any potential use case.

Due to your awesome benchmarks I see a huge potential in lazy Wasm compilation for fixing one such weak spot for startup time, since a few benchmarked runtimes benefit quite a bit from their lazy compilation and/or Wasm validation.

is it this PR? wasmi-labs/wasmi#729

Yes it is. Although still super WIP at this point. Everything is subject to change. Still trying to figure out best designs for tackling certain problems/challenges. Trade-offs here and there, I just hope all the work will be worth it in the end. I was trying to read code from Wasm3 and WAMR fast interpreter for inspiration for certain problems but in all honesty I find both rather hard to read. Not used to reading high-density C code.

@yamt
Owner

yamt commented Jul 25, 2023

actually, it seems that whether toywasm or iwasm classic is faster depends on the specific app being run.

No runtime can be efficient for all the use cases - at least that's what I learned from working on wasmi. The only hope for a general purpose runtime is to fix all the potential weak spots so that it at least isn't terrible with any potential use case.

sure.

being considerably slower than a similar engine (in my case iwasm classic) for a specific app is likely a sign of weak spots, or even a bug.

Due to your awesome benchmarks I see a huge potential in lazy Wasm compilation for fixing one such weak spot for startup time, since a few benchmarked runtimes benefit quite a bit from their lazy compilation and/or Wasm validation.

i wonder how common sparsely-used wasm modules like ffmpeg.wasm are.

is it this PR? paritytech/wasmi#729

Yes it is. Although still super WIP at this point. Everything is subject to change. Still trying to figure out best designs for tackling certain problems/challenges. Trade-offs here and there, I just hope all the work will be worth it in the end. I was trying to read code from Wasm3 and WAMR fast interpreter for inspiration for certain problems but in all honesty I find both rather hard to read. Not used to reading high-density C code.

a lot of interesting ideas in the PR. i'm looking forward to seeing how it performs.

@yamt
Owner

yamt commented Jul 30, 2023

thank you for the explanation.

wrt jit-bombing, i have a few crafted wasm modules in https://github.com/yamt/toywasm/tree/master/wat.
from my experience, wamr is very weak in that regard.

@Robbepop
Contributor Author

Robbepop commented Jul 31, 2023

wrt jit-bombing, i have a few crafted wasm modules in https://github.com/yamt/toywasm/tree/master/wat.

that's super cool! 🚀
I wish the Wasm standard test suite would include some JIT bomb tests.

wamr is very weak in that regard

did you also test Wasmer singlepass? They claim linear-time Wasm -> x86 machine code generation, and while working on Wasm -> wasmi bytecode translation in the register-machine approach of the new PR, I found that to be hard to realize with the Wasm multi-value proposal in certain cases. I hope I will find a solution to the cases I have found so far.

@yamt
Owner

yamt commented Aug 1, 2023

did you also test Wasmer singlepass? They claim linear-time Wasm -> x86 machine code generation, and while working on Wasm -> wasmi bytecode translation in the register-machine approach of the new PR, I found that to be hard to realize with the Wasm multi-value proposal in certain cases. I hope I will find a solution to the cases I have found so far.

no.
i have only used homebrew-installed binary of wasmer, which seems to have only cranelift enabled.

iirc wasmer allows only a small number of function results. (1000?)
is linearity important even for such a small number?

@Robbepop
Contributor Author

Robbepop commented Aug 2, 2023

is linearity important even for such a small number?

It depends on how bad it is:

  • Is it O(n*log(n))? With n=1000 it would be roughly 20k iterations so probably not great, not terrible.
  • Is it O(n^2) or worse? Then n=1000 means 1M iterations which is probably pretty bad.

As a rule of thumb in wasmi we try to avoid anything worse than O(n*log(n)).

@yamt
Owner

yamt commented Aug 2, 2023

is linearity important even for such a small number?

It depends on how bad it is:

  • Is it O(n*log(n))? With n=1000 it would be roughly 20k iterations so probably not great, not terrible.
  • Is it O(n^2) or worse? Then n=1000 means 1M iterations which is probably pretty bad.

As a rule of thumb in wasmi we try to avoid anything worse than O(n*log(n)).

ok. it makes sense.

toywasm at this point prefers simplicity and uses O(n^2) logic in a few places. (eg. import/export handling)

@Robbepop
Contributor Author

Robbepop commented Aug 10, 2023

toywasm at this point prefers simplicity and uses O(n^2) logic in a few places. (eg. import/export handling)

While implementing support for the Wasm multi-value proposal I found certain cases where I could not come up with any solution that is at least O(n*log(n)). Just noticed that Wasmer singlepass with its linear guarantees actually disables Wasm multi-value support. So maybe they also did not find a linear solution to certain cases: https://docs.rs/wasmer-compiler-singlepass/latest/src/wasmer_compiler_singlepass/config.rs.html#43

Instead of simply not supporting the Wasm multi-value proposal in wasmi, my idea is to implement something that I call compilation fuel: you can set it to some value linear in the size of your Wasm blob, for example, and if the Wasm translation phase runs out of this translation fuel, the translation is stopped and an error is returned. This way you can still guarantee linear-time behavior to users while having non-linear algorithms for certain cases, especially if those cases are considered edge cases or impractical.
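The compilation-fuel idea could look roughly like this. This is a sketch with invented names, not wasmi's actual implementation: the budget is set linear in the module size, each unit of translation work draws it down, and exhaustion aborts translation with an error.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical translator state carrying a fuel budget. */
struct translator {
    uint64_t fuel;
};

enum trans_result { TRANS_OK, TRANS_OUT_OF_FUEL };

/* Budget linear in the Wasm blob size; the factor is tunable. */
static void translator_init(struct translator *t, size_t wasm_size)
{
    t->fuel = (uint64_t)wasm_size * 16;
}

/* Called for each unit of translation work. If the budget runs out,
 * translation stops cleanly instead of running super-linearly. */
static enum trans_result consume_fuel(struct translator *t, uint64_t cost)
{
    if (t->fuel < cost)
        return TRANS_OUT_OF_FUEL;
    t->fuel -= cost;
    return TRANS_OK;
}
```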

@yamt
Owner

yamt commented Aug 14, 2023

toywasm at this point prefers simplicity and uses O(n^2) logic in a few places. (eg. import/export handling)

While implementing support for the Wasm multi-value proposal I found certain cases where I could not come up with any solution that is at least O(n*log(n)). Just noticed that Wasmer singlepass with its linear guarantees actually disables Wasm multi-value support. So maybe they also did not find a linear solution to certain cases: https://docs.rs/wasmer-compiler-singlepass/latest/src/wasmer_compiler_singlepass/config.rs.html#43

i don't understand what's difficult with multi-value.
isn't it almost the same as function parameters?
but if you and the wasmer dev independently found it difficult, i guess it's difficult. :-)

Instead of simply not supporting the Wasm multi-value proposal in wasmi, my idea is to implement something that I call compilation fuel: you can set it to some value linear in the size of your Wasm blob, for example, and if the Wasm translation phase runs out of this translation fuel, the translation is stopped and an error is returned. This way you can still guarantee linear-time behavior to users while having non-linear algorithms for certain cases, especially if those cases are considered edge cases or impractical.

interesting idea.
at least it sounds nicer than having small hard limits.

@Robbepop
Contributor Author

Robbepop commented Aug 22, 2023

i don't understand what's difficult with multi-value.
isn't it almost the same as function parameters?

Let me provide you with an example Wasm function that would be problematic, to the best of my knowledge:

(func (param i32) (result i32 i32)
    (i32.const 10)
    (i32.const 20)
    (br_if 0 (local.get 0))
    (br_if 0 (local.get 0))
)

Each of the br_if is a conditional return instruction. Both need to return 2 i32 values. During translation we need to walk the live value stack to see which of the values or registers we need to return. In a stack-machine model we don't have to do this since we already rely on Wasm validation to check for valid stack heights, but in a register-machine model we have to do a lightweight register allocation for things like these.

Now imagine not having just 2 i32 return values but 10_000 and not just 2 br_if but 10_000, then you end up with a quadratic behavior with 10k^2 iterations in total.

There are also examples that involve block parameters/results and not just function results, which are also allowed by the multi-value Wasm proposal. So it is even more complex than the above example demonstrates.

For the simple above example we could probably come up with an optimization but it quickly gets more and more complex so that we cannot really fix this attack vector by implementing more and more specialized optimization variants as you probably can imagine.
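The quadratic cost described above can be made concrete with a toy model (purely illustrative; all names are invented): each br_if emits one copy per live result value, so n branches times m results yields O(n*m) translation work.

```c
#include <stddef.h>

/* Translating one br_if: emit one "copy result i to its return
 * location" instruction per result value, i.e. O(nresults) work. */
static size_t translate_br_if(size_t nresults, size_t *emitted)
{
    for (size_t i = 0; i < nresults; i++)
        (*emitted)++;
    return nresults;
}

/* Translating a function with nbranches br_if instructions: the
 * emitted-copy count grows as nbranches * nresults, which is the
 * quadratic blow-up in the example above (10_000 x 10_000). */
static size_t translate_function(size_t nbranches, size_t nresults)
{
    size_t emitted = 0;
    for (size_t b = 0; b < nbranches; b++)
        translate_br_if(nresults, &emitted);
    return emitted;
}
```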

@yamt
Copy link
Owner

yamt commented Aug 23, 2023

i don't understand what's difficult with multi-value.
isn't it almost same as function parameters?

Let me provide you with an example Wasm function that would be problematic according to my personal knowledge:

(func (param i32) (result i32 i32)
    (i32.const 10)
    (i32.const 20)
    (br_if 0 (local.get 0))
    (br_if 0 (local.get 0))
)

Each of the br_if instructions is a conditional return. Both need to return 2 i32 values. During translation we need to walk the live value stack to see which of the values or registers we need to return. In a stack-machine model we don't have to do this since we already rely on Wasm validation to check for valid stack heights, but in a register-machine model we have to do a lightweight register allocation for things like these.

Now imagine having not just 2 i32 return values but 10_000, and not just 2 br_if but 10_000; then you end up with quadratic behavior, 10k^2 iterations in total.

There are also examples that involve block parameters/results and not just function results, which are also allowed by the multi-value Wasm proposal. So it is even more complex than the above example demonstrates.

For the simple example above we could probably come up with an optimization, but it quickly gets more and more complex, so we cannot really fix this attack vector by implementing more and more specialized optimization variants, as you can probably imagine.

thank you for explanation.

when you say quadratic, do you mean O(n*m) where n = number of br_if and m = number of values in the return type?

if so, isn't the wasm validation logic itself already quadratic, regardless of register allocations?
for each of 10_000 br_if, the validation logic needs to check 10_000 types on the top of the stack.
at least that's how the validation logic in toywasm would work.

@Robbepop
Copy link
Contributor Author

if so, isn't the wasm validation logic itself already quadratic, regardless of register allocations?
for each of 10_000 br_if, the validation logic needs to check 10_000 types on the top of the stack.
at least that's how the validation logic in toywasm would work.

Ah yeah, you are totally right! So it is even worse than I had hoped. I didn't have the validation logic in mind since wasmi uses the wasmparser crate for parsing and validation, which is hosted by the Bytecode Alliance. So maybe even the compilation fuel would not be ideal, since compilation happens after validation for each bytecode, so there is a chance of stepping over the fuel limit during validation in certain cases. Maybe adding validation fuel to wasmparser could help, but that's probably overkill.

@Robbepop
Copy link
Contributor Author

Hey @yamt , have you ever taken note of Ben Titzer's Wizard Wasm runtime?

From what I know it uses a similar approach to toywasm, interpreting Wasm bytecode directly without an intermediate compilation or rewriting step.

@yamt
Copy link
Owner

yamt commented Sep 12, 2023

Hey @yamt , have you ever taken note of Ben Titzer's Wizard Wasm runtime?

* GitHub: https://github.com/titzer/wizard-engine

* Youtube talk at WasmCon: https://www.youtube.com/watch?v=TFt6ZjieSvQ&t=9933s

From what I know it uses a similar approach to toywasm, interpreting Wasm bytecode directly without an intermediate compilation or rewriting step.

i have heard of the runtime. but i didn't know anything beyond it was written in an exotic language.
thank you.

the slides seem to suggest their interpreter performance is comparable to wasm3.
(https://youtu.be/TFt6ZjieSvQ?t=11470)
i'm interested in how it's achieved.

unfortunately their wasi support is too incomplete to run the benchmarks i usually use though.

@Robbepop
Copy link
Contributor Author

Robbepop commented Sep 12, 2023

unfortunately their wasi support is too incomplete to run the benchmarks i usually use though.

ah that is very unfortunate!

the slides seem to suggest their interpreter performance is comparable to wasm3.

I think Wasm3 is still a bit faster, but yes, performance seems to be at least comparable. From what I know it was written in raw assembly to achieve this kind of performance. So it is not super portable.

@yamt
Copy link
Owner

yamt commented Sep 14, 2023

the slides seem to suggest their interpreter performance is comparable to wasm3.

I think Wasm3 is still a bit faster, but yes, performance seems to be at least comparable. From what I know it was written in raw assembly to achieve this kind of performance. So it is not super portable.

wow.

@Robbepop
Copy link
Contributor Author

I just released v0.32.0-beta.0 - the beta version of the upcoming Wasmi v0.32.0 release.

This adds the register-machine bytecode based engine executor. Benchmarks so far have shown that it compiles roughly 30% slower and executes roughly 80-100% faster, reaching more or less Wasm3 performance in some instances. However, there are some performance issues on certain machine architectures that need to be fixed before the stable release.

This version also adds lazy function compilation which combined with unchecked Wasm validation (via Module::new_unchecked) speeds up startup times by up to 27 times.

Given that toywasm experiments with faster startup times I wonder if lazy function translation could be interesting for it as well. At least for Wasmi (and Wasm3) it turned out to be extremely successful. Idea: Doing nothing is still faster than doing minimal work. :)
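For what it's worth, the core of lazy function translation can be sketched in a few lines. The names here (`FuncBody`, `ensure_compiled`, `CompiledFunc`) are made up for illustration and are not wasmi's internals: keep the raw function body bytes and translate only on the first call, so functions that are never called are never compiled.

```rust
// Hypothetical sketch of lazy (compile-on-first-call) translation.

struct CompiledFunc {
    // engine-internal instructions would live here; elided
    instr_count: usize,
}

enum FuncBody {
    Uncompiled(Vec<u8>), // raw bytes from the module's code section
    Compiled(CompiledFunc),
}

impl FuncBody {
    /// Translate on first use; subsequent calls hit the compiled variant.
    fn ensure_compiled(&mut self) -> &CompiledFunc {
        if let FuncBody::Uncompiled(bytes) = self {
            // stand-in for the real translator
            let n = bytes.len();
            *self = FuncBody::Compiled(CompiledFunc { instr_count: n });
        }
        match self {
            FuncBody::Compiled(func) => func,
            FuncBody::Uncompiled(_) => unreachable!(),
        }
    }
}

fn main() {
    // three bytes of (fake) function body
    let mut body = FuncBody::Uncompiled(vec![0x20, 0x00, 0x0b]);
    assert_eq!(body.ensure_compiled().instr_count, 3);
    assert_eq!(body.ensure_compiled().instr_count, 3); // no retranslation
}
```

A mutable-on-first-call design like this is exactly what conflicts with keeping loaded modules completely read-only, which is the trade-off discussed below.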

@yamt
Copy link
Owner

yamt commented Dec 18, 2023

I just released v0.32.0-beta.0 - the beta version of the upcoming Wasmi v0.32.0 release.

This adds the register-machine bytecode based engine executor. Benchmarks so far have shown that it compiles roughly 30% slower and executes roughly 80-100% faster, reaching more or less Wasm3 performance in some instances. However, there are some performance issues on certain machine architectures that need to be fixed before the stable release.

This version also adds lazy function compilation which combined with unchecked Wasm validation (via Module::new_unchecked) speeds up startup times by up to 27 times.

sounds great. i will use that version (or later version) when i run the benchmark next time.

Given that toywasm experiments with faster startup times I wonder if lazy function translation could be interesting for it as well. At least for Wasmi (and Wasm3) it turned out to be extremely successful. Idea: Doing nothing is still faster than doing minimal work. :)

i have no plan to do it in toywasm for now because, in the current implementation, loaded modules are completely read-only and i like it.
but i might change my mind in future.

@Robbepop
Copy link
Contributor Author

Robbepop commented Dec 18, 2023

i have no plan to do it in toywasm for now because, in the current implementation, loaded modules are completely read-only and i like it.

I can understand that. Indeed lazy Wasm validation and translation added a lot of complexity to the codebase. Keeping things simple is also a very valuable strength in software.

For the next benchmark runs: I added support for lazy Wasm validation via wasmi_cli --compilation-mode lazy for version v0.32.0-beta.2 which can be installed with cargo install wasmi_cli --version 0.32.0-beta.2. :)

@yamt
Copy link
Owner

yamt commented Dec 20, 2023

For the next benchmark runs: I added support for lazy Wasm validation via wasmi_cli --compilation-mode lazy for version v0.32.0-beta.2 which can be installed with cargo install wasmi_cli --version 0.32.0-beta.2. :)

ok!

yamt added a commit that referenced this issue Feb 6, 2024
yamt added a commit that referenced this issue Feb 6, 2024
@yamt
Copy link
Owner

yamt commented Feb 9, 2024

i have no plan to do it in toywasm for now because, in the current implementation, loaded modules are completely read-only and i like it.

I can understand that. Indeed lazy Wasm validation and translation added a lot of complexity to the codebase. Keeping things simple is also a very valuable strength in software.

For the next benchmark runs: I added support for lazy Wasm validation via wasmi_cli --compilation-mode lazy for version v0.32.0-beta.2 which can be installed with cargo install wasmi_cli --version 0.32.0-beta.2. :)

i tried 0.32.0-beta.5. it didn't work well. wasmi-labs/wasmi#934

yamt added a commit that referenced this issue Mar 11, 2024
yamt added a commit that referenced this issue Mar 11, 2024
@yamt
Copy link
Owner

yamt commented Mar 11, 2024

i have no plan to do it in toywasm for now because, in the current implementation, loaded modules are completely read-only and i like it.

I can understand that. Indeed lazy Wasm validation and translation added a lot of complexity to the codebase. Keeping things simple is also a very valuable strength in software.
For the next benchmark runs: I added support for lazy Wasm validation via wasmi_cli --compilation-mode lazy for version v0.32.0-beta.2 which can be installed with cargo install wasmi_cli --version 0.32.0-beta.2. :)

i tried 0.32.0-beta.5. it didn't work well. wasmi-labs/wasmi#934

i tried v0.32.0-beta.7. it worked. #143

@Robbepop
Copy link
Contributor Author

Robbepop commented Mar 11, 2024

i have no plan to do it in toywasm for now because, in the current implementation, loaded modules are completely read-only and i like it.

I can understand that. Indeed lazy Wasm validation and translation added a lot of complexity to the codebase. Keeping things simple is also a very valuable strength in software.
For the next benchmark runs: I added support for lazy Wasm validation via wasmi_cli --compilation-mode lazy for version v0.32.0-beta.2 which can be installed with cargo install wasmi_cli --version 0.32.0-beta.2. :)

i tried 0.32.0-beta.5. it didn't work well. wasmi-labs/wasmi#934

i tried v0.32.0-beta.7. it worked. #143

Great! I am very happy that it works now. 🥳
Thanks a lot for sharing your benchmarks!

Although the runtime numbers for Wasmi (~31s) look very off, especially when compared to old Wasmi at ~34s. In all practical (non-artificial) benchmarks we have conducted so far between the two engines, we usually found at least a 50-150% performance improvement from old Wasmi to new Wasmi.
However, we also saw one incident with our benchmark runners that experienced extreme inefficiency and thus nearly the same performance with both engines; maybe that's what is happening on your machine, too.
The startup performance numbers look as expected, though.

Unfortunately I cannot tell what is or was causing the extreme inefficiency for our old benchmark runners (which look similar to yours) and whether it is even fixable. We know it was an inefficiency problem (or some sort of bug) because we saw that the new Wasmi executed via Wasmtime was actually faster than when run natively, which does not make sense at all.

edit: Here is a screenshot of someone benchmarking many different execution engines, amongst others also Wasmi (stack), Wasmi (register), and Wasm3, and as you can see, at least in this independent set of benchmarks Wasmi (register) even outperformed Wasm3 in one of the 2 tests by ~10%. Note though that these benchmarks are already quite dated and a lot of improvements have made it into Wasmi (register) since then. So without the miscompilation (or whatever is causing the inefficiency) I would expect Wasmi (register) to at least be on par with Wasm3.
https://forum.polkadot.network/t/announcing-polkavm-a-new-risc-v-based-vm-for-smart-contracts-and-possibly-more/3811/52?u=robinf

I really really hope I can find out what is causing those inefficiencies (or miscompilations) on some of those hardware systems. :S

@Robbepop Robbepop mentioned this issue Mar 11, 2024
@yamt
Copy link
Owner

yamt commented Mar 12, 2024

i have no plan to do it in toywasm for now because, in the current implementation, loaded modules are completely read-only and i like it.

I can understand that. Indeed lazy Wasm validation and translation added a lot of complexity to the codebase. Keeping things simple is also a very valuable strength in software.
For the next benchmark runs: I added support for lazy Wasm validation via wasmi_cli --compilation-mode lazy for version v0.32.0-beta.2 which can be installed with cargo install wasmi_cli --version 0.32.0-beta.2. :)

i tried 0.32.0-beta.5. it didn't work well. wasmi-labs/wasmi#934

i tried v0.32.0-beta.7. it worked. #143

Great! I am very happy that it works now. 🥳 Thanks a lot for sharing your benchmarks!

Although the runtime numbers for Wasmi (~31s) look very off, especially when compared to old Wasmi at ~34s. In all practical (non-artificial) benchmarks we have conducted so far between the two engines, we usually found at least a 50-150% performance improvement from old Wasmi to new Wasmi. However, we also saw one incident with our benchmark runners that experienced extreme inefficiency and thus nearly the same performance with both engines; maybe that's what is happening on your machine, too. The startup performance numbers look as expected, though.

Unfortunately I cannot tell what is or was causing the extreme inefficiency for our old benchmark runners (which look similar to yours) and whether it is even fixable. We know it was an inefficiency problem (or some sort of bug) because we saw that the new Wasmi executed via Wasmtime was actually faster than when run natively, which does not make sense at all.

heh.
maybe you crafted code which cranelift compiles better than llvm. :-)
i've seen similar with toywasm on wamr aot.

edit: Here is a screenshot of someone benchmarking many different execution engines, amongst others also Wasmi (stack), Wasmi (register), and Wasm3, and as you can see, at least in this independent set of benchmarks Wasmi (register) even outperformed Wasm3 in one of the 2 tests by ~10%. Note though that these benchmarks are already quite dated and a lot of improvements have made it into Wasmi (register) since then. So without the miscompilation (or whatever is causing the inefficiency) I would expect Wasmi (register) to at least be on par with Wasm3. https://forum.polkadot.network/t/announcing-polkavm-a-new-risc-v-based-vm-for-smart-contracts-and-possibly-more/3811/52?u=robinf

nice.

I really really hope I can find out what is causing those inefficiencies (or miscompilations) on some of those hardware systems. :S

as you know, performance is sometimes subtle and interesting. i want to know the cause too.

the output of "toywasm --version" includes a version string like "toywasm v43.0.0".
in the past, when i changed it to be a bit longer, e.g. "toywasm v43.0.0-foo", it changed the coremark number by ~10%.
my wild guess was that it was something related to the layout of the executable, but i couldn't find the cause.

@Robbepop
Copy link
Contributor Author

maybe you crafted code which cranelift compiles better than llvm. :-)
i've seen similar with toywasm on wamr aot.

Uhhh, that is actually an interesting thought! When testing Wasm performance for Wasmi, the pipeline went through LLVM -> wasm-opt -> Cranelift (via Wasmtime), so this thought didn't occur to me, but it is for sure interesting.

My main problem is that I do not have the hardware for which Wasmi compiles badly to tinker with. On my own machine Wasmi compiles okay, and on hitchhooker's machine it compiles even better.

My suspicion is the execution loop, which is based on a vanilla loop+match construct that is very fragile with respect to optimizations. I wish I could use a tail-call based dispatcher because it allows for more control over the code, but guaranteed tail calls sadly do not exist in Rust as of today.
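For readers unfamiliar with the term, a loop+match dispatcher looks roughly like this minimal sketch (an illustration, not wasmi's actual instruction set). The single hot `match` has to be compiled into an efficient dispatch (ideally a jump table), and small codegen differences across targets can perturb that, which is one way such an interpreter becomes fragile with respect to optimizations.

```rust
// Toy stack-machine interpreter using loop+match dispatch.

#[derive(Clone, Copy)]
enum Instr {
    Push(i64),
    Add,
    Halt,
}

fn run(code: &[Instr]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    let mut pc = 0;
    loop {
        // the hot dispatch point: one match over all opcodes
        match code[pc] {
            Instr::Push(value) => stack.push(value),
            Instr::Add => {
                let rhs = stack.pop().unwrap();
                let lhs = stack.pop().unwrap();
                stack.push(lhs + rhs);
            }
            Instr::Halt => return stack.pop().unwrap_or(0),
        }
        pc += 1;
    }
}

fn main() {
    use Instr::*;
    assert_eq!(run(&[Push(40), Push(2), Add, Halt]), 42);
}
```

A tail-call based dispatcher would instead give each opcode its own handler function that jumps directly to the next handler, giving finer control over the generated code for each instruction.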

as you know, performance is sometimes subtle and interesting. i want to know the cause too.

I will keep you updated if I ever find out. 🙈

the output of "toywasm --version" includes a version string like "toywasm v43.0.0".
in the past, when i changed it to be a bit longer, eg. "toywasm v43.0.0-foo", it changed coremark number by ~10%.
my wild guess was it was something related to the layout in the executable. but i couldn't find the cause.

Oh wow, that sounds super awkward, too! Optimization in those scenarios can feel a bit cumbersome because it feels like we are missing some knobs for control in certain areas (e.g. executable layout).

yamt added a commit that referenced this issue Apr 1, 2024
yamt added a commit that referenced this issue Apr 1, 2024