Benchmarks #8
thank you.
i ran it on my macbook. (MBP 15-inch 2018)
Yeah maybe you did. I think this benchmark with so many different runtimes could even be extended into its own project/repo. Btw. in order to highlight the strength of the approach you took in your
done. (actually i just pushed an unpushed result i found in my local repo.)
maybe. i myself am not interested in maintaining such a thing right now though.
thank you for your insights. i agree.
i added a benchmark about startup time and memory consumption.
Thank you for those benchmarks!
@yamt thanks to your memory consumption benchmarks I took a better look at
it's good to hear! thank you for letting me know. noted for the next run of the benchmark.
looking forward :)
Oh wow, that's super interesting news since
Having a Wasm runtime that is up to date with the standardised proposals is obviously very nice.
i have a similar feeling. but i added it mainly for completeness. having said that, these "do more work per instruction" style instructions can be rather friendly to
At Parity we even found that the Wasm generated from Rust source is slightly smaller when enabling Wasm SIMD, which is obviously great since translation time can be significant for some practical use cases and it usually scales linearly with Wasm blob size. I assume you used 64-bit cells for the value stack before the introduction of SIMD to
i guess it uses llvm as a backend?
in toywasm, the value stack cell size depends on build-time configurations. before simd, there were two configurations:
after simd, there are three:
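As a rough illustration of the cell-size trade-off (hypothetical types of my own, not toywasm's actual build configurations): with narrow cells, wide values span multiple cells; with 128-bit cells, every value fits in one cell but narrow values waste space.

```rust
// Hypothetical value-stack cell layouts for an interpreter.
// These are illustrative types, NOT toywasm's actual configurations.

/// 32-bit cells: i32 fits in one cell; i64/f64 span two; v128 spans four.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct Cell32(u32);

/// 64-bit cells: every non-SIMD value fits in one cell; v128 spans two.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct Cell64(u64);

/// 128-bit cells: v128 fits in one cell, but a plain i32 wastes 12 bytes.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct Cell128(u128);

/// Number of cells a value of `value_bits` width needs for a given cell width.
fn cells_needed(value_bits: usize, cell_bits: usize) -> usize {
    (value_bits + cell_bits - 1) / cell_bits
}
```

The interesting consequence is that with non-64-bit cells, the offset of a value on the stack depends on the types of the values below it, which the interpreter has to account for somehow.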
ah, that's very interesting and perfect for research about what cell size is the best for which use case. :) how much complexity did this add to the interpreter compared to having for example fixed 64-bit cell sizes?
yes, that's correct. very interesting approach and looking forward to all the results that you are going to pull off of this. :) in the past I have been using
however, this is a rather artificial benchmark and probably less ideal than your
Given these severe differences I think it is kinda important to tag your own benchmark results for reproducibility with the hardware (mainly CPU) and OS used.
originally toywasm was using fixed 64-bit cells. besides that, i introduced i suppose that it can be simpler for "translating" interpreters like
interesting. i haven't thought about cpu differences much.
What just crossed my mind about cell sizes and SIMD support is the following: maybe it is practical and efficient to have 2 different stacks, e.g. one stack with 64-bit cells and another stack with 128-bit cells. Both stacks are used simultaneously (push, pop) but exclusively for non-SIMD and SIMD instructions respectively. Due to the Wasm validation phase and its type checks it should probably be possible to support SIMD without touching the already existing stack, and without using the 128-bit cell stack at all (and thus not affecting non-SIMD code) when no SIMD instructions are used. Maybe I am overlooking something here. Although if this was efficient I assume it might introduce less complexity than different cell sizes or having SIMD instructions use 2 cells instead of 1. I am way into speculation here. Implementation/time needed to confirm, haha.
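A minimal sketch of the two-stack idea (names and representation are my own assumptions, not any real engine's code): since Wasm validation decides statically which stack each instruction touches, no runtime tag is needed, and non-SIMD code never touches the vector stack.

```rust
/// Hypothetical split-stack layout: scalars on a 64-bit-cell stack,
/// v128 values on a separate 128-bit-cell stack. Validation guarantees
/// each push/pop statically targets the right stack.
struct SplitStack {
    scalars: Vec<u64>,  // i32/i64/f32/f64
    vectors: Vec<u128>, // v128 only; stays empty for non-SIMD code
}

impl SplitStack {
    fn new() -> Self {
        Self { scalars: Vec::new(), vectors: Vec::new() }
    }
    fn push_scalar(&mut self, v: u64) { self.scalars.push(v); }
    fn pop_scalar(&mut self) -> u64 { self.scalars.pop().expect("validated") }
    fn push_v128(&mut self, v: u128) { self.vectors.push(v); }
    fn pop_v128(&mut self) -> u128 { self.vectors.pop().expect("validated") }
}
```

One wrinkle this sketch glosses over: instructions that move values between the domains (e.g. `i32x4.extract_lane`) pop from one stack and push onto the other, and unwinding on a branch has to truncate both stacks.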
it's an interesting idea.
i reran the benchmarks: wasmi has been improved a lot since the last time. (0.27.0)
Awesome work @yamt and thanks a ton for those benchmarks! 🚀 I am especially fond of the fact that there is nearly no difference between
Looks like a very successful research conclusion to me for your SIMD implementation in
Concerning
i guess ffmpeg.wasm (or probably any C program) is linear-memory intensive rather than value-stack intensive.
hmm. wrt 0.27.0, it might be an error on my side.
Ah I thought you were simply installing
things like
it reminded me that, while i wanted to use lto=full for toywasm, cmake insisted on using lto=thin. (lines 83 to 87 in 9c88f24)
in the meantime, i added a warning about this: 9ee47bf
If you want to benchmark
So instead of
Ever thought of writing a blog post with all your benchmarks about Wasm runtimes? :D Seems like you could pull off quite a bit of information there.
thank you. i commented in the PR.
i have no interest in a blog right now.
i reran with it and pushed the results. i also updated the procedure for wasmer. (it was not clearing the cache as i intended.)
Hi @yamt , thanks a lot for updating me about this! The new
It is interesting that your
Btw.: I am currently working on a new engine for
good.
actually, it seems that whether toywasm or iwasm classic is faster depends on the specific app being run.
interesting.
No runtime can be efficient for all the use cases - at least that's what I learned from working on Due to your awesome benchmarks I see a huge potential in lazy Wasm compilation for fixing one such weak spot for startup time since a few benchmarked runtimes profit quite a bit because of their lazy compilation and/or Wasm validation.
Yes it is. Although still super WIP at this point. Everything is subject to change. Still trying to figure out the best designs for tackling certain problems/challenges. Trade-offs here and there, I just hope all the work will be worth it in the end. I was trying to read code from Wasm3 and the WAMR fast interpreter for inspiration on certain problems but in all honesty I find both rather hard to read. Not used to reading high-density C code.
sure. being considerably slower than a similar engine (in my case iwasm classic) for a specific app is likely a sign of weak spots, or even a bug.
i wonder how common sparsely-used wasm modules like ffmpeg.wasm are.
a lot of interesting ideas in the PR. i'm looking forward to seeing how it performs.
thank you for the explanation. wrt jit-bombing, i have a few crafted wasm modules in https://github.com/yamt/toywasm/tree/master/wat.
that's super cool! 🚀
did you also test Wasmer singlepass? They claim linear-time Wasm -> x86 machine code generation and while working on Wasm ->
no. iirc wasmer allows only a small number of function results. (1000?)
It depends on how bad it is:
As a rule of thumb in
ok. it makes sense. toywasm at this point prefers simplicity and uses O(n^2) logic in a few places. (e.g. import/export handling)
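For what it's worth, the simplicity-vs-complexity trade-off mentioned above is easy to sketch (hypothetical helper names, not toywasm's API): a linear scan makes resolving n imports against n exports O(n^2) overall, while a one-time hash index brings it down to roughly O(n).

```rust
use std::collections::HashMap;

/// Simplicity-first lookup: a linear scan over the export list.
/// Resolving n imports this way costs O(n^2) total.
fn find_export_linear(exports: &[(String, u32)], name: &str) -> Option<u32> {
    exports.iter().find(|(n, _)| n == name).map(|&(_, idx)| idx)
}

/// One-time index build; each subsequent lookup is O(1) amortized.
fn build_export_index(exports: &[(String, u32)]) -> HashMap<&str, u32> {
    exports.iter().map(|(n, i)| (n.as_str(), *i)).collect()
}
```

For small modules the linear scan is perfectly fine and avoids an extra allocation, which is presumably why a simplicity-oriented runtime would prefer it.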
While implementing support for Wasm
Instead of simply not supporting Wasm
i don't understand what's difficult with multi-value.
interesting idea.
Let me provide you with an example Wasm function that would be problematic according to my personal knowledge:

```wat
(func (param i32) (result i32 i32)
  (i32.const 10)
  (i32.const 20)
  (br_if 0 (local.get 0))
  (br_if 0 (local.get 0))
)
```

Each of the
Now imagine not having just 2
There are also examples that involve block parameters/results and not just function results, which are also allowed by the
For the simple example above we could probably come up with an optimization, but it quickly gets more and more complex, so that we cannot really fix this attack vector by implementing more and more specialized optimization variants, as you probably can imagine.
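A toy cost model of the problem sketched above (my own sketch, not wasmi's actual translator): if every conditionally-taken exit has to copy all pending results into the slots the caller expects, the emitted copy count grows with branches × results, i.e. quadratically when both scale with module size.

```rust
/// Count the copy instructions a hypothetical register-machine translator
/// would emit: one copy per result value, per conditional branch that may
/// exit the function. Two br_if's with two results already cost 4 copies.
fn emitted_copies(num_br_if: usize, num_results: usize) -> usize {
    let mut copies = 0;
    for _branch in 0..num_br_if {
        // each `br_if` needs its own sequence moving every pending
        // result into its ABI-mandated slot
        copies += num_results;
    }
    copies
}
```

With a crafted module of 1000 branches and 1000 results, that is already a million copies, which is the JIT-bombing shape being discussed.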
thank you for the explanation. when you say quadratic, do you mean O(n*m) where n = number of br_if and m = number of values in the return type? if so, isn't the wasm validation logic itself already quadratic, regardless of register allocation?
Ah yeah you are totally right! So it is even worse than I thought. I haven't had the validation logic in mind since
Hey @yamt , have you ever taken note of Ben Titzer's Wizard Wasm runtime?
From what I know it uses a similar approach as
i have heard of the runtime. but i didn't know anything beyond that it was written in an exotic language. the slides seem to suggest their interpreter performance is comparable to wasm3's. unfortunately their wasi support is too incomplete to run the benchmarks i usually use though.
ah that is very unfortunate!
I think Wasm3 is still a bit faster, but yes, performance seems to be at least comparable. From what I know it was written in raw assembler to achieve this kind of performance. So it is not super portable.
wow.
I just released
This adds the register-machine bytecode based engine executor. Benchmarks so far concluded that it compiles roughly 30% slower and executes roughly 80-100% faster, reaching more or less Wasm3 performance in some instances. Although there are some performance issues with certain machine architectures that need to be fixed before the stable release.
This version also adds lazy function compilation which combined with unchecked Wasm validation (via
Given that toywasm experiments with faster startup times I wonder if lazy function translation could be interesting for it as well. At least for Wasmi (and Wasm3) it turned out to be extremely successful. Idea: Doing nothing is still faster than doing minimal work. :)
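A minimal sketch of lazy per-function translation as described above (assumed names and a dummy translation; not Wasmi's or toywasm's actual code): each function body is translated on first call and cached, so functions that are never called cost nothing at startup.

```rust
use std::cell::OnceCell;

/// A function whose engine-internal bytecode is produced on demand.
struct LazyFunc {
    wasm_body: Vec<u8>,           // raw Wasm bytes, kept until translated
    compiled: OnceCell<Vec<u32>>, // engine bytecode, filled on first call
}

impl LazyFunc {
    fn new(wasm_body: Vec<u8>) -> Self {
        Self { wasm_body, compiled: OnceCell::new() }
    }

    /// Translate on first use; later calls return the cached bytecode.
    /// (The "translation" here is a dummy byte-widening pass.)
    fn bytecode(&self) -> &[u32] {
        self.compiled
            .get_or_init(|| self.wasm_body.iter().map(|&b| b as u32).collect())
    }

    fn is_compiled(&self) -> bool {
        self.compiled.get().is_some()
    }
}
```

Note that the `OnceCell` introduces interior mutability into the loaded module, which is exactly the property a strictly read-only module representation gives up.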
sounds great. i will use that version (or later version) when i run the benchmark next time.
i have no plan to do it in toywasm for now because, in the current implementation, loaded modules are completely read-only and i like it. |
I can understand that. Indeed lazy Wasm validation and translation added a lot of complexity to the codebase. Keeping things simple is also a very valuable strength in software. For the next benchmark runs: I added support for lazy Wasm validation via
ok!
i tried 0.32.0-beta.5. it didn't work well. wasmi-labs/wasmi#934
i tried v0.32.0-beta.7. it worked. #143
Great! I am very happy that it works now. 🥳 Although the runtime numbers for Wasmi (~31s) look very off, especially when compared to old Wasmi with ~34s. In all practical (non-artificial) benchmarks we have conducted so far between the two engines we usually found at least a 50-150% performance improvement from old Wasmi to new Wasmi. Unfortunately I cannot tell what is or was causing the extreme inefficiency for our old benchmark runners (which look similar to yours) and if it is even fixable. We know it was an inefficiency problem (or some sort of bug) because we saw that the new Wasmi executed via Wasmtime was actually faster than when run natively, which does not make sense at all.
edit: Here is a screenshot of someone benchmarking many different execution engines, amongst others also Wasmi (stack), Wasmi (register) and Wasm3, and as you can see, at least in this independent set of benchmarks Wasmi (register) even outperformed Wasm3 in one of the 2 tests by ~10%. Note though that these benchmarks are already quite dated and a lot of improvements have made it into Wasmi (register) since then. So without miscompilation (or whatever is causing the inefficiency) I can expect Wasmi (register) to at least be on par with Wasm3. I really really hope I can find out what is causing those inefficiencies (or miscompilations) on some of those hardware systems. :S
heh.
nice.
as you know, performance is sometimes subtle and interesting. i want to know the cause too. the output of "toywasm --version" includes a version string like "toywasm v43.0.0".
Uhhh, that is actually an interesting thought! When testing Wasm performance for Wasmi the pipeline went through LLVM ->
My main problem is that I do not have the hardware for which Wasmi compiles badly to tinker with it. On my own machine Wasmi compiles okay and on hitchhooker's machine it compiles even better. My suspicion is the execution loop, which is based on a vanilla loop+match construct that is very fragile with respect to optimizations. I wish I could use a tail-call based dispatcher because it allows for more control over the code, but tail calls sadly do not exist in Rust, as of today.
I will keep you updated if I ever find out. 🙈
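For readers unfamiliar with the construct being discussed, a vanilla loop+match dispatcher looks roughly like this (a toy instruction set of my own, not Wasmi's): the single big match is what the compiler may or may not lower into an efficient jump table, which is why codegen quality is so fragile across targets.

```rust
/// A toy stack-machine instruction set, just enough to show the shape
/// of a loop+match interpreter dispatch loop.
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Halt,
}

/// Execute a straight-line instruction sequence ending in Halt.
fn execute(ops: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    let mut pc = 0;
    loop {
        // The whole interpreter hinges on how well this match compiles.
        match ops[pc] {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Halt => return stack.pop().unwrap_or(0),
        }
        pc += 1;
    }
}
```

A tail-call based dispatcher would instead give each opcode its own function and jump between them, pinning the dispatch code layout; without guaranteed tail calls in Rust, the loop+match form is the practical fallback.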
Oh wow, that sounds super awkward, too! Optimization in those scenarios can feel a bit cumbersome because it feels like we are missing some knobs for control in certain areas (e.g. executable layout).
Hi @yamt ,
it is really cool that you put so many Wasm runtimes on your benchmarks for comparison!
I have a few questions though.
What hardware did you run the benchmarks on? Would be cool if you could write that down somewhere for reproducibility.
Also I saw that wasmi is included in the script but not showing in the README.md. Did wasmi not work? If it works I'd be interested in the numbers on your machine. :)
Unfortunately running the benchmark script requires quite a setup.