nullptr dereference in WASM code path #81
Comments
Can confirm, this caused the result in jelmervdl/translatelocally-web-ext#14. Recompiling the wasm code with the patch mentioned above, and forcing the fallback path fixes translation. |
The nullptr was introduced at the very beginning. I'm even more confused now how this code "worked" for so long. |
I just took a look at this. @jelmervdl you are right, the problem is in the wasm intgemm fallback, because whoever implemented it mapped |
The goal of the wasm gemm interface was to support the fastest configuration. I am not sure if #82 is the right way to resolve this issue because we always expect an input bias with non-zero size. This is more of an API question which needs to be discussed here before we land anything in Firefox:
Perhaps @XapaJIaMnu or @kpu can throw some light? cc @andrenatal @lonnen |
Sorry, my surprise wasn't how could this bug have made it into the running code, but how did it not cause issues earlier, since it looks to me like that code is called for both the base and tiny11 models I tested. I've attached two log files, for a tiny model and its related base model, to compare the intgemm function calls. They show that both the tiny and the base models call with the nullptr (e.g. line 2330), but somehow it doesn't seem to have any noticeable impact on tiny models. Either that's a bug on its own (maybe a graph node whose output isn't used?) |
That is surprising. Are you using the exact same model config for both tiny and base models? May I ask the gemm precision that you are using? |
Here is a script that produces such intgemm call log files: https://gist.github.com/jelmervdl/7dc651fc53889d016261eaa0b3f30db8 It should work with nodejs 17 and up. The paths right now are such that it should work with the wasm build code if you place it in the wasm/ subdirectory, but that's pretty easy to change. It downloads the tiny en-de model and then produces a file |
@abhi-agg the thing is different models take different code paths inside marian. The tiny model (apparently) only uses the affine operation during decoding, whereas the base model also makes use of dot and dot doesn't have a bias, whereas affine does. There is a reason why we implement both. And indeed maybe nobody noticed the discrepancy at the time of the review, but now we did and it should be fixed. Edit: |
It's a bug
You @abhi-agg wrote the line that calls with You've been handed the fix on a silver platter and it is a short fix. If arguing about the API is easier for you than putting the fix in Firefox, that's a process problem with Firefox. |
Bug description
When trying to run the WASM compiled version of bergamot-translator, some models (specifically, the student.base models from browsermt/students) produce invalid output. Typically a bunch of repetitions of a single word or character per sentence. Not unlike this bug report.
In an attempt to figure out what was going on, I basically compiled the WASM code path into a native app so I could run it through `llvm`. This caught a nullptr dereference inside intgemm which was ultimately caused by the `nullptr`, which is supposed to be `const float* input_bias`, on this line:
marian-dev/src/tensors/cpu/intgemm_interface.h
Line 386 in 53c4f7e
This then translates/binds/magics into
marian-dev/src/tensors/cpu/wasm_intgemm_fallback.cpp
Lines 66 to 72 in 53c4f7e
That nullptr then ends up here as `config.bias_addr`:
And I'm not sure which of these implementations of `kernels::add_bias` it ends up at, but none of these seem to be happy with `nullptr`.
As an experiment, I re-implemented the fallback function to handle the nullptr and call the callback without the bias term if the bias was null:
That fixes both the nullptr dereference error and the broken model output for my non-wasm wasm build.
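To make the idea concrete, here is a minimal, self-contained sketch of that kind of guard. It is not the actual patch and not the real intgemm or marian-dev signature: `multiplyWithOptionalBias` and all of its parameters are hypothetical names, and the intgemm callback dispatch is replaced by a naive loop. It only illustrates the pattern of skipping the bias term when the caller passes `nullptr`, as the bias-free dot path does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the fallback's "multiply and add bias" entry point.
// This is NOT the real intgemm / marian-dev signature.
// a is rows_a x width and b is width x cols_b, both row-major int8.
// bias is either nullptr (dot, no bias) or points at cols_b floats (affine).
inline void multiplyWithOptionalBias(const std::int8_t* a,
                                     const std::int8_t* b,
                                     const float* bias,
                                     float unquant_multiplier,
                                     std::size_t rows_a,
                                     std::size_t width,
                                     std::size_t cols_b,
                                     float* output) {
  assert(a != nullptr && b != nullptr && output != nullptr);
  for (std::size_t r = 0; r < rows_a; ++r) {
    for (std::size_t c = 0; c < cols_b; ++c) {
      // Accumulate the int8 products in 32 bits.
      std::int32_t acc = 0;
      for (std::size_t k = 0; k < width; ++k) {
        acc += static_cast<std::int32_t>(a[r * width + k]) *
               static_cast<std::int32_t>(b[k * cols_b + c]);
      }
      float value = static_cast<float>(acc) * unquant_multiplier;
      // Only read the bias when one was actually supplied; a null bias
      // means "unquantize and write" without the add-bias step.
      if (bias != nullptr) {
        value += bias[c];
      }
      output[r * cols_b + c] = value;
    }
  }
}
```

Called with a real bias pointer this behaves like the affine case; called with `nullptr` it behaves like the dot case instead of dereferencing a null pointer.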
I'm reporting this as a bug as I imagine there was some reasoning behind writing a `nullptr` there.
I'm also surprised that what seems to be buggy code compiled to a (mostly) functioning wasm build that works for the tiny models. The base and tiny11 models don't seem to differ all that much. Same layers it seems, they're just larger in base?
Lastly, I'm not sure how to fix this. The quick hack above won't work since that's the fallback code path. Something similar would need to be added to the intgemm code in Mozilla's tree.
Reproduce
Tested by building app/bergamot.cpp from bergamot-translator, after patching cmake files to not pass emcc specific flags when COMPILE_WASM is defined. Also changed wasm_intgemm_interface.h to remove the compiler attributes, and wasm_intgemm_fallback.cpp to remove all the `Fallback` bits from the names so that those functions get called directly.
I also needed to patch the config.intgemm8bitalpha.yml to include:
After that, this works (or without the patch to fallback.cpp, gives a useful crash):