
nullptr dereference in WASM code path #81

Open · jelmervdl opened this issue Mar 15, 2022 · 10 comments
Labels: bug (Something isn't working), wasm

@jelmervdl (Member)

Bug description

When trying to run the WASM compiled version of bergamot-translator, some models (specifically, the student.base models from browsermt/students) produce invalid output. Typically a bunch of repetitions of a single word or character per sentence. Not unlike this bug report.

In an attempt to figure out what was going on, I basically compiled the WASM code path into a native app so I could run it through lldb. This caught a nullptr dereference inside intgemm, ultimately caused by the nullptr that is passed as the `const float* input_bias` argument on this line:

int8PrepareBias((const int8_t *)b->data(), scale_a, 0.0 /*zero_point_a*/, scale_b, 0.0 /*zero_point_b*/, rows(b), cols(b), nullptr/*input_bias*/, val_->data());

This then translates/binds/magics into:

  float unquant_factor = (-1) * ((127.0f / scale_A) * (127.0f / scale_B)) / (127.0f);
  intgemm::Int8Shift::PrepareBias(
      input_B_prepared,
      width,
      cols_B,
      intgemm::callbacks::UnquantizeAndAddBiasAndWrite(unquant_factor, input_bias, output));
}

That nullptr then ends up here as config.bias_addr:

template <> class CallbackImpl<CPUType::CPU_NAME, UnquantizeAndAddBiasAndWrite> {
  ...
    auto result = kernels::unquantize(input, mult_reg);
    result = kernels::add_bias(result, config.bias_addr, info.col_idx);
    kernels::write(result, config.output_addr, info.row_idx * info.cols + info.col_idx);
  ...
};

And I'm not sure which of the implementations of kernels::add_bias it ends up at, but none of them seem to be happy with a nullptr.
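For illustration, the add_bias kernels all boil down to something like this (a minimal sketch of my own, not intgemm's actual code):

#include <immintrin.h>
#include <cstddef>

// Every add_bias variant loads a vector of bias values from
// bias_addr + col_idx and adds it to the accumulator. The load is
// unconditional, so a null bias_addr gets dereferenced on every call.
inline __m128 add_bias_sketch(__m128 input, const float* bias_addr, std::size_t col_idx) {
  __m128 bias_term = _mm_loadu_ps(bias_addr + col_idx);  // crashes when bias_addr == nullptr
  return _mm_add_ps(input, bias_term);
}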

As an experiment, I re-implemented the fallback function to handle the nullptr and call the callback without the bias term if the bias was null:

extern "C" void int8PrepareBias(const int8_t* input_B_prepared,
                                        float scale_A,
                                        float zero_point_A,
                                        float scale_B,
                                        float zero_point_B,
                                        Index width,
                                        Index cols_B,
                                        const float* input_bias,
                                        float* output) {
  float unquant_factor = (-1) * ((127.0f / scale_A) * (127.0f / scale_B)) / (127.0f);
  if (input_bias == nullptr) {
    intgemm::Int8Shift::PrepareBias(
        input_B_prepared,
        width,
        cols_B,
        intgemm::callbacks::UnquantizeAndWrite(unquant_factor, output));
  } else {
    intgemm::Int8Shift::PrepareBias(
        input_B_prepared,
        width,
        cols_B,
        intgemm::callbacks::UnquantizeAndAddBiasAndWrite(unquant_factor, input_bias, output));
  }
}

That fixes both the nullptr dereference and the broken model output in my non-WASM build of the WASM code path.

I'm reporting this as a bug as I imagine there was some reasoning behind writing a nullptr there.

I'm also surprised that what seems to be buggy code compiled to a (mostly) functioning WASM build that works for the tiny models. The base and tiny11 models don't seem to differ all that much: same layers, it seems, just larger in base?

Lastly, I'm not sure how to fix this properly. The quick hack above won't be enough on its own, since it only patches the fallback code path; something similar would need to be added to the intgemm code in Mozilla's tree.
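For completeness, the other obvious direction would be a caller-side fix: never pass a nullptr in the first place. A sketch (untested, and assuming the bias has one entry per output column, i.e. cols(b)):

std::vector<float> zero_bias(cols(b), 0.0f);  // all-zero bias instead of nullptr
int8PrepareBias((const int8_t *)b->data(), scale_a, 0.0 /*zero_point_a*/,
                scale_b, 0.0 /*zero_point_b*/, rows(b), cols(b),
                zero_bias.data() /*input_bias*/, val_->data());

That trades an extra allocation per prepared matrix for never handing any implementation a null pointer.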

Reproduce

Tested by building app/bergamot.cpp from bergamot-translator, after patching the cmake files to not pass emcc-specific flags when COMPILE_WASM is defined. I also changed wasm_intgemm_interface.h to remove the compiler attributes, and wasm_intgemm_fallback.cpp to remove all the "Fallback" bits from the function names so that those functions get called directly.

I also needed to patch the config.intgemm8bitalpha.yml to include:

max-length-break: 128
mini-batch-words: 1024

After that, this works (or, without the patch to fallback.cpp, gives a useful crash):

> lldb -- app/bergamot --model-config-paths ~/.config/translateLocally/ende.student.base-1647129297/config.intgemm8bitalpha.yml
(lldb) target create "app/bergamot"
Current executable set to '/Users/jelmer/Workspace/statmt/firefox-translations/bergamot-translator/build/app/bergamot' (x86_64).
(lldb) settings set -- target.run-args  "--model-config-paths" "/Users/jelmer/.config/translateLocally/ende.student.base-1647129297/config.intgemm8bitalpha.yml"
(lldb) run
Process 3022 launched: '/Users/jelmer/Workspace/statmt/firefox-translations/bergamot-translator/build/app/bergamot' (x86_64)
Hello world!
Hallo Welt!
Process 3022 exited with status = 0 (0x00000000)
@jelmervdl (Member, Author)

Can confirm: this caused the result in jelmervdl/translatelocally-web-ext#14. Recompiling the WASM code with the patch mentioned above and forcing the fallback path fixes translation.

@jelmervdl (Member, Author)

The nullptr was introduced at the very beginning. I'm even more confused now about how this code "worked" for so long.

@XapaJIaMnu (Collaborator)

I just took a look at this. @jelmervdl, you are right: the problem is in the WASM intgemm fallback. Whoever implemented it mapped PrepareBias and PrepareFakeBias to the same function, passing a nullptr in the fake-bias case, which is clearly not correct. The fix you wrote for the fallback is the correct one. You should PR it.
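To illustrate the mapping (hypothetical stand-ins for the real marian types and argument lists, not the actual source):

// Both node ops funnel into the same interface function; only the
// bias argument differs, and the fake-bias op passes nullptr.
void int8PrepareBias_sketch(const float* input_bias, float* output) {
  output[0] = input_bias[0];  // stands in for the unconditional read of the bias
}

void PrepareBiasForBNodeOp_sketch(const float* bias, float* out) {
  int8PrepareBias_sketch(bias, out);     // real bias: fine
}

void PrepareFakeBiasForBNodeOp_sketch(float* out) {
  int8PrepareBias_sketch(nullptr, out);  // nullptr: the bug
}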

@jelmervdl (Member, Author)

@abhi-agg could you confirm you've also seen this, and help get this fix into Firefox?

@abhi-agg commented Mar 25, 2022

> I'm reporting this as a bug as I imagine there was some reasoning behind writing a nullptr there.
>
> I'm also surprised that what seems to be buggy code compiled to a (mostly) functioning WASM build that works for the tiny models.
>
> Lastly, I'm not sure how to fix this. The quick hack above won't be enough on its own, since it only patches the fallback code path; something similar would need to be added to the intgemm code in Mozilla's tree.
>
> @abhi-agg could you confirm you've also seen this, and help get this fix into Firefox?

The goal of the wasm gemm interface was to support the fastest configuration int8shiftAlphaAll with the student tiny models. This combination must not be invoking PrepareFakeBiasForBNodeOp and that's why we never saw this issue. However, it could also have been caught during the review of the PR.

I am not sure if #82 is the right way to resolve this issue because we always expect an input bias with non-zero size.

This is more of an API question which needs to be discussed here before we land anything in Firefox:

  1. When is PrepareFakeBias invoked, and why is there a need to prepare a fake bias at all?
  2. What's the difference between the student base and tiny models in the context of the PrepareFakeBias call? It seems to me that PrepareFakeBias is being invoked for student base models but not for student tiny models.
  3. Can PrepareFakeBias ever be invoked for tiny models with int8shiftAlphaAll?
  4. Can PrepareFakeBias ever be invoked for tiny models with int8 (to support Ukrainian-centric models as per Dump of Ukranian-centric models mozilla/firefox-translations#166 (comment))?

Perhaps @XapaJIaMnu or @kpu can throw some light? cc @andrenatal @lonnen

@jelmervdl (Member, Author)

> The goal of the wasm gemm interface was to support the fastest configuration int8shiftAlphaAll with the student tiny models. This combination must not be invoking PrepareFakeBiasForBNodeOp and that's why we never saw this issue. However, it could also have been caught during the review of the PR.

Sorry, my surprise wasn't about how this bug could have made it into the running code, but about how it didn't cause issues earlier, since it looks to me like that code is called for both the base and tiny11 models I tested.

I've attached two log files, for a tiny model and its related base model, to compare the intgemm function calls. It shows that both the tiny and the base models make the call with the nullptr (e.g. line 2330), but somehow it doesn't seem to have any noticeable impact on tiny models. Maybe that's a bug of its own (a graph node whose output isn't used?).

tiny.log
base.log

@abhi-agg

> It shows that both the tiny and the base models make the call with the nullptr (e.g. line 2330), but somehow it doesn't seem to have any noticeable impact on tiny models.

That is surprising. Are you using the exact same model config for both the tiny and base models? May I ask which gemm precision you are using?

@jelmervdl (Member, Author)

gemm-precision: int8shiftAlphaAll in my native tests, and the same for the tests I did in Firefox.

Here is a script that produces such intgemm call log files: https://gist.github.com/jelmervdl/7dc651fc53889d016261eaa0b3f30db8

It should work with Node.js 17 and up. The paths right now assume you place it in the wasm/ subdirectory of the WASM build code, but that's easy to change. It downloads the tiny en-de model and then produces a file intgemm-calls.json. Then it's a bit of searching through that file for "int8_prepare_bias" calls with 0 as the second-to-last argument, but if you start at the bottom it should pop up fairly quickly.

@XapaJIaMnu (Collaborator) commented Mar 29, 2022

@abhi-agg the thing is that different models take different code paths inside marian. The tiny model (apparently) only uses the affine operation during decoding, whereas the base model also makes use of dot; dot doesn't have a bias, while affine does.

There is a reason why we implement both. And indeed maybe nobody noticed the discrepancy at the time of the review, but now we did and it should be fixed.

Edit:
PrepareFakeBias is used when we want to take the faster code path (which always requires a bias) for the dot operation, which doesn't have one.
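In other words (my reading of the shifted scheme, with the exact factors simplified): shifting A into unsigned range adds a constant multiple of each column sum of B to every output, and PrepareBias pre-subtracts that term from the bias. A conceptual sketch:

#include <cstddef>
#include <cstdint>

// prepared_bias[j] = bias[j] - correction over colsum_j(B). With dot there
// is no real bias, so an all-zero "fake" one still needs the correction.
void prepare_bias_sketch(const float* bias_or_null, const int32_t* colsums_B,
                         float correction_factor, float* out, std::size_t cols) {
  for (std::size_t j = 0; j < cols; ++j) {
    float base = bias_or_null ? bias_or_null[j] : 0.0f;  // fake bias == zeros
    out[j] = base - correction_factor * static_cast<float>(colsums_B[j]);
  }
}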

@kpu (Member) commented Mar 29, 2022

It's a bug.

> This is more of an API question which needs to be discussed here before we land anything in Firefox […]
>
> Perhaps @XapaJIaMnu or @kpu can throw some light? cc @andrenatal @lonnen

You, @abhi-agg, wrote the line that calls with nullptr:

https://github.com/browsermt/marian-dev/blame/53c4f7e4537dbf7782c583e98f50e513f0a27541/src/tensors/cpu/intgemm_interface.h#L386

You've been handed the fix on a silver platter and it is a short fix.

If arguing about the API is easier for you than putting the fix in Firefox, that's a process problem with Firefox.
