
Disable fused causal attention #14732

Merged 3 commits into main on Feb 21, 2023

Conversation

tianleiwu (Contributor) commented Feb 17, 2023

Description

There is an accuracy regression in the GPT-2 model: the top-1 match rate (vs. the PyTorch model) drops by about 1%. The cause is that fused causal attention uses fp16 accumulation. This change disables it by default; users can set the environment variable ORT_ENABLE_FUSED_CAUSAL_ATTENTION=1 to turn it on manually.
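For users who accept the fp16-accumulation trade-off, a minimal sketch of opting back in from Python could look like the following. The assumption that the variable must be set before the session (and its CUDA kernels) is created is mine, not something stated in this PR:

```python
import os

# Assumption: the flag is read when the CUDA attention kernel is set up,
# so set it before onnxruntime creates the session.
os.environ["ORT_ENABLE_FUSED_CAUSAL_ATTENTION"] = "1"

import onnxruntime as ort

# gpt2.onnx is the model produced by the conversion command shown below.
session = ort.InferenceSession("gpt2.onnx", providers=["CUDAExecutionProvider"])
```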

This PR also updates the GPT-2 parity test script to generate left-side padding, to reflect actual usage.
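For context, left-side padding for GPT-2 batches is usually produced along these lines with the Hugging Face tokenizer; this is an illustrative sketch, not the parity script's actual code:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.padding_side = "left"            # pad on the left, as real GPT-2 batching does
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

batch = tokenizer(
    ["hello world", "a much longer example sentence"],
    return_tensors="pt",
    padding=True,
)
# batch["input_ids"] is left-padded; batch["attention_mask"] is 0 at the pad positions.
```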

To test:

python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m gpt2 --output gpt2.onnx -o -p fp16 --use_gpu

The top-1 match rate in the output is on par with ORT 1.13.1.
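The metric itself is argmax agreement between the ONNX Runtime and PyTorch logits. A hypothetical definition (not the parity script's exact implementation) is:

```python
import numpy as np

def top1_match_rate(ort_logits: np.ndarray, torch_logits: np.ndarray) -> float:
    """Fraction of positions where ORT and PyTorch pick the same top-1 next token."""
    ort_top1 = ort_logits.argmax(axis=-1)
    torch_top1 = torch_logits.argmax(axis=-1)
    return float((ort_top1 == torch_top1).mean())
```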

Motivation and Context

wangyems previously approved these changes Feb 17, 2023
yufenglee (Member) commented:

I'm not sure it is a good idea to disable fMHA by default.
The accuracy check is based on dummy inputs, which are meaningless. Intuitively, they tend to generate logits that are more neutral across all tokens, i.e., little or no preference for the next token. It would be better to randomly select 1000 real (meaningful) sentences as the test data set.
In addition, fMHA is only enabled for the context input, not for all iterations.

tianleiwu (Contributor, Author) commented:

> I'm not sure it is a good idea to disable fMHA by default. The accuracy check is based on dummy inputs, which are meaningless. Intuitively, they tend to generate logits that are more neutral across all tokens, i.e., little or no preference for the next token. It would be better to randomly select 1000 real (meaningful) sentences as the test data set. In addition, fMHA is only enabled for the context input, not for all iterations.

Good suggestion. We will improve the test with real sentences and re-evaluate this later.
Based on the current test results, I think it is better to turn it off by default. Even though a 1% drop is small, it is still a regression relative to 1.13.

tianleiwu merged commit c0d2472 into main on Feb 21, 2023
tianleiwu deleted the tlwu/disable_fused_causal_att branch on February 21, 2023 at 17:53
PatriceVignola pushed a commit that referenced this pull request Feb 22, 2023
There is an accuracy regression in the GPT-2 model: the top-1 match rate (vs. the PyTorch model) drops by about 1%. The cause is that fused causal attention uses fp16 accumulation. Disable it by default and add an environment variable ORT_ENABLE_FUSED_CAUSAL_ATTENTION=1 to turn it on manually.

It also updates the GPT-2 parity test script to generate left-side padding, to reflect actual usage.

To test:
```
python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m gpt2 --output gpt2.onnx -o -p fp16 --use_gpu
```
The top-1 match rate in the output is on par with ORT 1.13.1.