fix dropout scaling from p to 1/(1-p) in multihead attention #816

Merged
merged 1 commit into NVIDIA:master on Apr 30, 2020

Conversation

seryilmaz (Contributor)

Fixes the dropout scaling in the backward pass of multihead attention: the dropout mask should be applied with a scale of 1/(1 - p) rather than p.
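To make the fix concrete, here is a minimal plain-PyTorch sketch of inverted dropout showing why the mask must be scaled by 1/(1 - p) and why the backward pass has to reuse the same scale; the helper names are illustrative and this is not the fused apex attention kernel touched by this PR.

```python
import torch

def dropout_forward(x, p):
    # Inverted dropout: drop each element with probability p and scale the
    # survivors by 1/(1 - p), so the expected activation matches eval mode.
    keep_mask = (torch.rand_like(x) > p).to(x.dtype)
    scale = 1.0 / (1.0 - p)
    return x * keep_mask * scale, keep_mask

def dropout_backward(grad_output, keep_mask, p):
    # The backward pass must apply the same mask and the same 1/(1 - p)
    # scale; scaling by p instead silently shrinks the gradients.
    return grad_output * keep_mask * (1.0 / (1.0 - p))
```

In training mode the forward sketch behaves like torch.nn.functional.dropout(x, p, training=True), which uses the same 1/(1 - p) convention.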

kevinstephano (Contributor) left a comment


Looks good. I knew about this change and talked about it with Burc.

ptrblck (Contributor) commented Apr 30, 2020

Thanks! :)

ptrblck merged commit aad9300 into NVIDIA:master on Apr 30, 2020
lcskrishna added a commit to ROCm/apex that referenced this pull request May 7, 2020
* fix dropout scaling from p to 1/(1-p) (NVIDIA#816)

Co-authored-by: Sukru Eryilmaz <[email protected]>

* Improvements to apex.mlp (NVIDIA#804)

* update fused bias relu backward kernel

* add support for not requiring first-layer dgrad

* fix bug: wrong layer in requires-grad check

* add infrastructure for optional bias and activation; currently only supports no bias and no relu (a plain-PyTorch sketch of this option handling follows the commit list)

* make bias and relu optional separately

* add sigmoid activation option

* enable wider load/store for multi_tensor_apply kernels (NVIDIA#763)

* modify MTA axpby for wider load/store

* Make scale/axpby/l2/adam/lamb multi_tensor kernels use wider loads

* Changes to make xentropysoftmax load/store vectorized when possible: (NVIDIA#725)

* Changes to make xentropysoftmax load/store vectorized when possible:
Increase the default ILP so that each thread handles 16 bytes of data in one step
Make each thread load/store the longest vector possible
Make the unroll case handle adjacent data instead of strided data, so the ordering matches the vectorized case

* Add a shift for the non-aligned case. Remove accesses aligned to less than 16 bytes

Co-authored-by: Burc Eryilmaz <[email protected]>
Co-authored-by: Sukru Eryilmaz <[email protected]>
Co-authored-by: Deyu Fu <[email protected]>
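As a rough illustration of the "optional bias and activation" items in the apex.mlp commit above, here is a plain-PyTorch sketch in which bias and activation are chosen independently; the function name and signature are hypothetical and do not correspond to apex's fused MLP API.

```python
import torch
import torch.nn.functional as F

def mlp_forward(x, weights, biases=None, activation="relu"):
    # Hypothetical helper: runs a stack of linear layers where the bias and
    # the activation (relu, sigmoid, or none) are each optional.
    for i, w in enumerate(weights):
        x = x.matmul(w.t())          # linear layer without bias
        if biases is not None:
            x = x + biases[i]        # optional bias
        if activation == "relu":
            x = F.relu(x)
        elif activation == "sigmoid":
            x = torch.sigmoid(x)
        # activation == "none": leave the layer output linear
    return x
```

A fused implementation folds these per-layer options into a single kernel; the sketch only spells out the branching that the options imply.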