
Refactor GELU and Sigmoid epilogue to use a common template (and add SiLu, Hardswish epilogue) #379

Merged: 12 commits, Dec 18, 2021

Conversation

masahi (Contributor) commented Dec 13, 2021

Since the Sigmoid and GELU epilogue functors are almost identical, I propose to refactor them. The motivation is to make it easier to add other functors.

Now, the Sigmoid and GELU epilogue functors are defined as:

using LinearCombinationSigmoid = LinearCombinationGeneric<Sigmoid, ElementOutput_, Count, ElementAccumulator_,
                                                          ElementCompute_, FloatRoundStyle::round_to_nearest>;

using LinearCombinationGELU = LinearCombinationGeneric<GELU, ElementOutput_, Count, ElementAccumulator_,
                                                       ElementCompute_, FloatRoundStyle::round_to_nearest>;

I'm going to add two new functors for the SiLU and HardSwish activations, which can be implemented in the same manner.
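For illustration, a sketch of what the two new aliases could look like under the same template; the SiLu and HardSwish functor names are placeholders for the activation structs this PR would add:

using LinearCombinationSilu = LinearCombinationGeneric<SiLu, ElementOutput_, Count, ElementAccumulator_,
                                                       ElementCompute_, FloatRoundStyle::round_to_nearest>;

using LinearCombinationHardSwish = LinearCombinationGeneric<HardSwish, ElementOutput_, Count, ElementAccumulator_,
                                                            ElementCompute_, FloatRoundStyle::round_to_nearest>;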

What do you think? @hwu36

masahi (Contributor, Author) commented Dec 14, 2021

OK, I've pushed the two new activations using this common template.

masahi changed the title from "Refactor GELU and Sigmoid epilogue to use a common template" to "Refactor GELU and Sigmoid epilogue to use a common template (and add SiLu, Hardswish epilogue)" on Dec 14, 2021
d-k-b (Collaborator) commented Dec 14, 2021

Thanks for helping clean this up! Can you add documentation with a short description and links to resources to the SiLu and HardSwish structs for future reference?

masahi (Contributor, Author) commented Dec 14, 2021

@d-k-b Thanks for the feedback, the doc is updated.
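For reference, a sketch of the kind of documentation requested. The formulas and links are the standard definitions of these activations; the scalar functors below are illustrative rather than the exact code added in this PR, and they assume the Sigmoid<T> functor and CUTLASS_HOST_DEVICE macro already provided by CUTLASS:

/// SiLU (Sigmoid Linear Unit, also known as Swish): silu(x) = x * sigmoid(x)
/// Reference: https://arxiv.org/abs/1710.05941
template <typename T>
struct SiLu {
  CUTLASS_HOST_DEVICE
  T operator()(T const &x) const {
    Sigmoid<T> sigmoid_op;           // reuse the existing scalar Sigmoid functor
    return x * sigmoid_op(x);
  }
};

/// HardSwish: hardswish(x) = x * relu6(x + 3) / 6
/// Reference: https://arxiv.org/abs/1905.02244 (MobileNetV3)
template <typename T>
struct HardSwish {
  CUTLASS_HOST_DEVICE
  T operator()(T const &x) const {
    T y = x + T(3);                  // relu6(x + 3): clamp to [0, 6]
    y = y < T(0) ? T(0) : (y > T(6) ? T(6) : y);
    return x * y / T(6);
  }
};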

hwu36 (Collaborator) commented Dec 17, 2021

Overall, looks good to me. I think just change FloatRoundStyle::round_to_nearest to Round everywhere.
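For context, a sketch of the requested fix: forward the alias's Round template parameter instead of hard-coding the rounding style. The parameter name follows this comment; the other names follow the snippet in the opening comment:

using LinearCombinationSilu = LinearCombinationGeneric<SiLu, ElementOutput_, Count, ElementAccumulator_,
                                                       ElementCompute_, Round>;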

masahi (Contributor, Author) commented Dec 18, 2021

Oops, thanks for spotting this; fixed.

masahi (Contributor, Author) commented Dec 18, 2021

Not sure why CI has failed.

hwu36 (Collaborator) commented Dec 18, 2021

These 4 activations look expensive to me. You can try to set kIsHeavy to true, check that the generated code is different, and compare the performance to verify it.

If you set them to true, the epilogue will not be fully unrolled, which is good for the I$. Usually, complex activations such as GELU need to be set to true.

masahi (Contributor, Author) commented Dec 18, 2021

> These 4 activations look expensive to me. You can try to set kIsHeavy to true, check that the generated code is different, and compare the performance to verify it.
>
> If you set them to true, the epilogue will not be fully unrolled, which is good for the I$. Usually, complex activations such as GELU need to be set to true.

My patch respects the current choices for the existing epilogues: kIsHeavy = true for GELU, and whatever the default is for Sigmoid, which I assumed to be false:
// If the epilogue functor does not define `kIsHeavy` or if it is `false`, then
// the behavior from CUTLASS 2.5 and before is retained. The epilogue is fully
// unrolled and inlined.
//

For the two new activations, my reasoning was: false for SiLU, because it is just sigmoid + mul, and also false for HardSwish, since it is similar to ReLU.
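A minimal sketch of how a functor opts into the non-unrolled path, per the CUTLASS comment quoted above; the functor itself is hypothetical and only the kIsHeavy member matters here:

struct SomeExpensiveActivationEpilogue {
  // Defining kIsHeavy = true keeps the epilogue loop from being fully unrolled,
  // trading a little loop overhead for less instruction-cache pressure.
  static bool const kIsHeavy = true;

  // ... operator(), parameter structs, etc. ...
};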

hwu36 (Collaborator) commented Dec 18, 2021

SGTM.

kIsHeavy was newly introduced. We benchmarked gelu, but not sigmoid. If you find sigmoid is expensive in the future, you can create a PR and flip the bool.

masahi (Contributor, Author) commented Dec 18, 2021

Making sigmoid and SiLU kIsHeavy = true did improve performance on EfficientNetV2 slightly: 8.58 msec -> 8.26 msec, where 8.58 was the number from my PR apache/tvm#9746, so I pushed that change. The change didn't help when the vectorized variant of sigmoid is used.

hwu36 (Collaborator) commented Dec 18, 2021

> The change didn't help when the vectorized variant of sigmoid is used.

The vectorized one has fewer instructions and puts less burden on the I$.

hwu36 (Collaborator) commented Dec 18, 2021

> Making sigmoid and SiLU kIsHeavy = true did improve performance on EfficientNetV2 slightly: 8.58 msec -> 8.26 msec, where 8.58 was the number from my PR apache/tvm#9746.

You can update your TVM PR now. :)

hwu36 merged commit 0dc3ba6 into NVIDIA:master on Dec 18, 2021