[TOPI] Add proper scheduling for dense on CUDA #3923
Conversation
@comaniac please look into the CI error.
Could you also add a fallback config for AutoTVM?
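For reference, a minimal sketch of what such a fallback might look like with AutoTVM's config API (the helper name and the split factors here are assumptions for illustration, not the values this PR ends up using):

```python
from tvm.autotvm.task.space import SplitEntity

def _define_tile_k(cfg, in_dim):
    # declare the knob AutoTVM is allowed to tune
    cfg.define_split("tile_k", in_dim, num_outputs=2)
    # cfg.is_fallback is True when no tuned record exists for this workload;
    # supply a default split so the schedule is still valid without tuning
    if cfg.is_fallback:
        # assumed default: inner reduction factor of 64 when it divides evenly
        cfg["tile_k"] = SplitEntity([-1, 64] if in_dim % 64 == 0 else [-1, 1])
```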
@vinx13 It seems a test case from NNVM failed. Will fix it soon. Thanks!
The above commit:
lgtm
Force-pushed from 28113b9 to c5ddc78
Force-pushed from c5ddc78 to a7b605f
Force-pushed from 38cdeb6 to f934a48
The CI error is unrelated to this PR, and the tests pass locally. Re-running without changes.
@comaniac This is merged, thanks!
* add proper scheduling for dense on CUDA
* add fallback config and fix unit test
* fix corner cases
* refactoring
* fix bias and add testcase
* let fusion happen
@icemelon9 please review this PR, which adds proper scheduling for the dense op on CUDA; the original scheduling was too simple to achieve reasonable performance.
The added scheduling was adapted from topi/recipe/gemm/cuda_gemm_square.py and achieves high performance (6 TFlop/s on a 2048x2048 dense matrix after AutoTVM tuning). For small-batch (<64) dense, we still use the original scheduling but add a parameter (tile_k) for AutoTVM to tune (~370 GFlop/s for batch-size-1 dense).
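To make the tile_k knob concrete, here is a minimal sketch of a small-batch dense schedule in the style of topi/cuda (a paraphrase under my own assumptions about names and structure, not necessarily the exact code in this PR):

```python
import tvm
from topi.util import get_const_tuple

def _schedule_dense_small_batch(cfg, s, C):
    """Sketch: split the reduction axis by the tunable tile_k knob and
    reduce across threads with rfactor (suitable when batch < 64)."""
    A, _ = C.op.input_tensors
    _, in_dim = get_const_tuple(A.shape)

    # the knob AutoTVM tunes: how to split the reduction (k) axis
    cfg.define_split("tile_k", in_dim, num_outputs=2)

    ko, _ = cfg["tile_k"].apply(s, C, C.op.reduce_axis[0])
    CF = s.rfactor(C, ko)

    # one thread block per output element; threads cooperate on the reduction
    s[C].bind(s[C].op.axis[0], tvm.thread_axis("blockIdx.y"))
    s[C].bind(s[C].op.axis[1], tvm.thread_axis("blockIdx.x"))
    tx = s[C].op.reduce_axis[0]
    thread_x = tvm.thread_axis("threadIdx.x")
    s[C].bind(tx, thread_x)
    s[CF].compute_at(s[C], tx)
    # only the first thread writes out the final result
    s[C].set_store_predicate(thread_x.var.equal(0))
```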
One reason to have separate scheduling for different batch sizes is that I encountered invalid CUDA kernel errors when applying the high-performance scheduling to small batches. I suspect this is caused by invalid splits, but I am not entirely sure. Comments and suggestions for improvement are welcome.
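For illustration, dispatching between the two schedules could look roughly like this (the threshold value and the helper names are assumptions, not necessarily what the PR uses):

```python
from topi.util import get_const_tuple

def _schedule_dense(cfg, s, C):
    # pick a schedule based on the batch dimension of the output
    batch, _ = get_const_tuple(C.shape)
    if batch < 64:
        # rfactor-based schedule with the tile_k knob (small batch)
        _schedule_dense_small_batch(cfg, s, C)
    else:
        # tiled GEMM schedule adapted from cuda_gemm_square.py (large batch);
        # hypothetical helper name, shown only to illustrate the dispatch
        _schedule_dense_large_batch(cfg, s, C)
```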