combined inner outer reduction, add a simple test case #2400
base: devel
Conversation
Overall looks good. Adding a standalone test sounds good.
Force-pushed from f895e3b to 9b4d1b3.
Force-pushed from 9710346 to 65d2358.
What is the persistent buffer size in the case of combined inner and outer reductions? The existing routine to calculate the buffer size, I believe, looks at the reduction domains to calculate the minimum size required to do one reduction. Is that the correct size in the case of the combined reduction pattern? For the inner reduction it seems like an overestimation, as not all of the reduction domains are actually reduced by the inner reduction.
I'm also a little concerned about whether this schedule should always be used for any inner and outer persistent sizes. I was expecting to see some changes to ...
Some quick comments
The current persistent buffer detection can't detect the buffer for the partial result of the outer reduction, which is actually a large part of the total persistent buffers. So in the revised version, I enforced the projection, and the persistent buffers are:
As a comparison, if not projected it is 236 registers per thread (248 registers per thread if projected):
What do you mean it can't detect the buffer? Does it underestimate the buffer size then?
So, when input is ...
I mean that when calculating the persistent buffer size, the code can't detect the following two buffers. So yes, there is an underestimation.
Enforcing projection to the input can improve performance because ...
Don't we need to account for the buffers to decide if the persistent scheduler should be picked? This is why I also asked:
Right, so let me rephrase my question. Is this specific to this inner-outer scheduler? If not, how can we make it more general? Should we always use projected buffers?
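For context, here is a minimal standalone C++ sketch of the accounting discussed in this thread, assuming a layer-norm-backward-like fusion with an fp16 input and fp32 intermediates. The element counts and the formula are illustrative assumptions only, not the scheduler's actual detection routine.

#include <cstdint>
#include <cstdio>

// Rough per-row persistent-buffer estimate for the combined schedule.
// Illustrative assumptions only, not nvfuser's buffer detection.
int64_t persistentBytes(int64_t hidden_size, bool project_to_input) {
  const int64_t input_bytes = 2;    // assumed fp16 fusion input
  const int64_t compute_bytes = 4;  // assumed fp32 intermediates
  // Inner reduction: one row stays live; projecting to the input keeps the
  // fp16 input persistent instead of an fp32 intermediate.
  const int64_t inner_buffer =
      hidden_size * (project_to_input ? input_bytes : compute_bytes);
  // Outer reduction: its partial results also stay live; these are the
  // buffers the discussion above says the current detection does not count.
  const int64_t outer_partial = hidden_size * compute_bytes;
  return inner_buffer + outer_partial;
}

int main() {
  const int64_t hidden = 16 * 1024;
  std::printf("not projected: %lld bytes, projected: %lld bytes\n",
              (long long)persistentBytes(hidden, false),
              (long long)persistentBytes(hidden, true));
  return 0;
}

Whatever the exact formula, the point of the exchange above is that the outer-reduction partials need to be part of this total when deciding whether the persistent path fits.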
Force-pushed from bc67f13 to 105d93a.
I checked all the benchmarks and there are regressions if the hidden size is >= 16K with float. In the revised branch, the buffer size of the outer reductions is added:
Force-pushed from 105d93a to 19a45f6.
Thanks for checking the performance. The size check makes sense to me. Register usage and its perf impact are difficult to predict, but it looks like a reasonable heuristic. I'll revisit and review the PR. Have you also run the benchmarks on other devices such as V100?
This is for both V100 and A100.
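A hypothetical sketch of the kind of size-based guard the size check above could correspond to; the function name, threshold, and formula are assumptions for illustration, not the PR's actual heuristic.

#include <cstdint>

// Prefer the combined inner-outer persistent schedule only while the combined
// buffers (inner persistent row plus outer-reduction partials) stay within an
// assumed on-chip budget. Threshold and formula are illustrative assumptions.
bool preferCombinedInnerOuter(int64_t hidden_size, int64_t dtype_bytes) {
  const int64_t combined_buffer_bytes = 2 * hidden_size * dtype_bytes;
  const int64_t assumed_budget_bytes = 128 * 1024;
  return combined_buffer_bytes < assumed_budget_bytes;
}

With 4-byte floats this example guard would stop preferring the combined schedule once hidden_size reaches 16K (2 * 16384 * 4 = 128 KB), which lines up with the regression boundary reported above.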
Force-pushed from 8c07f20 to 1d4cab9.
I gave up reviewing normalization.cpp. Please, please add more comments.
@@ -140,7 +144,13 @@ TensorView* scheduleReductionTV(
     outer_unroll(outer_i++, rparams.unroll_factor_inner_reduction);
   }

   reduction_tv->axis(outer_i)->parallelize(rparams.block_dim_inner_reduction);
   if (rparams.combined_inner_outer && !rparams.multiple_reds_per_blk) {
In the above, there's an additional condition, rparams.lparams.bdimx() > 1. Why is it not used here?
Also, the code would look easier to understand if there's only one place that uses rparams.block_dim_inner_reduction (within this if block for rparams.persistent_kernel).
In the above, there's an additional condition, rparams.lparams.bdimx() > 1. Why is it not used here?
This condition shouldn't exist. I can't remember why it was there originally. Maybe I was doing some debugging.
Also, the code would look easier to understand if there's only one place that uses rparams.block_dim_inner_reduction (within this if block for rparams.persistent_kernel).
rparams.block_dim_inner_reduction is used to split the reduction dim by NamedScalar::getParallelDim(ptype) for the combined reduction with a single reduction per block. For a regular reduction, it is only used to parallelize the reduction dim. So it will appear twice in the code.
I was wondering if this could also work:
if (rparams.combined_inner_outer && !rparams.multiple_reds_per_blk) {
  // delete the above use of block_dim_inner_reduction
  ...
if (rparams.combined_inner_outer && !rparams.multiple_reds_per_blk) {
  inner_parallel(outer_i, rparams.block_dim_inner_reduction);
  reduction_tv->axis(outer_i)->parallelize(
      rparams.block_dim_inner_reduction_extra);
} else {
  reduction_tv->axis(outer_i)->parallelize(
      rparams.block_dim_inner_reduction);
}
This way, I think it's more apparent how we use block_dim_inner_reduction and block_dim_inner_reduction_extra.
Just comments on hasSharedConsumerNonReductionProducer
Force-pushed from 8f67725 to be11c7d.
The above code has been removed in the revised version. All the cached_outputs are correctly vectorized if I correctly set the vectorization in ...
Finally approved! 👏
Great work, Liqiang. This was complicated scheduling, and I believe the PR was able to cleanly integrate it into the existing scheduler.
Please rerun the benchmarks and make sure all the performance improvements of LN backward are still there. Then we need to port this to the new repo. I think there won't be significant conflicts as this PR is mostly on the persistent scheduler, and I don't think there's any recent major change. If it's possible, it would be great if you could keep the history of changes you'd need to do for the porting, but I'm not sure if that's easily done.
Fixes #2399
======= Method if hidden_size > 1024 =======
(1) Inner reduction is a block reduction. The reduction domain is parallelized by TIDx and TIDy; the iteration domain is parallelized by BIDy.
(2) Outer reduction is done in two steps. The first step is a partial reduction: the reduction domain is parallelized by BIDy and the iteration domain is parallelized by TIDx and TIDy. The second step is a block reduction: the reduction domain is parallelized by TIDy and the iteration domain is parallelized by TIDx and BIDy.
======= Method if hidden_size <= 1024 =======
(1) Inner reduction does multiple reductions per block. The reduction domain is parallelized by TIDx; the iteration domain is parallelized by BIDy and TIDy.
(2) Outer reduction is done in two steps. The first step is a partial reduction: the reduction domain is parallelized by TIDy and the iteration domain is parallelized by TIDx and BIDy. The second step is a block reduction: the reduction domain is parallelized by TIDx and the iteration domain is parallelized by TIDy and BIDy. (A serial sketch of this two-step outer reduction follows below.)
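A minimal serial C++ sketch (not nvfuser code) of the two-step outer reduction described in (2): each "block" first accumulates a partial result over its slice of the iteration domain, and a second step then reduces the per-block partials into the final per-column values. Sizes and names are toy assumptions.

#include <cstdio>
#include <vector>

int main() {
  const int rows = 8, cols = 4, num_blocks = 2;  // toy sizes
  std::vector<float> x(rows * cols, 1.0f);       // input, all ones

  // Step 1: partial outer reduction; each "block" (BIDy slice) reduces its
  // own chunk of rows into one partial row of the workspace.
  std::vector<float> partial(num_blocks * cols, 0.0f);
  const int rows_per_block = rows / num_blocks;
  for (int b = 0; b < num_blocks; ++b) {
    for (int r = b * rows_per_block; r < (b + 1) * rows_per_block; ++r) {
      for (int c = 0; c < cols; ++c) {
        partial[b * cols + c] += x[r * cols + c];
      }
    }
  }

  // Step 2: reduce the per-block partials into the final per-column result.
  std::vector<float> out(cols, 0.0f);
  for (int b = 0; b < num_blocks; ++b) {
    for (int c = 0; c < cols; ++c) {
      out[c] += partial[b * cols + c];
    }
  }

  for (int c = 0; c < cols; ++c) {
    std::printf("out[%d] = %g\n", c, out[c]);  // expect 8 for every column
  }
  return 0;
}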
======= Performance =======
d0: batch size (x1024), d1: hidden size (x1024); time unit: microseconds, averaged over 10 runs
------------- latest pt2 (2.0.0a0+git45d775c), grabbed on Feb 17, 2023 from gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:test-core-latest -------------