
combined inner outer reduction, add a simple test case #2400

Open · wants to merge 14 commits into devel from llu/ln_backward_merge

Conversation

liqiangxl (Collaborator) commented Feb 1, 2023

Fixes #2399
======= Method if hidden_size > 1024 =======
(1) Inner reduction is a block reduction. The reduction domain is parallelized by TIDx and TIDy; the iteration domain is parallelized by BIDy.
(2) Outer reduction is done in two steps. The first step is a partial reduction: the reduction domain is parallelized by BIDy and the iteration domain is parallelized by TIDx and TIDy. The second step is a block reduction: the reduction domain is parallelized by TIDy and the iteration domain is parallelized by TIDx and BIDy.
======= Method if hidden_size <= 1024 =======
(1) Inner reduction uses multiple reductions per block. The reduction domain is parallelized by TIDx; the iteration domain is parallelized by BIDy and TIDy.
(2) Outer reduction is done in two steps. The first step is a partial reduction: the reduction domain is parallelized by TIDy and the iteration domain is parallelized by TIDx and BIDy. The second step is a block reduction: the reduction domain is parallelized by TIDx and the iteration domain is parallelized by TIDy and BIDy. A sketch of the hidden_size > 1024 mapping follows below.
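
For readers less familiar with the scheduler, the hidden_size > 1024 mapping roughly corresponds to parallelize() calls like the sketch below. This is only an illustration of the description above; the tensor names and axis positions are hypothetical and are not taken from this PR.

  // Illustrative sketch of the hidden_size > 1024 mapping (hypothetical names/axes).
  // (1) Inner reduction: block reduction.
  //     Reduction domain -> TIDx, TIDy; iteration domain -> BIDy.
  inner_reduction_tv->axis(0)->parallelize(ParallelType::BIDy); // iteration (rows)
  inner_reduction_tv->axis(1)->parallelize(ParallelType::TIDy); // reduction (hidden, outer split)
  inner_reduction_tv->axis(2)->parallelize(ParallelType::TIDx); // reduction (hidden, inner split)

  // (2) Outer reduction, step 1: partial reduction.
  //     Reduction domain -> BIDy; iteration domain -> TIDx, TIDy.
  outer_partial_tv->axis(0)->parallelize(ParallelType::BIDy);   // reduction (rows owned by this block)
  outer_partial_tv->axis(1)->parallelize(ParallelType::TIDy);   // iteration (hidden, outer split)
  outer_partial_tv->axis(2)->parallelize(ParallelType::TIDx);   // iteration (hidden, inner split)

  //     Outer reduction, step 2: block reduction of the partial results.
  //     Reduction domain -> TIDy; iteration domain -> TIDx, BIDy.
  outer_final_tv->axis(0)->parallelize(ParallelType::BIDy);     // iteration (hidden, outer split)
  outer_final_tv->axis(1)->parallelize(ParallelType::TIDx);     // iteration (hidden, inner split)
  outer_final_tv->axis(2)->parallelize(ParallelType::TIDy);     // reduction (partial results)
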
======= Performance =======
[Benchmark result tables shown as images in the original PR; not reproduced here.]
d0: batch size (x1024), d1: hidden size (x1024), time unit: microseconds, averaged over 10 times
[Comparison against the latest pt2 (2.0.0a0+git45d775c), grabbed on Feb 17, 2023 from gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:test-core-latest; images not reproduced here.]

liqiangxl requested review from naoyam and zasdfgbnm and removed the review request for zasdfgbnm on February 1, 2023 18:45
naoyam (Collaborator) commented Feb 1, 2023

Overall looks good. Adding a standalone test sounds good.

liqiangxl force-pushed the llu/ln_backward_merge branch 3 times, most recently from f895e3b to 9b4d1b3, on February 8, 2023 18:02
liqiangxl force-pushed the llu/ln_backward_merge branch 5 times, most recently from 9710346 to 65d2358, on February 17, 2023 16:55
liqiangxl marked this pull request as ready for review on February 17, 2023 17:43
naoyam (Collaborator) commented Feb 17, 2023

What is the persistent buffer size in the case of inner and outer reductions? The existing routine to calculate the buffer size, I believe, looks at reduction domains to calculate the minimum required size to do one reduction. Is that the correct size in the case of the combined reduction pattern? For the inner reduction it seems like an overestimation, as not all of the reduction domains are actually reduced for the inner reduction.

naoyam (Collaborator) commented Feb 17, 2023

I'm also a little concerned about whether this schedule should always be used for any inner and outer persistent sizes. I was expecting to see some changes to canScheduleRuntime(). Have you checked some extreme cases?

naoyam (Collaborator) left a comment

Some quick comments

(Resolved review threads on third_party/nvfuser/csrc/scheduler/registry.cpp, reduction_utils.cpp, normalization.cpp, and reduction_heuristic.h — outdated, hidden)
liqiangxl (Collaborator, Author) commented

What is the persistent buffer size in the case of inner and outer reductions? The existing routine to calculate the buffer size, I believe, looks at reduction domains to calculate the minimum required size to do one reduction. Is that the correct size in the case of the combined reduction pattern? For the inner reduction it seems like an overestimation, as not all of the reduction domains are actually reduced for the inner reduction.

The current persistent buffer detection can't detect the buffer for the partial result of the outer reduction, which is actually a large part of the total persistent buffers.
In the current branch:
For input DataType::Half, the persistent buffers are projected to three inputs (dy, x, weight); the total size is 3 * sizeof(half) * dim1.
For input DataType::Float, the persistent buffers are NOT projected; they are xhat and d_xhat, and the total size is 2 * sizeof(float) * dim1.
I also tried disabling projection for input DataType::Half; the time increased from 123 us to 203 us. But if I enforce projection for input DataType::Float, there is a significant speedup, e.g. for case 2048 x 10240 the time is reduced from 274 us to 207 us, and for case 2048 x 1024 from 39 us to 36 us. The reason is that weight is shared across different rows: if we keep it persistent, we don't need to reload it in the iteration over different rows. The projected version needs more registers per thread, but it doesn't reduce the occupancy ratio because all the blocks must be active at the same time for this grid-persistent kernel.
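
As a quick sanity check of those two formulas (my arithmetic, using dim1 = 10240 from the 2048 x 10240 case above):

  // projected, Half inputs:    3 * sizeof(half)  * dim1 = 3 * 2 * 10240 = 61,440 bytes
  // not projected, Float:      2 * sizeof(float) * dim1 = 2 * 4 * 10240 = 81,920 bytes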

So in the revised version, I enforced the projection and the persistent buffers are:

  weight: float T37[((10 * 1) * 4)];
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T47;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T43;
  for(nvfuser_index_t i378 = 0; i378 < (ceilDiv(2048, ((nvfuser_index_t)gridDim.y))); ++i378) {
    dy: float T33[((10 * 1) * 4)];
    x : float T34[((10 * 1) * 4)];
  }

As a comparison, if not projected (236 registers per thread, vs. 248 registers per thread if projected):

  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T38;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T34;
  for(nvfuser_index_t i323 = 0; i323 < (ceilDiv(2048, ((nvfuser_index_t)gridDim.y))); ++i323) {
    xhat: float T7[((10 * 1) * 4)];
    d_xhat: float T9[((10 * 1) * 4)];
  }

naoyam (Collaborator) commented Feb 22, 2023

What is the persistent buffer size in the case of inner and outer reductions? The existing routine to calculate the buffer size, I believe, looks at reduction domains to calculate the minimum required size to do one reduction. Is that the correct size in the case of the combined reduction pattern? For the inner reduction it seems like an overestimation, as not all of the reduction domains are actually reduced for the inner reduction.

The current persistent buffer detection can't detect the buffer for the partial result of the outer reduction, which is actually a large part of the total persistent buffers.

What do you mean it can't detect the buffer? Does it underestimate the buffer size then?

In the current branch: For input DataType::Half, the persistent buffers are projected to three inputs (dy, x, weight); the total size is 3 * sizeof(half) * dim1. For input DataType::Float, the persistent buffers are NOT projected; they are xhat and d_xhat, and the total size is 2 * sizeof(float) * dim1. I also tried disabling projection for input DataType::Half; the time increased from 123 us to 203 us. But if I enforce projection for input DataType::Float, there is a significant speedup, e.g. for case 2048 x 10240 the time is reduced from 274 us to 207 us, and for case 2048 x 1024 from 39 us to 36 us. The reason is that weight is shared across different rows: if we keep it persistent, we don't need to reload it in the iteration over different rows. The projected version needs more registers per thread, but it doesn't reduce the occupancy ratio because all the blocks must be active at the same time for this grid-persistent kernel.

So, when the input is Float and is projected, the weight tensor also becomes persistent and improves the performance? I don't think this is an intended consequence of the buffer projection, so we should understand why and consider whether it could be made more generic.

liqiangxl (Collaborator, Author) commented

I mean that when calculating the persistent buffer size, the code can't detect the following two buffers. So yes, there is an underestimation.

  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T38;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T34;

Enforcing projection to the inputs can improve performance because weight is shared across different rows. If we keep it persistent, we don't need to reload it in the iteration over different rows.

naoyam (Collaborator) commented Feb 23, 2023

I mean that when calculating the persistent buffer size, the code can't detect the following two buffers. So yes, there is an underestimation.

  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T38;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T34;

Don't we need to account for the buffers to decide if the persistent scheduler should be picked? This is why I also asked:

I'm also a little concerned about whether this schedule should always be used for any inner and outer persistent sizes. I was expecting to see some changes to canScheduleRuntime(). Have you checked some extreme cases?

Enforcing projection to the inputs can improve performance because weight is shared across different rows. If we keep it persistent, we don't need to reload it in the iteration over different rows.

Right, so let me rephrase my question. Is this specific to this inner-outer scheduler? If not, how can we make it more general? Should we always use projected buffers?

liqiangxl (Collaborator, Author) commented

I mean that when calculating the persistent buffer size, the code can't detect the following two buffers. So yes, there is an underestimation.

  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T38;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T34;

Don't we need to account for the buffers to decide if the persistent scheduler should be picked? This is why I also asked:

I'm also a little concerned about whether this schedule should always be used for any inner and outer persistent sizes. I was expecting to see some changes to canScheduleRuntime(). Have you checked some extreme cases?

Enforcing projection to the inputs can improve performance because weight is shared across different rows. If we keep it persistent, we don't need to reload it in the iteration over different rows.

Right, so let me rephrase my question. Is this specific to this inner-outer scheduler? If not, how can we make it more general? Should we always use projected buffers?

I checked all the benchmarks and there are regressions if the hidden size is >= 16K with float. In the revised branch, the buffer size of the outer reductions is added:

  persistent_buffer_size += scheduler_utils::partialReductionBufferSize(
      outer_reduction_tvs, runtime_info);

To allow the use of this combined approach, available_persistent_buffer_size is increased from half of all the registers to all of the 64K registers. This allows float with hidden size <= 14K to pass the canScheduleRunTime check. This leads to register spills, but the performance is still faster than the segmented version (860 GB/s vs. 660 GB/s). To avoid register spills, multiple blocks per row should be used. This is probably not urgent since popular hidden sizes are usually <= 10K.
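
A minimal sketch of how this accounting could look inside canScheduleRunTime, assuming the partialReductionBufferSize call quoted above; everything else (variable names, surrounding structure) is hypothetical and is not the PR's actual code:

  // Hypothetical sketch of the size check described above.
  constexpr int64_t kTotalRegistersPerSM = 64 * 1024; // "all the 64K registers"
  constexpr int64_t kBytesPerRegister = 4;

  // Regular persistent buffers (inner reduction, projected inputs, ...).
  int64_t persistent_buffer_size = regular_persistent_buffer_size;

  // Add the partial results of the outer reduction, which the regular
  // persistent-buffer analysis does not detect.
  persistent_buffer_size += scheduler_utils::partialReductionBufferSize(
      outer_reduction_tvs, runtime_info);

  // Previously limited to half the register file; raised to the full register
  // file so that float with hidden size <= 14K can still take the combined scheduler.
  const int64_t available_persistent_buffer_size =
      kTotalRegistersPerSM * kBytesPerRegister;

  if (persistent_buffer_size > available_persistent_buffer_size) {
    return false; // fall back to the segmented scheduler
  }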

naoyam (Collaborator) commented Mar 8, 2023

I mean that when calculating the persistent buffer size, the code can't detect the following two buffers. So yes, there is an underestimation.

  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T38;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T34;

Don't we need to account for the buffers to decide if the persistent scheduler should be picked? This is why I also asked:

I'm also a little concerned about whether this schedule should always be used for any inner and outer persistent sizes. I was expecting to see some changes to canScheduleRuntime(). Have you checked some extreme cases?

Enforcing projection to the inputs can improve performance because weight is shared across different rows. If we keep it persistent, we don't need to reload it in the iteration over different rows.

Right, so let me rephrase my question. Is this specific to this inner-outer scheduler? If not, how can we make it more general? Should we always use projected buffers?

I checked all the benchmarks and there are regressions if the hidden size is >= 16K with float. In the revised branch, the buffer size of the outer reductions is added: persistent_buffer_size += scheduler_utils::partialReductionBufferSize(outer_reduction_tvs, runtime_info); To allow the use of this combined approach, available_persistent_buffer_size is increased from half of all the registers to all of the 64K registers. This allows float with hidden size <= 14K to pass the canScheduleRunTime check. This leads to register spills, but the performance is still faster than the segmented version (860 GB/s vs. 660 GB/s). To avoid register spills, multiple blocks per row should be used. This is probably not urgent since popular hidden sizes are usually <= 10K.

Thanks for checking the performance. The size check makes sense to me. Register usage and its perf impact are difficult to predict, but this looks like a reasonable heuristic. I'll revisit and review the PR.

Have you also run the benchmarks on other devices such as V100?

liqiangxl (Collaborator, Author) commented Mar 8, 2023

I checked all the benchmarks and there are regressions if the hidden size is >= 16K with float.

This is for both V100 and A100.

(Resolved review threads on third_party/nvfuser/csrc/scheduler/utils.h, utils.cpp, reduction_utils.h, and reduction_utils.cpp — outdated, hidden)
naoyam (Collaborator) left a comment

I gave up reviewing normalization.cpp. Please, please add more comments.

(Resolved review threads on third_party/nvfuser/csrc/scheduler/reduction_heuristic.h and reduction_utils.cpp — outdated, hidden)
@@ -140,7 +144,13 @@ TensorView* scheduleReductionTV(
outer_unroll(outer_i++, rparams.unroll_factor_inner_reduction);
}

reduction_tv->axis(outer_i)->parallelize(rparams.block_dim_inner_reduction);
if (rparams.combined_inner_outer && !rparams.multiple_reds_per_blk) {
naoyam (Collaborator):

In the above, there's an additional condition, rparams.lparams.bdimx() > 1. Why is it not used here?

naoyam (Collaborator):

Also, the code would be easier to understand if there were only one place that uses rparams.block_dim_inner_reduction (within this if block for rparams.persistent_kernel).

liqiangxl (Collaborator, Author):

In the above, there's an additional condition, rparams.lparams.bdimx() > 1. Why is it not used here?

This condition shouldn't exist. I can't remember why it was there originally. Maybe I was doing some debugging.

liqiangxl (Collaborator, Author):

Also, the code would be easier to understand if there were only one place that uses rparams.block_dim_inner_reduction (within this if block for rparams.persistent_kernel).

rparams.block_dim_inner_reduction is used to split the reduction dim by NamedScalar::getParallelDim(ptype) for the combined reduction with a single reduction per block. For a regular reduction, it is only used to parallelize the reduction dim. So it appears twice in the code.

naoyam (Collaborator):

I was wondering if this could also work:

Suggested change (replacing the line "if (rparams.combined_inner_outer && !rparams.multiple_reds_per_blk) {" above):

  // delete the above use of block_dim_inner_reduction
  ...
  if (rparams.combined_inner_outer && !rparams.multiple_reds_per_blk) {
    inner_parallel(outer_i, rparams.block_dim_inner_reduction);
    reduction_tv->axis(outer_i)->parallelize(
        rparams.block_dim_inner_reduction_extra);
  } else {
    reduction_tv->axis(outer_i)->parallelize(
        rparams.block_dim_inner_reduction);
  }

This way, I think it's more apparent how we use block_dim_inner_reduction and block_dim_inner_reduction_extra.

(Resolved review threads on third_party/nvfuser/csrc/scheduler/normalization.cpp — outdated, hidden)
naoyam (Collaborator) left a comment

Just comments on hasSharedConsumerNonReductionProducer

liqiangxl (Collaborator, Author) commented

#2400 (comment)

  if (rparams.vectorization_factor_tmp_gmem_read > 1) {
    for (auto tv_pair : cached_outputs) {
      if (tv_pair.second->axis(-1)->getParallelType() !=
          ParallelType::Vectorize) {
        tv_pair.second->axis(-1)->parallelize(ParallelType::Vectorize);
      }
    }
  }

The above code has been removed in the revised version. All the cached_outputs are correctly vectorized once vectorization is set correctly in scheduleReductionCombinedOuter; previously it was set to unroll.
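
For context, a rough sketch of what setting vectorization in scheduleReductionCombinedOuter could look like; the tensor handle below is hypothetical, since the actual function body is not shown in this thread:

  // Hypothetical sketch: vectorize the innermost axis of the cached gmem write
  // directly when scheduling the combined outer reduction, instead of unrolling
  // it and fixing up the parallel type afterwards (the removed code above).
  if (rparams.vectorization_factor_tmp_gmem_read > 1) {
    cached_gmem_tv->axis(-1)->parallelize(ParallelType::Vectorize);
  } else {
    cached_gmem_tv->axis(-1)->parallelize(ParallelType::Unroll);
  }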

naoyam (Collaborator) left a comment

Finally approved! 👏

Great work, Liqiang. This was complicated scheduling, and I believe the PR was able to cleanly integrate it into the existing scheduler.

Please rerun the benchmarks and make sure all the performance improvements of LN backward are still there. Then we need to port this to the new repo. I think there won't be significant conflicts, as this PR is mostly on the persistent scheduler and I don't think there has been any recent major change. If possible, it would be great if you could keep a history of the changes you need to make for the porting, but I'm not sure whether that's easily done.

liqiangxl mentioned this pull request on Apr 22, 2023
naoyam mentioned this pull request on May 23, 2023
Successfully merging this pull request may close these issues.

combined inner outer reduction used in layer norm backward