
combined inner outer reduction, add a simple test case #2400

Open · wants to merge 14 commits into devel from llu/ln_backward_merge

Conversation

liqiangxl (Collaborator) commented Feb 1, 2023

Fixes #2399
======= Method if hidden_size > 1024 =======
(1) Inner reduction is a block reduction. The reduction domain is parallelized by TIDx and TIDy; the iteration domain is parallelized by BIDy.
(2) Outer reduction is done in two steps. The first step is a partial reduction: the reduction domain is parallelized by BIDy and the iteration domain is parallelized by TIDx and TIDy. The second step is a block reduction: the reduction domain is parallelized by TIDy and the iteration domain is parallelized by TIDx and BIDy.
======= Method if hidden_size <= 1024 =======
(1) Inner reduction uses multiple reductions per block. The reduction domain is parallelized by TIDx; the iteration domain is parallelized by BIDy and TIDy.
(2) Outer reduction is done in two steps. The first step is a partial reduction: the reduction domain is parallelized by TIDy and the iteration domain is parallelized by TIDx and BIDy. The second step is a block reduction: the reduction domain is parallelized by TIDx and the iteration domain is parallelized by TIDy and BIDy. A sketch of the hidden_size > 1024 mapping follows below.
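
For readers less familiar with the scheduler, the hidden_size > 1024 mapping roughly corresponds to parallelize() calls like the sketch below. This is only an illustration of the description above; the tensor names and axis positions are hypothetical and are not taken from this PR.

  // Illustrative sketch of the hidden_size > 1024 mapping (hypothetical names/axes).
  // (1) Inner reduction: block reduction.
  //     Reduction domain -> TIDx, TIDy; iteration domain -> BIDy.
  inner_reduction_tv->axis(0)->parallelize(ParallelType::BIDy); // iteration (rows)
  inner_reduction_tv->axis(1)->parallelize(ParallelType::TIDy); // reduction (hidden, outer split)
  inner_reduction_tv->axis(2)->parallelize(ParallelType::TIDx); // reduction (hidden, inner split)

  // (2) Outer reduction, step 1: partial reduction.
  //     Reduction domain -> BIDy; iteration domain -> TIDx, TIDy.
  outer_partial_tv->axis(0)->parallelize(ParallelType::BIDy);   // reduction (rows owned by this block)
  outer_partial_tv->axis(1)->parallelize(ParallelType::TIDy);   // iteration (hidden, outer split)
  outer_partial_tv->axis(2)->parallelize(ParallelType::TIDx);   // iteration (hidden, inner split)

  //     Outer reduction, step 2: block reduction of the partial results.
  //     Reduction domain -> TIDy; iteration domain -> TIDx, BIDy.
  outer_final_tv->axis(0)->parallelize(ParallelType::BIDy);     // iteration (hidden, outer split)
  outer_final_tv->axis(1)->parallelize(ParallelType::TIDx);     // iteration (hidden, inner split)
  outer_final_tv->axis(2)->parallelize(ParallelType::TIDy);     // reduction (partial results)
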
======= Performance =======
[Benchmark result tables shown as images in the original PR; not reproduced here.]
d0: batch size (x1024), d1: hidden size (x1024), time unit: microseconds, averaged over 10 times
[Comparison against the latest pt2 (2.0.0a0+git45d775c), grabbed on Feb 17, 2023 from gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:test-core-latest; images not reproduced here.]

liqiangxl requested review from naoyam and zasdfgbnm and removed the review request for zasdfgbnm on February 1, 2023 18:45
naoyam (Collaborator) commented Feb 1, 2023

Overall looks good. Adding a standalone test sounds good.

liqiangxl force-pushed the llu/ln_backward_merge branch 3 times, most recently from f895e3b to 9b4d1b3, on February 8, 2023 18:02
liqiangxl force-pushed the llu/ln_backward_merge branch 5 times, most recently from 9710346 to 65d2358, on February 17, 2023 16:55
liqiangxl marked this pull request as ready for review on February 17, 2023 17:43
naoyam (Collaborator) commented Feb 17, 2023

What is the persistent buffer size in the case of inner and outer reductions? The existing routine to calculate the buffer size, I believe, looks at reduction domains to calculate the minimum required size to do one reduction. Is that the correct size in the case of the combined reduction pattern? For the inner reduction it seems like an overestimation, as not all of the reduction domains are actually reduced for the inner reduction.

naoyam (Collaborator) commented Feb 17, 2023

I'm also a little concerned about whether this schedule should always be used for any inner and outer persistent sizes. I was expecting to see some changes to canScheduleRuntime(). Have you checked some extreme cases?

naoyam (Collaborator) left a comment

Some quick comments

(Resolved review threads on third_party/nvfuser/csrc/scheduler/registry.cpp, reduction_utils.cpp, normalization.cpp, and reduction_heuristic.h — outdated, hidden)
liqiangxl (Collaborator, Author) commented

What is the persistent buffer size in the case of inner and outer reductions? The existing routine to calculate the buffer size, I believe, looks at reduction domains to calculate the minimum required size to do one reduction. Is that the correct size in the case of the combined reduction pattern? For the inner reduction it seems like an overestimation, as not all of the reduction domains are actually reduced for the inner reduction.

The current persistent buffer detection can't detect the buffer for the partial result of the outer reduction, which is actually a large part of the total persistent buffers.
In the current branch:
For input DataType::Half, the persistent buffers are projected to three inputs (dy, x, weight); the total size is 3 * sizeof(half) * dim1.
For input DataType::Float, the persistent buffers are NOT projected; they are xhat and d_xhat, and the total size is 2 * sizeof(float) * dim1.
I also tried disabling projection for input DataType::Half; the time increased from 123 us to 203 us. But if I enforce projection for input DataType::Float, there is a significant speedup, e.g. for case 2048 x 10240 the time is reduced from 274 us to 207 us, and for case 2048 x 1024 from 39 us to 36 us. The reason is that weight is shared across different rows: if we keep it persistent, we don't need to reload it in the iteration over different rows. The projected version needs more registers per thread, but it doesn't reduce the occupancy ratio because all the blocks must be active at the same time for this grid-persistent kernel.
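
As a quick sanity check of those two formulas (my arithmetic, using dim1 = 10240 from the 2048 x 10240 case above):

  // projected, Half inputs:    3 * sizeof(half)  * dim1 = 3 * 2 * 10240 = 61,440 bytes
  // not projected, Float:      2 * sizeof(float) * dim1 = 2 * 4 * 10240 = 81,920 bytes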

So in the revised version, I enforced the projection and the persistent buffers are:

  weight: float T37[((10 * 1) * 4)];
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T47;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T43;
  for(nvfuser_index_t i378 = 0; i378 < (ceilDiv(2048, ((nvfuser_index_t)gridDim.y))); ++i378) {
    dy: float T33[((10 * 1) * 4)];
    x : float T34[((10 * 1) * 4)];
  }

As a comparison, if not projected (236 registers per thread, vs. 248 registers per thread if projected):

  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T38;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T34;
  for(nvfuser_index_t i323 = 0; i323 < (ceilDiv(2048, ((nvfuser_index_t)gridDim.y))); ++i323) {
    xhat: float T7[((10 * 1) * 4)];
    d_xhat: float T9[((10 * 1) * 4)];
  }

naoyam (Collaborator) commented Feb 22, 2023

What is the persistent buffer size in the case of inner and outer reductions? The existing routine to calculate the buffer size, I believe, looks at reduction domains to calculate the minimum required size to do one reduction. Is that the correct size in the case of the combined reduction pattern? For the inner reduction it seems like an overestimation, as not all of the reduction domains are actually reduced for the inner reduction.

The current persistent buffer detection can't detect the buffer for the partial result of the outer reduction, which is actually a large part of the total persistent buffers.

What do you mean it can't detect the buffer? Does it underestimate the buffer size then?

In the current branch: For input DataType::Half, the persistent buffers are projected to three inputs (dy, x, weight); the total size is 3 * sizeof(half) * dim1. For input DataType::Float, the persistent buffers are NOT projected; they are xhat and d_xhat, and the total size is 2 * sizeof(float) * dim1. I also tried disabling projection for input DataType::Half; the time increased from 123 us to 203 us. But if I enforce projection for input DataType::Float, there is a significant speedup, e.g. for case 2048 x 10240 the time is reduced from 274 us to 207 us, and for case 2048 x 1024 from 39 us to 36 us. The reason is that weight is shared across different rows: if we keep it persistent, we don't need to reload it in the iteration over different rows. The projected version needs more registers per thread, but it doesn't reduce the occupancy ratio because all the blocks must be active at the same time for this grid-persistent kernel.

So, when the input is Float and is projected, the weight tensor also becomes persistent and improves the performance? I don't think this is an intended consequence of the buffer projection, so we should understand why and consider whether it could be made more generic.

liqiangxl (Collaborator, Author) commented

I mean that when calculating the persistent buffer size, the code can't detect the following two buffers. So yes, there is an underestimation.

  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T38;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T34;

Enforcing projection to the inputs can improve performance because weight is shared across different rows. If we keep it persistent, we don't need to reload it in the iteration over different rows.

naoyam (Collaborator) commented Feb 23, 2023

I mean that when calculating the persistent buffer size, the code can't detect the following two buffers. So yes, there is an underestimation.

  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T38;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T34;

Don't we need to account for the buffers to decide if the persistent scheduler should be picked? This is why I also asked:

I'm also a little concerned about whether this schedule should always be used for any inner and outer persistent sizes. I was expecting to see some changes to canScheduleRuntime(). Have you checked some extreme cases?

Enforcing projection to the inputs can improve performance because weight is shared across different rows. If we keep it persistent, we don't need to reload it in the iteration over different rows.

Right, so let me rephrase my question. Is this specific to this inner-outer scheduler? If not, how can we make it more general? Should we always use projected buffers?

liqiangxl (Collaborator, Author) commented

I mean that when calculating the persistent buffer size, the code can't detect the following two buffers. So yes, there is an underestimation.

  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T38;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T34;

Don't we need to account for the buffers to decide if the persistent scheduler should be picked? This is why I also asked:

I'm also a little concerned about whether this schedule should always be used for any inner and outer persistent sizes. I was expecting to see some changes to canScheduleRuntime(). Have you checked some extreme cases?

Enforcing projection to the inputs can improve performance because weight is shared across different rows. If we keep it persistent, we don't need to reload it in the iteration over different rows.

Right, so let me rephrase my question. Is this specific to this inner-outer scheduler? If not, how can we make it more general? Should we always use projected buffers?

I checked all the benchmarks and there are regressions if the hidden size is >= 16K with float. In the revised branch, the buffer size of the outer reductions is added:

  persistent_buffer_size += scheduler_utils::partialReductionBufferSize(
      outer_reduction_tvs, runtime_info);

To allow the use of this combined approach, available_persistent_buffer_size is increased from half of all the registers to all of the 64K registers. This allows float with hidden size <= 14K to pass the canScheduleRunTime check. This leads to register spills, but the performance is still faster than the segmented version (860 GB/s vs. 660 GB/s). To avoid register spills, multiple blocks per row should be used. This is probably not urgent since popular hidden sizes are usually <= 10K.
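
A minimal sketch of how this accounting could look inside canScheduleRunTime, assuming the partialReductionBufferSize call quoted above; everything else (variable names, surrounding structure) is hypothetical and is not the PR's actual code:

  // Hypothetical sketch of the size check described above.
  constexpr int64_t kTotalRegistersPerSM = 64 * 1024; // "all the 64K registers"
  constexpr int64_t kBytesPerRegister = 4;

  // Regular persistent buffers (inner reduction, projected inputs, ...).
  int64_t persistent_buffer_size = regular_persistent_buffer_size;

  // Add the partial results of the outer reduction, which the regular
  // persistent-buffer analysis does not detect.
  persistent_buffer_size += scheduler_utils::partialReductionBufferSize(
      outer_reduction_tvs, runtime_info);

  // Previously limited to half the register file; raised to the full register
  // file so that float with hidden size <= 14K can still take the combined scheduler.
  const int64_t available_persistent_buffer_size =
      kTotalRegistersPerSM * kBytesPerRegister;

  if (persistent_buffer_size > available_persistent_buffer_size) {
    return false; // fall back to the segmented scheduler
  }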

naoyam (Collaborator) commented Mar 8, 2023

I mean that when calculating the persistent buffer size, the code can't detect the following two buffers. So yes, there is an underestimation.

  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T38;
  partial_result_of_outer_reduction: Array<float, ((10 * 1) * 4), 4> T34;

Don't we need to account for the buffers to decide if the persistent scheduler should be picked? This is why I also asked:

I'm also a little concerned about whether this schedule should always be used for any inner and outer persistent sizes. I was expecting to see some changes to canScheduleRuntime(). Have you checked some extreme cases?

Enforcing projection to the inputs can improve performance because weight is shared across different rows. If we keep it persistent, we don't need to reload it in the iteration over different rows.

Right, so let me rephrase my question. Is this specific to this inner-outer scheduler? If not, how can we make it more general? Should we always use projected buffers?

I checked all the benchmarks and there are regressions if the hidden size is >= 16K with float. In the revised branch, the buffer size of the outer reductions is added: persistent_buffer_size += scheduler_utils::partialReductionBufferSize(outer_reduction_tvs, runtime_info); To allow the use of this combined approach, available_persistent_buffer_size is increased from half of all the registers to all of the 64K registers. This allows float with hidden size <= 14K to pass the canScheduleRunTime check. This leads to register spills, but the performance is still faster than the segmented version (860 GB/s vs. 660 GB/s). To avoid register spills, multiple blocks per row should be used. This is probably not urgent since popular hidden sizes are usually <= 10K.

Thanks for checking the performance. The size check makes sense to me. Register usage and its perf impact are difficult to predict, but this looks like a reasonable heuristic. I'll revisit and review the PR.

Have you also run the benchmarks on other devices such as V100?

liqiangxl (Collaborator, Author) commented Mar 8, 2023

I checked all the benchmarks and there are regressions if the hidden size is >= 16K with float.

This is for both V100 and A100.

(Resolved review threads on third_party/nvfuser/csrc/scheduler/utils.h, utils.cpp, reduction_utils.h, and reduction_utils.cpp — outdated, hidden)
naoyam (Collaborator) left a comment

I gave up reviewing normalization.cpp. Please, please add more comments.

(Resolved review threads on third_party/nvfuser/csrc/scheduler/reduction_heuristic.h and reduction_utils.cpp — outdated, hidden)
@@ -140,7 +144,13 @@ TensorView* scheduleReductionTV(
outer_unroll(outer_i++, rparams.unroll_factor_inner_reduction);
}

reduction_tv->axis(outer_i)->parallelize(rparams.block_dim_inner_reduction);
if (rparams.combined_inner_outer && !rparams.multiple_reds_per_blk) {
naoyam (Collaborator):

In the above, there's an additional condition, rparams.lparams.bdimx() > 1. Why is it not used here?

naoyam (Collaborator):

Also, the code would be easier to understand if there were only one place that uses rparams.block_dim_inner_reduction (within this if block for rparams.persistent_kernel).

liqiangxl (Collaborator, Author):

In the above, there's an additional condition, rparams.lparams.bdimx() > 1. Why is it not used here?

This condition shouldn't exist. I can't remember why it was there originally. Maybe I was doing some debugging.

liqiangxl (Collaborator, Author):

Also, the code would be easier to understand if there were only one place that uses rparams.block_dim_inner_reduction (within this if block for rparams.persistent_kernel).

rparams.block_dim_inner_reduction is used to split the reduction dim by NamedScalar::getParallelDim(ptype) for the combined reduction with a single reduction per block. For a regular reduction, it is only used to parallelize the reduction dim. So it appears twice in the code.

naoyam (Collaborator):

I was wondering if this could also work:

Suggested change (replacing the line "if (rparams.combined_inner_outer && !rparams.multiple_reds_per_blk) {" above):

  // delete the above use of block_dim_inner_reduction
  ...
  if (rparams.combined_inner_outer && !rparams.multiple_reds_per_blk) {
    inner_parallel(outer_i, rparams.block_dim_inner_reduction);
    reduction_tv->axis(outer_i)->parallelize(
        rparams.block_dim_inner_reduction_extra);
  } else {
    reduction_tv->axis(outer_i)->parallelize(
        rparams.block_dim_inner_reduction);
  }

This way, I think it's more apparent how we use block_dim_inner_reduction and block_dim_inner_reduction_extra.

(Resolved review threads on third_party/nvfuser/csrc/scheduler/normalization.cpp — outdated, hidden)
naoyam (Collaborator) left a comment

Just comments on hasSharedConsumerNonReductionProducer

liqiangxl (Collaborator, Author) commented

#2400 (comment)

  if (rparams.vectorization_factor_tmp_gmem_read > 1) {
    for (auto tv_pair : cached_outputs) {
      if (tv_pair.second->axis(-1)->getParallelType() !=
          ParallelType::Vectorize) {
        tv_pair.second->axis(-1)->parallelize(ParallelType::Vectorize);
      }
    }
  }

The above code has been removed in the revised version. All the cached_outputs are correctly vectorized once vectorization is set correctly in scheduleReductionCombinedOuter; previously it was set to unroll.
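
For context, a rough sketch of what setting vectorization in scheduleReductionCombinedOuter could look like; the tensor handle below is hypothetical, since the actual function body is not shown in this thread:

  // Hypothetical sketch: vectorize the innermost axis of the cached gmem write
  // directly when scheduling the combined outer reduction, instead of unrolling
  // it and fixing up the parallel type afterwards (the removed code above).
  if (rparams.vectorization_factor_tmp_gmem_read > 1) {
    cached_gmem_tv->axis(-1)->parallelize(ParallelType::Vectorize);
  } else {
    cached_gmem_tv->axis(-1)->parallelize(ParallelType::Unroll);
  }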

naoyam (Collaborator) left a comment

Finally approved! 👏

Great work, Liqiang. This was complicated scheduling, and I believe the PR was able to cleanly integrate it into the existing scheduler.

Please rerun the benchmarks and make sure all the performance improvements of LN backward are still there. Then we need to port this to the new repo. I think there won't be significant conflicts, as this PR is mostly on the persistent scheduler and I don't think there has been any recent major change. If possible, it would be great if you could keep a history of the changes you need to make for the porting, but I'm not sure whether that's easily done.

liqiangxl mentioned this pull request on Apr 22, 2023
naoyam mentioned this pull request on May 23, 2023
Successfully merging this pull request may close these issues.

combined inner outer reduction used in layer norm backward