Vectorized serial grid reduction #1528
Conversation
test/test_serial_gridreduce.cpp (Outdated)

```c++
tv3->axis(0)->parallelize(ParallelType::BIDx);
tv3->axis(1)->parallelize(ParallelType::BIDy);
tv3->axis(4)->parallelize(ParallelType::TIDy);
tv3->axis(5)->parallelize(ParallelType::TIDx);
tv3->axis(6)->parallelize(ParallelType::BIDz);
tv3->axis(6)->parallelize(ParallelType::Vectorize);
```
`ParallelType::Group` might be a better choice here to signal a vectorized serial reduction. I'll experiment with that. Originally I worried that lowering would detect it and convert the node to a `GroupedGridReduction`, but I think that might actually be the proper way to handle it.
I am actually going to keep this as `Vectorize`, since there is another good use for `Group` in this context. When we use split-K in a bmm, the batch dimension is no longer parallelized as it would be without split-K. In this case we have an outer loop over the batch dimension that includes a separate main loop for each row of the batch. Our current approach of syncing around all non-trivial loops for serial reduction is insufficient in that case, since it means we are actually serializing the entire kernel instead of just each output stage. We could explicitly mark the inner dimensions as `ParallelType::Group` to signal that we actually want to sync inside that outer loop.
I think this is fine but also wonder if we could represent the serial grid reduction code with KIR expressions. The serial reduction part looks like just a normal reduction with writes to a global buffer. We would need a special sync IR node after that, but it seems possible to represent the computation part in KIR. If that's the case, we would just need to vectorize the loads and stores as we would do normally.
csrc/device_lower/validation.cpp (Outdated)

```c++
tv->definition() == nullptr || tv->definition()->isA<LoadStoreOp>() ||
    tv->definition()->isA<SliceOp>(),
def == nullptr || def->isA<LoadStoreOp>() ||
    def->isA<SliceOp>() || (def->isA<ReductionOp>()), // && def->as<ReductionOp>()->SerialGridReductionRequested(),
```
Is the `&&` part supposed to be commented out?
That needs to be uncommented once #1456 is merged, since that introduces the method `SerialGridReductionRequested()`. For now it just allows reductions.

I am starting to doubt using `Vectorize` for this, though. It would be cleaner to use `Group` for the vectorized reduction output. Then, as you say, if I lower to kernel IR instead of using a helper function, we could convert that to vectorized `LoadStoreOp`s with `CacheOp::Global`, and the default validation checks would work out, I think. I will mark this as a draft while I give that a shot.
> lower to kernel IR instead of using a helper function

Just want to make sure we agree that may be desirable but is not necessary. Prototyping is typically easier with helper runtime functions than with generating KIR.
I think it is desirable, since we'd get a lot of stuff for free, like vectorization, inserting allocations, and hoisting predicates. I've been trying to get it working as a single lowering pass, but generating IR for the work buffer is challenging because it appears we need to introduce a cycle into the graph: we load from the work buffer and then write back to it. I am about to try making the input and output global work buffers into separate `TensorView`s and forcing the output buffer to alias the input. I'll push a PR soon to show what I mean.
Oh, maybe I should have reviewed #1456 first. Will do.
Force-pushed from 50186ab to 9b950a8 (compare)
Will revisit once sync pass is done, when we have a TensorIndex
Still missing allocation/indexing of work buffer
I need to replay leaf transforms, then get index.
Codegen is now like

```c++
// Allocate global tensor T5
reduction::serialReductionStep(
    T3[0LL],
    T2[(i14 + i18)],
    0.000000000e+00f,
    T5[((((((((((((nvfuser_index_t)blockIdx.x) * 8LL) + ((nvfuser_index_t)blockIdx.y)) * 4LL) + i13) * 8LL) + (i18 + nvfuser_zero)) * 4LL) + ((nvfuser_index_t)threadIdx.y)) * 32LL) + ((nvfuser_index_t)threadIdx.x))],
    [](float &a, float b) { a = a + b; },
    index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
    index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
    true,
    true);
```

This looks OK, although it will get a little better with hoisting. This compiles, but I get an error in `runFusion`:

```
C++ exception with description "Expected T5_g[ iblockIdx.x59{( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(262144, 32) ), 4) ), 8) ), 4) ), 8) )}, iblockIdx.y60{8}, ithreadIdx.y54{4}, ithreadIdx.x52{32}, iS58{4}, iS56{8}, rblockIdx.z49{5} ] to be bound to a tensor of rank 1, but got a tensor of rank 6
Exception raised from validateValWithConcreteValue at /opt/pytorch/nvfuser/csrc/expr_evaluator.cpp:38 (most recent call first):
```

This is happening when binding inputs, I believe.
Fixes execution error. Test passes!
Generated kernel now looks like

```c++
// Allocate global tensor T4
grid_sync::blockSerializeWait<false, false, true>(&T4[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
#pragma unroll
for(nvfuser_index_t i13 = 0; i13 < 4LL; ++i13) {
  nvfuser_index_t i14;
  i14 = 8LL * i13;
  nvfuser_index_t i15;
  i15 = 2048LL * i13;
  nvfuser_index_t i16;
  i16 = i4 + i15;
  nvfuser_index_t i17;
  i17 = -i15;
  #pragma unroll
  for(nvfuser_index_t i18 = 0; i18 < 8LL; ++i18) {
    nvfuser_index_t i19;
    i19 = 256LL * (i18 + nvfuser_zero);
    nvfuser_index_t i20;
    i20 = i16 + i19;
    float T3[1LL];
    T3[0LL] = 0.000000000e+00f;
    // Allocate global tensor T5
    reduction::serialReductionStep(
        T3[0LL],
        T2[(i14 + i18)],
        0.000000000e+00f,
        T5[i20],
        [](float &a, float b) { a = a + b; },
        index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
        index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
        true,
        true);
    if ((b6 && (i5 < (i17 - i19)))) {
      T1[i20] = T3[0LL];
    }
  }
}
NVFUSER_UPDATE_MAGIC_ZERO;
grid_sync::blockSerializeRelease<false, false, true>(&T4[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
```

Note that the index `i20` matches the output `T1`. This is what we need to reclaim `T1` in a later PR; it will still be a challenge in that work to exact-map between `T5` and `T3` in order to get `T1` and `T5` exact mapped...
Also sort expected output by line to give clearer error messages.
These were disabled in #1545 because of slow compilation with gridReduce
This actually means we will insert `st.global.cg` as inline asm instead of the `reinterpret_cast`-based copy that `loadGenericVolatile` uses. This fixed the error I was seeing with `vec_size=8`, indicating that `loadGenericVolatile` does not obey the ordering semantics we need for serial reduction.
Force-pushed from 9b950a8 to a24a552 (compare)
!build
Now that we have the scheduling test we don't need this. And the matmul scheduler exercises vectorization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This is the speedup in …
!build --diff
Python test failure is just a rare event due to low tolerance. Batch matmul split-K tests show an increase in register usage with this change, vs. no change for non-batch matmul tests (see …)
* print bandwidth when perf_debug_verbose is true (NVIDIA#1689)

  print bandwidth when `perf_debug_verbose` is true.

* in vectorization validation, add err msg if tv has no definition (NVIDIA#1690)

  check the existence of tv definition in vectorization validation

* Accommodate Reduction IterDomains when concretizing reshape extents (NVIDIA#1692)

  We register extents for concretization when we concretize reshape. In order to do that, we line up `IterDomain`s in the symbolic reshaped TV and the new, concretized one. In cases where the concretized reshape is trivial, such as when the output shape is the same as the input, we do not create a new TV. In those cases, we will have the input to the original `ViewOp` as the concretized output. That input TV might have reduction domains, as in the provided test, in which case we need to filter those out when doing this alignment. This small PR just implements that filtering. Fixes NVIDIA#1691.

* `MmaOp::evaluate` method (NVIDIA#1675)

* Fix some typos. (NVIDIA#1700)

* `torch.compile` and `eager` benchmarks for `softmax` (NVIDIA#1670)

  Adds `torch.compile` and `eager` baseline benchmarks to be used in weekly benchmark runs. Issue NVIDIA#1668.

* Add a test for fusions with no inputs. (NVIDIA#1709)

  As a follow-up to NVIDIA#1696 (comment).

* Double the size of the fusion cache to work around a CI issue. (NVIDIA#1702)

  By just removing entries when it fills up.

* Check that the reduced axis is sharded on producer in isLowerableToCommunication (NVIDIA#1695)

  Currently, a reduction is lowerable to a communication iff only one axis is reduced and this axis is sharded across devices on the **producer** side. Before this patch, we would mistakenly check that the axis is sharded on the **consumer** side, which led to some runtime assert errors.

* Add blank impl of isLowerableToCommunication. (NVIDIA#1698)

  isLowerableToCommunication is used in a few places to print error messages or short-circuit loops. Those places appear to be intended largely for use behind the distributed path. It's easier to just define the API instead of trying to conditionalize all the use sites and invent non-USE_DISTRIBUTED behavior.

* Multidevice segmenter (NVIDIA#1696)

  # What

  Add an option in the segmenter to segment resharding Exprs into separate singleton segments. To trigger it, set the segmenter's options as follows:

  ```
  SegmentCandidateFinderOptions options{
      .run_translate_welford = false,
      .run_combine_reductions = false,
      .run_herrmann_merge = true,
      .run_final_merge = true,
      .only_segment_resharding_exprs = true};
  ```

  and use the segmenter as follows with any (possibly dummy) inputs:

  ```
  KernelArgumentHolder dummy_inputs;
  auto segmented_fusion =
      SegmentCandidateFinder::segment(std::move(fusion), dummy_inputs, options);
  ```

  If `only_segment_resharding_exprs` is set to `false` (which is the case by default), the behavior of the segmenter is unchanged. We also provide a quite wide testing suite to validate our implementation.

  # Why

  Resharding Exprs need to be handled differently than other Exprs because we want them to result in posting a network collective from the host. Therefore those expressions cannot (for now) be fused to any kernel. For this reason, we need those Exprs to be segmented before and after.

  # How

  _**Remark:** For now, the segmenter is only used [at one place before scheduling and compiling the fusion](https://github.com/NVIDIA/Fuser/blob/1603f39bab8c1bbe12e38f2b5de53dec3b7cc373/csrc/kernel_cache.cpp#L990)._

  Recall that the segmenter first creates as many segments as there are Exprs and then tries to merge the neighbouring segments incrementally in an eager manner. The method

  ```
  bool SegmentCandidateFinder::codeGenSupportedMerge(
      SegmentedGroup* group1,
      SegmentedGroup* group2)
  ```

  returns whether two groups can be merged (i.e. fused into one kernel). With the current patch, if `SegmentCandidateFinderOptions::only_segment_resharding_exprs` is set to `true`, then the usual behavior of `codeGenSupportedMerge` is bypassed and the function returns whether one Expr among the groups is resharding.

  Because this segmentation shouldn't depend on the input data, we use a default (aka empty) `KernelArgumentHolder`, from which it is invalid to instantiate a `SchedulerRuntimeInfo runtime_info_`. For this reason, we had to make the latter attribute optional.

  # Future/other directions

  Another way to achieve the same result is to manually add segment bounds surrounding the resharding Exprs, as was suggested by @wujingyue in NVIDIA#1571. The current implementation looks a bit "hacky" and should be integrated more properly once multidevice schedulers are implemented and/or the segmenter is refactored. Later, we might want to be able to fuse communications with computes and also communications between them. This would require a more advanced segmenter and scheduler, but hopefully this patch can serve as a good basis.

  # Example

  Consider the fusion:

  ```
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());
  TensorView* tv0 = makeContigTensor({4});
  fusion->addInput(tv0);
  TensorView* tv1 = sum(tv0, {3});
  TensorView* tv2 = set(tv1);
  TensorView* tv3 = sum(tv2, {2});
  fusion->addOutput(tv3);
  ```

  Manually scheduled as follows:

  ```
  DeviceMesh mesh({0, 1, 2, 3});
  for (auto tv : {tv0, tv1, tv2, tv3}) {
    tv->setDeviceMesh(mesh);
  }
  tv0->axis(0)->parallelize(ParallelType::DIDx);
  tv1->axis(0)->parallelize(ParallelType::DIDx);
  ```

  This scheduling implies that:
  - `tv0` and `tv1` are fully sharded on the devices {0,1,2,3}
  - `tv2` and `tv3` are fully replicated on those same devices
  - consequently, the "set" operation on the line `tv2 = set(tv1)` actually embeds an "AllGather" network collective.

  This Expr is resharding, while all the other exprs are not. We thus expect this expression to constitute an unmergeable segment. The segmenter in this situation, with the option `SegmentCandidateFinderOptions::only_segment_resharding_exprs` set to `true`, will result in three segments:
  - Compute segment 1: with the expr `tv1 = sum(tv0, {3})`
  - Communication segment 1: with the expr `tv2 = set(tv1)`
  - Compute segment 2: with the expr `tv3 = sum(tv2, {2})`

* Vectorization Factor patch for computeInfoC2P with Broadcast in mapped IterDomain (NVIDIA#1625)

  Fixes NVIDIA#1567. This PR patches the vectorization factor in `ContiguousInnerDimensionsMapper::computeInfoC2P`. Handling of resolved broadcast dimensions should be done on the mapped consumer tensors' from_ids, instead of in root_domain order. Added a few tests per @zasdfgbnm's suggestion:

  ```
  Case 0: T2[1024, 2, 512] = T0[1024, 2, 1] + T1[1024, 2, 512]
    allocation = rfactor --> T0 has no vectorization
  Case 1: T2[1024, 512, 2] = T0[1024, 1, 2] + T1[1024, 512, 2]
    allocation = rfactor --> T0 has vectorization 2
  Case 2: T2[1024, 512, 2] = T0[1024, 1, 2] + T1[1024, 512, 2];
          T3[512, 1024, 2] = transpose(T2[1024, 512, 2])
    allocation = rfactor *except T1 has stride_order {1, 2, 0} --> T0 has vectorization 4
  Case 3: T2[512, 1024, 2] = T0[1, 1024, 2] + T1[512, 1024, 2]
          T3[1024, 512, 2] = transpose(T2[512, 1024, 2])
    allocation = rfactor --> T0 has vectorization 2
  ```

  Co-authored-by: Jacob Hinkle <[email protected]>
  Co-authored-by: Gao, Xiang <[email protected]>

* transpose scheduler fix: reduction IterDomain on input tensors (NVIDIA#1661)

  Fixes NVIDIA#1659. Reorders reduction IterDomains so they won't interfere with tiling in the transpose scheduler.

* Convert reduction of expanded dims to squeeze (NVIDIA#1679)

  See the comment in arith.cpp for details. One controversial change here is to allow squeezing expanded dimensions, both in our IR's `SqueezeOp` and in the user-facing `squeeze` functions. This results in actually removing those dimensions. This behavior diverges from PyTorch, whose `squeeze` command will ignore requested squeezes if the size is not 1, regardless of whether that dimension is expanded. I'm happy to discuss this change and potentially take another course, but I think we do need to be able to remove expanded axes (see NVIDIA#1174 (comment) for another case where I encountered this limitation). Fixes NVIDIA#1678.

* Make sure ValGraphs are created deterministically (NVIDIA#1714)

  While I was working on NVIDIA#32, I sometimes saw non-deterministic results. Hopefully this is the only source of non-determinism.

* Fix squeeze-related errors (NVIDIA#1717)

  This fixes current failures in `pytest_ops.py -k squeeze` and some integration failures. It restores our previous semantics for squeeze, which **do not match PyTorch**. Namely, if squeeze is provided a dimension that cannot be squeezed, we will always raise an error.

* NVFUSER_DISTRIBUTED instead of USE_DISTRIBUTED (NVIDIA#1711)

* Add the missing `clang-format on` and reformat. (NVIDIA#1722)

* Print a newline before the header. (NVIDIA#1720)

* Associate each fusion cache with its local rank in distributed setting. (NVIDIA#1699)

  ### Problem

  Currently, automatic serialization saves a single cache regardless of the number of devices. In a distributed setting, each process restores its fusion cache from the same common workspace. However, this workspace only contains the CUDA kernels for a single device. The remaining processes must recompile the kernels for their devices.

  ### Solution

  A separate process is created for each device with `ddp` or `fsdp`, and each process contains a separate `FusionCache`. This PR associates each fusion cache with its local rank in a distributed setting, allowing automatic serialization to create a separate workspace for each device. During deserialization, each process loads the workspace associated with its local rank.

* Vectorized serial grid reduction (NVIDIA#1528)

  This change allows us to use vectorized loads/stores in `serialReductionStep`. The generated kernel now looks like

  ```c++
  NVFUSER_UPDATE_MAGIC_ZERO;
  grid_sync::blockSerializeWait<false, false, true>(&T5[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
  #pragma unroll
  for(nvfuser_index_t i16 = 0; i16 < 4LL; ++i16) {
    nvfuser_index_t i17;
    i17 = 32LL * i16;
    nvfuser_index_t i18;
    i18 = 4096LL * i16;
    nvfuser_index_t i19;
    i19 = i5 + i18;
    nvfuser_index_t i20;
    i20 = -i18;
    #pragma unroll
    for(nvfuser_index_t i21 = 0; i21 < 8LL; ++i21) {
      nvfuser_index_t i22;
      i22 = 512LL * (i21 + nvfuser_zero);
      Array<float, 4LL, 4> T3;
      T3.set(float(0.000000000e+00f));
      reduction::serialReductionStep</*vec_size=*/4>(
          &T3[0LL],
          &T2[(i17 + (4LL * i21))],
          0.000000000e+00f,
          &T6[(i19 + i22)],
          [](float &a, float b) { a = a + b; },
          index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
          index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
          true,
          true);
      if ((b7 && (i6 < (i20 - i22)))) {
        loadLocalToGlobal<float, /*vec_size=*/4, /*is_volatile=*/false>(
            &T1[(i19 + i22)],
            &T3[0LL]);
      }
    }
  }
  grid_sync::blockSerializeRelease<false, false, true>(&T5[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
  NVFUSER_UPDATE_MAGIC_ZERO;
  ```

* removing out-dated assert on python API (NVIDIA#1724)

  removing out-dated asserts in the python API `define_vector`; adding a test verifying the behavior

* make ci green again (NVIDIA#1730)

  skip failing test. Please enable it once we patch NVIDIA#1728

* Remove unnecessary `MATCHER_P`. (NVIDIA#1729)

* Fix Issue NVIDIA#1734 (NVIDIA#1735)

  Closes Issue NVIDIA#1734.

* Rename `AliasType` -> `AllocationType` (NVIDIA#1732)

* Skip executing a kernel if it's empty. (NVIDIA#1723)

  I could change `compileFusion` to skip compilation as well. It turned out to be more complicated than I expected, so I took the easier route and skipped just execution, which is at least an incremental improvement.

* don't cache slice input tv (NVIDIA#1705)

  If the input tv is used by slice, don't cache it. Fixes NVIDIA#1697.

* Make `MmaOp::evaluate` return output of the same dtype as `MmaOp` (NVIDIA#1733)

* Turing/Ampere Mma tests without `BroadcastOp` (NVIDIA#1672)

  This PR renames `matmulAtInput` to `matmulAtInput2D`, explicitly showing that it generates 2D inputs. It also adds `matmulAtInput3DTuring`, which is used to generate the 3D fusion inputs (for example `[M, 1, K]` and `[1, K, N]`) for matmul. The `MmaTest`s for Turing and Ampere are modified to exclude the `BroadcastOp` and use the 3D version for generating fusion inputs. This is only the initial step toward making `scheduleMatmul` schedule a fusion not containing `BroadcastOp`; I intentionally kept it small. Other changes will be added in follow-up PRs. Fixes NVIDIA#1628.

* io_alias_ const update (NVIDIA#1740)

* Add benchmarks for RoPE. (NVIDIA#1739)

  This PR adds two implementations of the RoPE module and benchmarks them for NVIDIA#1597. `rope_with_cat_fusion` mimics the Hugging Face implementation. `rope_without_cat_fusion` implements an idea from @nikitaved to avoid concatenation. Even though it looks difficult for the compiler to do it all automatically, it's still useful to keep a record of the idea. As a side change, I made `fd.define_tensor` accept empty contiguity.

* Make nvfuser matmul benchmarks HSH instead of HSS (NVIDIA#1712)

  This matches the `at::matmul` baselines. This PR also adds a few more problem sizes, and runs each eager-mode baseline with and without FP16 reduction allowed.

* Reduce number of `MmaTest`s (NVIDIA#1738)

  This PR is stacked on top of NVIDIA#1672. Turing/Ampere mma is only TN, so it makes no sense to test other layouts in `MmaTest`s. These tests are intended to test mma instructions; `ldmatrix` and `ldmatrix.trans` are tested separately in other unit tests. Similar for the `HopperRS` tests.

* Weekly Benchmarks Input Range (NVIDIA#1708)

* Rename axes= to dims= in frontend (NVIDIA#1741)

  Currently we accept `axes=` for some ops like `fd.ops.sum` and `dims=` for others like `fd.ops.squeeze`. This is a small attempt to make the frontend arguments more consistent. This change renames the `axis=` kwarg to `dim=` and the same for `axes=` -> `dims=`. I think we're free to set our own convention, but for reference:
  - PyTorch uses `dim=` in most places and accepts either a single dim or multiple using that same argument name, where applicable.
  - Numpy uses `axis=` and, like PyTorch, accepts a list where applicable.
  - `jax.lax` uses `dimensions=`

* Avoid unused smem workspace for serial grid reductions (NVIDIA#1727)

  GridReduction can be lowered to either `gridReduce` or `serialReductionStep`. `gridReduce` requires a smem workspace in order to use multiple threads to aggregate partial sums. However, `serialReductionStep` does not coordinate among threads and has no use for a workspace. This change simply disables allocating that little bit of extra shared memory if our only grid reductions are serial, which currently only happens in split-K GEMM. This reduces the smem allocated in a simple test from 16896 B to 16384 B (about 97%). More importantly, this makes the computation in `mma_utils::generateSharedMemoryEpilogueHeuristics()` more accurate. Tests are updated to check that this computation is accurate. The change in `kernel.cpp` is responsible for reducing actual smem usage for split-K. The changes to `mma_utils` and `test_gpu_tensorcore.cpp` are needed for adding testing that our expected smem usage matches the actual usage.

* Issue NVIDIA#1748 (NVIDIA#1749)

  Closes Issue NVIDIA#1748. Apart from `c10::cuda::GetDevice`, no other functionality seems affected.

* Rename `axes` to `dims` in benchmarks fusion definitions (NVIDIA#1751)

  Changes the kwarg `axes` to `dims` following the API change in PR NVIDIA#1741.

* Bump matmul benchmark checkMatch() tolerance (NVIDIA#1747)

  This is necessary due to the recent switch to HSH. Fixes NVIDIA#1746.

* linter
* change guard USE_DISTRIBUTED to NVFUSER_DISTRIBUTED in test/test_multidevice_sharding.cpp
* linting
* linter and cleanup
* remove allocator.h/cpp files

* Device index patch (NVIDIA#1752)

  Fixes NVIDIA#1748. Guards the `c10::cuda::GetDevice` API change on TORCH_VERSION. With this change, we can build against stable releases `< 2.2.0` as well as TOT after pytorch/pytorch#119142. For 2.3.0 nightly, if someone accidentally checks out a commit before the patch, the build will still fail.

* fixing multidevice build (NVIDIA#1753)

  API change coming from pytorch/pytorch#119421

* patching API GUARD (NVIDIA#1754)

  patching the API version guard so we'll still be able to build against older pytorch versions.

* Add a visitor for ValGraph (NVIDIA#1713)

  Used in the loop promotion analysis. Extracted from NVIDIA#32.

* empty commit for triggering CI

Co-authored-by: Liqiang Lu <[email protected]>
Co-authored-by: Jacob Hinkle <[email protected]>
Co-authored-by: Priya Mishra <[email protected]>
Co-authored-by: Jingyue Wu <[email protected]>
Co-authored-by: Tom Fogal <[email protected]>
Co-authored-by: jjsjann123 <[email protected]>
Co-authored-by: Gao, Xiang <[email protected]>
Co-authored-by: Naoya Maruyama <[email protected]>
Co-authored-by: Meghan Cowan <[email protected]>
Co-authored-by: Ryan Spring <[email protected]>
Serial grid reductions are used in split-K matmuls as of #1510. This means we load and store elements in the reduction tensor according to the indexing of the work buffer. This is unlike ordinary grid reductions that use `gridReduce`, which reduces individual elements using a scheme that ensures coalescing by indexing into the work buffer based on `threadIdx` and `blockIdx`. Currently these split-K accesses are inefficient due to this lack of coalescing.

We currently already ensure coalesced output stores in matmuls when possible by using smem for the epilogue (#387). A shared memory buffer is used to communicate elements between threads so that the resulting tensor will have a proper global access pattern when it is written out to global memory as a tile of the output. Before this PR, if we used split-K with `use_smem_epilogue = true`, the store to global memory would be coalesced but there would be uncoalesced accesses during the split-K reduction. This PR modifies scheduling so that in those cases the smem epilogue tensor is placed before the split-K sum, so that unswizzling happens before completing the reduction. The result is that the reduction accesses are coalesced.

This is a generated kernel from `NVFuserTest.FusionAmpereMatmulSplitKBias_CUDA`:

```c++
// ... (main loop) ...
    #pragma unroll
    for(nvfuser_index_t i59 = 0; i59 < 4LL; ++i59) {
      nvfuser_index_t i104;
      i104 = 8LL * i59;
      nvfuser_index_t i105;
      i105 = 32LL * i59;
      #pragma unroll
      for(nvfuser_index_t i61 = 0; i61 < 8LL; ++i61) {
        nvfuser_index_t i106;
        i106 = 4LL * i61;
        asm(
          "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 {%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
          :"=f"((*reinterpret_cast<Array<float, 4, 1>*>(&T16[(i105 + i106)]))[0]),
           "=f"((*reinterpret_cast<Array<float, 4, 1>*>(&T16[(i105 + i106)]))[1]),
           "=f"((*reinterpret_cast<Array<float, 4, 1>*>(&T16[(i105 + i106)]))[2]),
           "=f"((*reinterpret_cast<Array<float, 4, 1>*>(&T16[(i105 + i106)]))[3])
          :"r"((*reinterpret_cast<Array<uint32_t, 4, 1>*>(&T4[i104]))[0]),
           "r"((*reinterpret_cast<Array<uint32_t, 4, 1>*>(&T4[i104]))[1]),
           "r"((*reinterpret_cast<Array<uint32_t, 4, 1>*>(&T4[i104]))[2]),
           "r"((*reinterpret_cast<Array<uint32_t, 4, 1>*>(&T4[i104]))[3]),
           "r"((*reinterpret_cast<Array<uint32_t, 2, 1>*>(&T5[i106]))[0]),
           "r"((*reinterpret_cast<Array<uint32_t, 2, 1>*>(&T5[i106]))[1]),
           "f"((*reinterpret_cast<Array<float, 4, 1>*>(&T16[(i105 + i106)]))[0]),
           "f"((*reinterpret_cast<Array<float, 4, 1>*>(&T16[(i105 + i106)]))[1]),
           "f"((*reinterpret_cast<Array<float, 4, 1>*>(&T16[(i105 + i106)]))[2]),
           "f"((*reinterpret_cast<Array<float, 4, 1>*>(&T16[(i105 + i106)]))[3])
        );
      }
    }
  }
  NVFUSER_UPDATE_MAGIC_ZERO;
  __syncthreads();
}
__syncthreads();
#pragma unroll
for(nvfuser_index_t i107 = 0; i107 < 4LL; ++i107) {
  nvfuser_index_t i108;
  i108 = 32LL * i107;
  nvfuser_index_t i109;
  i109 = i38 + (2048LL * i107);
  #pragma unroll
  for(nvfuser_index_t i110 = 0; i110 < 8LL; ++i110) {
    nvfuser_index_t i111;
    i111 = i108 + (4LL * i110);
    nvfuser_index_t i112;
    i112 = i11 + i110;
    nvfuser_index_t i113;
    i113 = (i109 + (32LL * (i112 / 4LL))) + (8LL * (i39 ^ (i112 % 4LL)));
    #pragma unroll
    for(nvfuser_index_t i114 = 0; i114 < 2LL; ++i114) {
      loadGeneric<float, 2>(
          &T17[(i113 + (1024LL * i114))],
          &T16[(i111 + (2LL * i114))]);
    }
  }
}
NVFUSER_UPDATE_MAGIC_ZERO;
// Allocate global tensor T19
grid_sync::blockSerializeWait<false, false, true>(&T19[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
__syncthreads();
#pragma unroll
for(nvfuser_index_t i115 = 0; i115 < 32LL; ++i115) {
  nvfuser_index_t i116;
  i116 = i115 + nvfuser_zero;
  nvfuser_index_t i117;
  i117 = i44 + (i45 * i116);
  nvfuser_index_t i118;
  i118 = i47 + (4LL * i115);
  bool b119;
  b119 = i55 < (-(4LL * i116));
  bool b120;
  b120 = b54 && b119;
  Array<float, 4LL, 4> T6;
  T6.set(float(0.000000000e+00f));
  // Allocate global tensor T20
  reduction::serialReductionStep</*vec_size=*/4>(
      &T6[0LL],
      &T17[(i42 + (512LL * i115))],
      0.000000000e+00f,
      &T20[i117],
      [](float &a, float b) { a = a + b; },
      index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
      index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
      b120,
      b120);
  Array<float, 4LL, 4> T10;
  #pragma unroll
  for(nvfuser_index_t i121 = 0; i121 < 4LL; ++i121) {
    __half T18[1LL];
    T18[0LL] = 0LL;
    if (b119) {
      T18[0LL] = T2[(i118 + ((i48 + (i121 + nvfuser_zero)) / 128LL))];
    }
    __half T7[1LL];
    T7[0LL] = T18[0LL];
    float T8[1LL];
    T8[0LL] = __half2float(T7[0LL]);
    T10[i121] = T6[i121] + T8[0LL];
  }
  if ((b56 && b119)) {
    loadLocalToGlobal<float, /*vec_size=*/4, /*is_volatile=*/false>(
        &T9[i117],
        &T10[0LL]);
  }
}
NVFUSER_UPDATE_MAGIC_ZERO;
grid_sync::blockSerializeRelease<false, false, true>(&T19[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
}
```

Note that the `i135` loop will be smaller once we have #1528, at which point it would more clearly show the reduction followed by the loop for the predicated bias epilogue. (Diff should be viewed hiding whitespace changes, as many changes are to indentation.)
This change allows us to use vectorized loads/stores in `serialReductionStep`. The generated kernel is shown in the NVIDIA#1528 entry of the merge summary above.