Transpose scheduler merges IterDomains with different iteration types. #1659

Closed
jjsjann123 opened this issue Jan 23, 2024 · 5 comments · Fixed by #1661
Labels: bug (Something isn't working)

@jjsjann123
Collaborator

Repro script:

import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1, 1], contiguity=[True, True, None], dtype=DataType.Float, is_cpu=False, stride_order=[2, 1, 0])
    T1 = fd.define_tensor(shape=[-1, 1, -1], contiguity=[True, None, True], dtype=DataType.Float, is_cpu=False, stride_order=[2, 1, 0])
    T2 = fd.ops.sum(T0, axes=[1], keepdim=False, dtype=DataType.Null)
    T3 = fd.ops.sum(T1, axes=[1], keepdim=False, dtype=DataType.Null)
    T4 = fd.ops.mul(T2, T3)
    fd.add_output(T4)

with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)

inputs = [
    torch.randn((524288,), dtype=torch.float32, device='cuda:0').as_strided((1024, 512, 1), (512, 1, 1)),
    torch.randn((524288,), dtype=torch.float32, device='cuda:0').as_strided((1024, 1, 512), (512, 512, 1)),
]
fd.execute(inputs)

Running the script fails with:

Traceback (most recent call last):
  File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 137, in execute
    result = self._execute(
RuntimeError: Merging IterDomains requires that their iteration types match. Outer: iS69{32}, Inner: rS7{i1}
Exception raised from merge at /opt/pytorch/nvfuser/csrc/ir/nodes.cpp:2692 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7fd2d0278a43 in /opt/pytorch/nvfuser/nvfuser/lib/libnvfuser_codegen.so)
frame #1: nvfuser::IterDomain::merge(nvfuser::IterDomain*, nvfuser::IterDomain*, bool) + 0x39b (0x7fd2d04fd27b in /opt/pytorch/nvfuser/nvfuser/lib/libnvfuser_codegen.so)
frame #2: nvfuser::TensorDomain::merge(int, int) + 0xc9 (0x7fd2d04fd3d9 in /opt/pytorch/nvfuser/nvfuser/lib/libnvfuser_codegen.so)
frame #3: nvfuser::TensorView::merge(int, int) + 0xdb (0x7fd2d07cffcb in /opt/pytorch/nvfuser/nvfuser/lib/libnvfuser_codegen.so)
frame #4: nvfuser::scheduleTranspose(nvfuser::Fusion*, nvfuser::TransposeParams) + 0x109b (0x7fd2d071fb5b in /opt/pytorch/nvfuser/nvfuser/lib/libnvfuser_codegen.so)
frame #5: nvfuser::TransposeScheduler::schedule(nvfuser::Fusion*) + 0xe0 (0x7fd2d0721db0 in /opt/pytorch/nvfuser/nvfuser/lib/libnvfuser_codegen.so)
frame #6: nvfuser::FusionKernelRuntime::compileKernel(nvfuser::KernelArgumentHolder const&, nvfuser::SegmentedGroup*) + 0x151 (0x7fd2d0533ab1 in /opt/pytorch/nvfuser/nvfuser/lib/libnvfuser_codegen.so)
frame #7: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0x41c (0x7fd2d0539b9c in /opt/pytorch/nvfuser/nvfuser/lib/libnvfuser_codegen.so)
frame #8: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa43 (0x7fd2d0545523 in /opt/pytorch/nvfuser/nvfuser/lib/libnvfuser_codegen.so)
frame #9: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, bool, bool, std::optional<signed char>) const + 0x16c (0x7fd2d0854a2c in /opt/pytorch/nvfuser/nvfuser/lib/libnvfuser_codegen.so)
frame #10: <unknown function> + 0xf152e (0x7fd2d0ba252e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #11: <unknown function> + 0x177488 (0x7fd2d0c28488 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
<omitting python frames>
frame #27: <unknown function> + 0x29d90 (0x7fd4f73bbd90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #28: __libc_start_main + 0x80 (0x7fd4f73bbe40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
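
The check that raised here can be illustrated with a toy model in plain Python. This is only a sketch, not nvFuser's implementation: `ToyIterDomain` and `toy_merge` are hypothetical names, and the rule that a broadcast domain adopts its partner's type is an assumption (it matches the IR dump later in this thread, where `iS70` merged with `bS72` yields `iS74`).

```python
from dataclasses import dataclass

# Simplified iteration types, named after the prefixes in the printed IR
# (iS = iteration, rS = reduction, bS = broadcast).
ITERATION, REDUCTION, BROADCAST = "i", "r", "b"


@dataclass(frozen=True)
class ToyIterDomain:
    """Hypothetical stand-in for an IterDomain: a name plus a type tag."""
    name: str
    kind: str


def toy_merge(outer: ToyIterDomain, inner: ToyIterDomain) -> ToyIterDomain:
    """Merge two toy domains, rejecting mismatched iteration types."""
    # Assumption: a broadcast domain adopts the type of its partner.
    if outer.kind == BROADCAST:
        kind = inner.kind
    elif inner.kind == BROADCAST or inner.kind == outer.kind:
        kind = outer.kind
    else:
        raise RuntimeError(
            "Merging IterDomains requires that their iteration types match. "
            f"Outer: {outer.name}, Inner: {inner.name}"
        )
    return ToyIterDomain(f"merge({outer.name},{inner.name})", kind)


# Iteration merged with broadcast is accepted and stays an iteration domain.
ok = toy_merge(ToyIterDomain("iS70", ITERATION), ToyIterDomain("bS72", BROADCAST))

# Iteration merged with reduction is rejected, as in the traceback.
try:
    toy_merge(ToyIterDomain("iS69", ITERATION), ToyIterDomain("rS7", REDUCTION))
    error = None
except RuntimeError as exc:
    error = str(exc)
```

In this model the failure depends only on which two domains end up adjacent when the scheduler merges, which is why the position of the reduction domain matters.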
@jjsjann123
Collaborator Author

It's asserting here: https://github.com/NVIDIA/Fuser/blob/da7c4e97e3b90267a5465de856afc5e8d64fc56a/csrc/scheduler/transpose.cpp#L1239C1-L1239C26

Roughly, it looks like the propagation is what fails here.
I printed the fusion right before the assert (after global scheduling is done).

global reference looks like this:

T4_g[ iblockIdx.x33{( ceilDiv(( ( ceilDiv(i0, 32) ) * ( ceilDiv(i6, 32) ) ), 1) )}, iUS34{1}, iS29{32}, iS31{32} ]
 root domain : (iS26{i0}, iS27{i6})
 contiguity: t t
  Split: iS26{i0} by factor 32 -> iS30{( ceilDiv(i0, 32) )}, iS31{32}, start offset: 0, stop offset: 0
  Split: iS27{i6} by factor 32 -> iS28{( ceilDiv(i6, 32) )}, iS29{32}, start offset: 0, stop offset: 0
  Merge: iS30{( ceilDiv(i0, 32) )} and iS28{( ceilDiv(i6, 32) )} -> iS32{( ( ceilDiv(i0, 32) ) * ( ceilDiv(i6, 32) ) )}
  Split: iS32{( ( ceilDiv(i0, 32) ) * ( ceilDiv(i6, 32) ) )} by factor 1 -> iblockIdx.x33{( ceilDiv(( ( ceilDiv(i0, 32) ) * ( ceilDiv(i6, 32) ) ), 1) )}, iUS34{1}, start offset: 0, stop offset: 0
 leaf domain : (iblockIdx.x33{( ceilDiv(( ( ceilDiv(i0, 32) ) * ( ceilDiv(i6, 32) ) ), 1) )}, iUS34{1}, iS29{32}, iS31{32})

But propagated reference2 looks like this:

T2_g[ iS75{( ceilDiv(( ( ceilDiv(i0, 32) ) * ( ceilDiv(1, 32) ) ), 1) )}, iS76{1}, bS73{32}, iS71{32}, rS7{i1} ]
 root domain : (iS6{i0}, rS7{i1}, bS8{1})
 contiguity: t n n
  Split: iS6{i0} by factor 32 -> iS70{( ceilDiv(i0, 32) )}, iS71{32}, start offset: 0, stop offset: 0
  Split: bS8{1} by factor 32 -> bS72{( ceilDiv(1, 32) )}, bS73{32}, start offset: 0, stop offset: 0
  Merge: iS70{( ceilDiv(i0, 32) )} and bS72{( ceilDiv(1, 32) )} -> iS74{( ( ceilDiv(i0, 32) ) * ( ceilDiv(1, 32) ) )}
  Split: iS74{( ( ceilDiv(i0, 32) ) * ( ceilDiv(1, 32) ) )} by factor 1 -> iS75{( ceilDiv(( ( ceilDiv(i0, 32) ) * ( ceilDiv(1, 32) ) ), 1) )}, iS76{1}, start offset: 0, stop offset: 0
 leaf domain : (iS75{( ceilDiv(( ( ceilDiv(i0, 32) ) * ( ceilDiv(1, 32) ) ), 1) )}, iS76{1}, bS73{32}, iS71{32}, rS7{i1})

For some reason rS7{i1} is reordered to be the innermost dimension, which doesn't look right, since we expect the tile to be innermost.
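
To make the symbolic extents in the two dumps concrete, here is a small arithmetic sketch. It assumes `i0 = 1024` and `i6 = 512` (the sizes of the repro inputs) and a tile size of 32; `ceil_div` is a local helper written for this example, not an nvFuser function.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, mirroring ceilDiv in the printed IR."""
    return -(-a // b)


TILE = 32
i0, i6 = 1024, 512  # sizes taken from the repro inputs

# T4 (global reference): extent of iblockIdx.x33.
blocks_ref1 = ceil_div(ceil_div(i0, TILE) * ceil_div(i6, TILE), 1)

# T2 (propagated reference2): i6 is replaced by the broadcast extent 1,
# so the corresponding factor collapses to ceilDiv(1, 32) == 1.
blocks_ref2 = ceil_div(ceil_div(i0, TILE) * ceil_div(1, TILE), 1)

print(blocks_ref1)  # 512
print(blocks_ref2)  # 32
```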

segmented fusion for the reference:

g{(transpose)
inputs:
T1_g[ iS13{i0}, bS4{1}, iS5{i6} ] float
T2_g[ iS6{i0}, rS7{i1}, bS8{1} ] float
outputs:
T4_g[ iS11{i0}, iS12{i6} ] float
 
 
T3_g[ iS14{i0}, iS10{i6} ]
   = squeeze( T1_g[ iS13{i0}, bS4{1}, iS5{i6} ] )
(1)
T4_g[ iS11{i0}, iS12{i6} ]
   = T2_g[ iS6{i0}, rS7{i1}, bS8{1} ]
   * T3_g[ iS14{i0}, iS10{i6} ];
(2)
}

@jjsjann123
Collaborator Author

Logging an offline conversation with @naoyam:

He suggested that the reduction domain shouldn't matter here, and that we don't have a protocol for where to put a dangling reduction IterDomain on inputs.

I'll just reorder it past the tiling. 🤞 Hopefully this works well with propagation to the other tensors in group2 of the transpose scheduler.
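
The intent of the reorder can be sketched on the leaf domain of T2 printed earlier. This is a toy model in plain Python, not the code from the fix: `move_reductions_outermost` is a hypothetical helper, and moving the reduction domain to the outermost position is an assumption; the thread only says it is moved out of the way of the tiling.

```python
def move_reductions_outermost(leaf_domain):
    """Return the leaf domain with reduction entries moved to the front.

    Each entry is a (name, kind) pair with kind in {"i", "b", "r"}.
    The relative order inside each group is preserved (stable partition).
    """
    reductions = [d for d in leaf_domain if d[1] == "r"]
    others = [d for d in leaf_domain if d[1] != "r"]
    return reductions + others


# Leaf domain of T2 as printed in the dump above.
t2_leaf = [
    ("iS75", "i"),
    ("iS76", "i"),
    ("bS73", "b"),
    ("iS71", "i"),
    ("rS7", "r"),
]

reordered = move_reductions_outermost(t2_leaf)

# The scheduler expects the two tile dimensions to be innermost.
inner_before = [name for name, _ in t2_leaf[-2:]]
inner_after = [name for name, _ in reordered[-2:]]

print(inner_before)  # ['iS71', 'rS7']
print(inner_after)   # ['bS73', 'iS71']
```

After the reorder, the two innermost entries are the tile domains `bS73{32}` and `iS71{32}`, so a merge over them no longer touches `rS7`.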

@jjsjann123
Collaborator Author

Note to self: I'm also not totally sure how we handle dims_merged_with_1/2 in maybeBuildVirtualInnerDims. Are we safely ignoring reduction IterDomains there? It seems not, since we naively look at shape_in_ref1. I should tweak the repro to verify that as well.

@jjsjann123
Collaborator Author

jjsjann123 commented Jan 26, 2024

TEST_F(TransposeTest, TrivialReductionIterDomainOnInputsIssueRepro1659_part2) {
  auto fusion = std::make_unique<Fusion>();
  auto fusion_ptr = fusion.get();
  FusionGuard fg(fusion_ptr);

  auto tv0 = TensorViewBuilder()
                 .ndims(4)
                 .contiguity({true, std::nullopt, true, std::nullopt})
                 .shape({-1, 1, -1, 1})
                 .dtype(DataType::Float)
                 .build();
  fusion->addInput(tv0);
  auto tv1 = TensorViewBuilder()
                 .ndims(4)
                 .contiguity({true, true, std::nullopt, true})
                 .shape({-1, -1, 1, -1})
                 .dtype(DataType::Float)
                 .build();
  fusion->addInput(tv1);
  auto tv2 = sum(tv0, {2});
  auto tv3 = squeeze(tv1, std::vector<int64_t>{2});
  auto tv4 = add(tv2, tv3);
  fusion->addOutput(tv4);

  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);

  auto t0 = at::randn({1024, 1, 512, 1}, options);
  auto t1 = at::randn({1024, 128, 1, 4}, options);
  std::vector<c10::IValue> aten_inputs({t0, t1});

  FusionExecutorCache executor_cache(std::move(fusion));
  auto cg_outputs = executor_cache.runFusionWithInputs(aten_inputs);

  auto runtime = executor_cache.getMostRecentKernelRuntime();
  NVF_CHECK(runtime->isSegmented(), "Segmentation expected");
  auto heuristic0 =
      runtime->schedulerHeuristics()->heuristicsList().at(0).get()->heuristic();
  NVF_CHECK(
      heuristic0 == ScheduleHeuristic::Reduction,
      "Unexpected heuristic: ",
      heuristic0);
  auto heuristic1 =
      runtime->schedulerHeuristics()->heuristicsList().at(1).get()->heuristic();
  NVF_CHECK(
      heuristic1 == ScheduleHeuristic::Transpose,
      "Unexpected heuristic: ",
      heuristic1);
  auto tv_ref = t0.sum(2) + t1.squeeze(2);
  testValidate(fusion_ptr, cg_outputs, {t0, t1}, __LINE__, __FILE__);
  // testValidate(fusion_ptr, cg_outputs, {t0, t1}, {tv_ref}, __LINE__, __FILE__);
}

This is a bit scary: I'm getting a wrong result from the example above (after the fix in #1661). Let me figure out whether that fix accidentally impacted the merge axis. 🤞

@jjsjann123
Collaborator Author

I think the indexing issue in the failing example above is a separate problem, so I'll open a separate issue to track it. Meanwhile, the reorder change is small enough that I'd like it to go in as a standalone PR, even though it exposed the other indexing issue (maybe?!).

jjsjann123 added a commit that referenced this issue Feb 1, 2024
Fixes #1659 

Reorders reduction IterDomain so it won't interfere with
scheduling tiling from transpose scheduler.
cowanmeg added a commit to samnordmann/Fuser that referenced this issue Feb 13, 2024
* print bandwidth when perf_debug_verbose is true (NVIDIA#1689)

print bandwidth when `perf_debug_verbose` is true.

* in vectorization validation, add err msg if tv has no definition (NVIDIA#1690)

check the existence of tv definition in vectorization validation

* Accomodate Reduction IterDomains when concretizing reshape extents (NVIDIA#1692)

We register extents for concretization when we concretize reshape. In
order to do that, we line up `IterDomain`s in the symbolic reshaped TV
and the new, concretized one. In cases where the concretized reshape is
trivial, such as when the output shape is the same as the input, we do
not create a new TV. In those cases, we will have the input to the
original `ViewOp` as the concretized output. That input TV might have
reduction domains, as in the provided test, in which case we need to
filter those out when doing this alignment. This small PR just
implements that filtering.

Fixes NVIDIA#1691.

* `MmaOp::evaluate` method (NVIDIA#1675)

* Fix some typos. (NVIDIA#1700)

* `torch.compile` and `eager` benchmarks for `softmax` (NVIDIA#1670)

Adds `torch.compile` and `eager` baseline benchmarks to be used in
weekly benchmark runs.
Issue NVIDIA#1668.

* Add a test for fusions with no inputs. (NVIDIA#1709)

As a follow up to
NVIDIA#1696 (comment).

* Double the size of the fusion cache to workaround a CI issue. (NVIDIA#1702)

By just removing entries when it fills up.

* Check that the reduced axis is sharded on producer in isLowerableToCommunication (NVIDIA#1695)

Currently, a reduction is lowerable to a communication iff only one axis
is reduced and this axis is sharded across devices on the **producer**
side.
Before this patch, we would mistakenly check that the axis is sharded on
**consumer** side, which led to some runtime assert error.
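
The stated condition can be written as a small predicate. This is only an illustration of the rule as described in this commit message: the function below is a hypothetical Python stand-in with made-up parameters, not the signature of the real `isLowerableToCommunication`.

```python
def is_lowerable_to_communication(reduced_axes, producer_sharded_axes):
    """Toy version of the rule: exactly one axis is reduced, and that axis
    is sharded across devices on the producer side.

    reduced_axes: list of axis positions being reduced.
    producer_sharded_axes: set of producer axis positions that are sharded.
    """
    return len(reduced_axes) == 1 and reduced_axes[0] in producer_sharded_axes


# One reduced axis, sharded on the producer: lowerable.
print(is_lowerable_to_communication([0], {0}))
# The reduced axis is not sharded on the producer: not lowerable.
print(is_lowerable_to_communication([1], {0}))
# More than one reduced axis: not lowerable.
print(is_lowerable_to_communication([0, 1], {0, 1}))
```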

* Add blank impl of isLowerableToCommunication. (NVIDIA#1698)

isLowerableToCommunication is used in a few places to print error
messages or short-circuit loops. Those places appear to be places that
are intended to largely be used behind the distributed path. It's easier
to just define the API instead of trying to conditionalize all the use
sites and invent non-USE_DISTRIBUTED behavior.

* Multidevice segmenter (NVIDIA#1696)

# What
Add an option in the segmenter to segment resharding Expr in separate
singleton segment.
To trigger it, set the segmenter's options as follows:
```
    SegmentCandidateFinderOptions options{
        .run_translate_welford = false,
        .run_combine_reductions = false,
        .run_herrmann_merge = true,
        .run_final_merge = true,
        .only_segment_resharding_exprs = true};
```
and use the segmenter as follows with any (possibly dummy) inputs:
```
KernelArgumentHolder dummy_inputs;
auto segmented_fusion = SegmentCandidateFinder::segment(std::move(fusion), dummy_inputs, options);
```
If `only_segment_resharding_exprs` is set to `false` (which is the case
by default), the behavior of the segmenter is unchanged.


We also provide a quite wide testing suite to validate our
implementation.

# Why 
Resharding Exprs need to be handled differently than other Exprs because
we want them to result in posting a network collective from the host.
Therefore those expressions cannot (for now) be fused to any kernel. For
this reason, we need those Expr to be segmented before and after.

# How
_**Remark:** For now, the segmenter is only used [at one place before
scheduling and compiling the
fusion](https://github.com/NVIDIA/Fuser/blob/1603f39bab8c1bbe12e38f2b5de53dec3b7cc373/csrc/kernel_cache.cpp#L990)._

Recall that the segmenter first creates as many segments as there are
Expr and then tries to merge the neighbour segments incrementally in an
eager manner. The method
```
bool SegmentCandidateFinder::codeGenSupportedMerge(
    SegmentedGroup* group1,
    SegmentedGroup* group2) 
```
returns whether two groups can be merged (i.e. fused into one kernel). 

With the current patch, if
`SegmentCandidateFinderOptions::only_segment_resharding_exprs` is set to
`true`, then the usual behavior of `codeGenSupportedMerge` is bypassed
and the function returns whether one Expr among the groups is
resharding.

Because this segmentation shouldn't depend on the inputs data, we use
default (aka empty) `KernelArgumentHolder`, from which it is invalid to
instantiate a `SchedulerRuntimeInfo runtime_info_`. For this reason, we
had to make the latter attribute optional.

# Future/other directions

Another way to achieve the same result is to manually add segment bounds
surrounding the resharding Exprs as was suggested by @wujingyue here
NVIDIA#1571

The current implementation looks a bit "hacky" and should be
integrated more properly once multidevice schedulers are implemented
and/or the segmenter is refactored.

Later, we might wanna be able to fuse communications and computes and
also communications between them. This would require a more advanced
segmenter and scheduler, but hopefully this patch could serve as a good
basis

# Example:
consider the fusion:
```
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  TensorView* tv0 = makeContigTensor({4});
  fusion->addInput(tv0);
  TensorView* tv1 = sum(tv0,{3});
  TensorView* tv2 = set(tv1);
  TensorView* tv3 = sum(tv2, {2});
  fusion->addOutput(tv3);
```

Manually scheduled as follows:
```
  DeviceMesh mesh({0,1,2,3});
  for (auto tv : {tv0, tv1, tv2, tv3}) {
    tv->setDeviceMesh(mesh);
  }
  tv0->axis(0)->parallelize(ParallelType::DIDx);
  tv1->axis(0)->parallelize(ParallelType::DIDx);
```
This scheduling implies that
- `tv0` and `tv1` are fully sharded on the devices {0,1,2,3}
- `tv2` and `tv3` are fully replicated on those same devices
- consequently, the "set" operation on the line `tv2 = set(tv1)`
actually embeds an "AllGather" network collective. This Expr is
resharding while all the other exprs are not. We thus expect this
expression to constitute an unmergeable segment.

The segmenter in this situation with the
option `SegmentCandidateFinderOptions::only_segment_resharding_exprs` set
to `true` will result in three segments:
- Compute segment 1: with the expr `tv1 = sum(tv0,{3})`
- Communication segment 1:  with the expr `tv2 = set(tv1)`
- Compute segment 2: with the expr `tv3 = sum(tv2, {2})`
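
The three-segment outcome can be reproduced with a toy greedy merger in plain Python. This is a sketch under a stated assumption, not the segmenter's code: two neighbouring groups are allowed to merge only if neither contains a resharding expression. The `segment` function and the string labels are hypothetical.

```python
def segment(exprs, is_resharding):
    """Greedily merge neighbouring exprs of a linear chain into segments.

    exprs: expression labels in topological order.
    is_resharding: predicate telling whether an expression is resharding.
    Assumption: a merge is allowed only if neither side contains a
    resharding expression, so each resharding expr stays a singleton.
    """
    segments = []
    for expr in exprs:
        can_merge = (
            bool(segments)
            and not is_resharding(expr)
            and not any(is_resharding(e) for e in segments[-1])
        )
        if can_merge:
            segments[-1].append(expr)
        else:
            segments.append([expr])
    return segments


exprs = ["tv1 = sum(tv0,{3})", "tv2 = set(tv1)", "tv3 = sum(tv2, {2})"]
resharding = {"tv2 = set(tv1)"}

result = segment(exprs, lambda e: e in resharding)
for seg in result:
    print(seg)
```

With a longer chain, non-resharding neighbours would fuse into one compute segment on each side of the `set`, while the `set` itself would remain alone.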

* Vectorization Factor patch for computeInfoC2P with Broadcast in mapped IterDomain (NVIDIA#1625)

Fixes NVIDIA#1567

This PR patches vectorization factor in
`ContiguousInnerDimensionsMapper::computeInfoC2P`.

Handling of resolved broadcast dimension should be made on mapped
consumer tensors' from_ids, instead of the root_domain order. Added a
few tests per @zasdfgbnm 's suggestion:

```
Case 0:
T2[1024, 2, 512] = T0[1024, 2, 1] + T1[1024, 2, 512]
allocation = rfactor
--> T0 has no vectorization

Case 1:
T2[1024, 512, 2] = T0[1024, 1, 2] + T1[1024, 512, 2]
allocation = rfactor
--> T0 has vectorization 2

Case 2:
T2[1024, 512, 2] = T0[1024, 1, 2] + T1[1024, 512, 2];
T3[512, 1024, 2] = transpose(T2[1024, 512, 2])
allocation = rfactor
*except T1 has stride_order {1, 2, 0}
--> T0 has vectorization 4

Case 3:
T2[512, 1024, 2] = T0[1, 1024, 2] + T1[512, 1024, 2]
T3[1024, 512, 2] = transpose(T2[512, 1024, 2])
allocation = rfactor
--> T0 has vectorization 2
```

---------

Co-authored-by: Jacob Hinkle <[email protected]>
Co-authored-by: Gao, Xiang <[email protected]>

* transpose scheduler fix: reduction IterDomain on input tensors (NVIDIA#1661)

Fixes NVIDIA#1659 

Reorders reduction IterDomain so it won't interfere with
scheduling tiling from transpose scheduler.

* Convert reduction of expanded dims to squeeze (NVIDIA#1679)

See comment in arith.cpp for details.

One controversial change here is to allow squeezing expanded dimensions,
both in our IR's `SqueezeOp` and in the user-facing functions `squeeze`.
This results in actually removing those dimensions. This behavior
diverges from PyTorch, whose `squeeze` command will ignore requested
squeezes if the size is not 1 regardless of whether that dimension is
expanded. I'm happy to discuss this change and potentially take another
course, but I think we do need to be able to remove expanded axes (see
NVIDIA#1174 (comment) for
another case where I encountered this limitation).

Fixes NVIDIA#1678

* Make sure ValGraphs are created deterministically (NVIDIA#1714)

While I was working on NVIDIA#32, I sometimes saw non-deterministic results.
Hope this is the only source of non-determinism.

* Fix squeeze-related errors (NVIDIA#1717)

This fixes current failures in `pytest_ops.py -k squeeze` and some
integration failures.

This restores our previous semantics for squeeze, which **do not match
PyTorch**. Namely, if squeeze is provided a dimension that cannot be
squeezed, we will always raise an error.

* NVFUSER_DISTRIBUTED instead of USE_DISTRIBUTED (NVIDIA#1711)

* Add the missing `clang-format on` and reformat. (NVIDIA#1722)

* Print a newline before the header. (NVIDIA#1720)

* Associate each fusion cache with its local rank in distributed setting. (NVIDIA#1699)

### Problem:
Currently, automatic serialization saves a single cache regardless of
the number of devices. In a distributed setting, each process restores
its fusion cache from the same common workspace. However, this workspace
only contains the CUDA kernels for a single device. The remaining
processes must recompile the kernels for their devices.

### Solution:
A separate process is created for each device with `ddp` or `fsdp` and
each process contains a separate `FusionCache`. This PR associates each
fusion cache with its local rank in a distributed setting, allowing
automatic serialization to create a separate workspace for each device.
During deserialization, each process loads the workspace associated with
its local rank.

* Vectorized serial grid reduction (NVIDIA#1528)

This change allows us to use vectorized loads/stores in
`serialReductionStep`. The generated kernel now looks like
```c++
  NVFUSER_UPDATE_MAGIC_ZERO;
  grid_sync::blockSerializeWait<false, false, true>(&T5[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
  #pragma unroll
  for(nvfuser_index_t i16 = 0; i16 < 4LL; ++i16) {
    nvfuser_index_t i17;
    i17 = 32LL * i16;
    nvfuser_index_t i18;
    i18 = 4096LL * i16;
    nvfuser_index_t i19;
    i19 = i5 + i18;
    nvfuser_index_t i20;
    i20 = -i18;
    #pragma unroll
    for(nvfuser_index_t i21 = 0; i21 < 8LL; ++i21) {
      nvfuser_index_t i22;
      i22 = 512LL * (i21 + nvfuser_zero);
      Array<float, 4LL, 4> T3;
      T3.set(float(0.000000000e+00f));
      reduction::serialReductionStep</*vec_size=*/4>(
        &T3[0LL],
        &T2[(i17 + (4LL * i21))],
        0.000000000e+00f,
        &T6[(i19 + i22)],
        [](float &a, float b) { a = a + b; },
        index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
        index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
        true,
        true);
      if ((b7 && (i6 < (i20 - i22)))) {
        loadLocalToGlobal<float, /*vec_size=*/4, /*is_volatile=*/false>( &T1[(i19 + i22)], &T3[0LL]);
      }
    }
  }
  grid_sync::blockSerializeRelease<false, false, true>(&T5[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
  NVFUSER_UPDATE_MAGIC_ZERO;
```

* removing out-dated assert on python API (NVIDIA#1724)

removing out-dated asserts in python API `define_vector`;
adding a tests verifying the behavior

* make ci green again (NVIDIA#1730)

skip failing test.

Please enable it once we patch NVIDIA#1728

* Remove unnecessary `MATCHER_P`. (NVIDIA#1729)

* Fix Issue NVIDIA#1734 (NVIDIA#1735)

Closes Issue NVIDIA#1734

* Rename `AliasType` -> `AllocationType` (NVIDIA#1732)

* Skip executing a kernel if it's empty. (NVIDIA#1723)

I could change `compileFusion` to skip compilation as well. It turned
out to be more complicated than I expected, so I took the easier route
to skip just execution, which is at least an incremental improvement.

* don't cache slice input tv (NVIDIA#1705)

If the input tv is used by slice, don't cache it.
Fix NVIDIA#1697

* Make `MmaOp::evaluate` return output of the same dtype as `MmaOp` (NVIDIA#1733)

* Turing/Ampere Mma tests without `BroadcastOp` (NVIDIA#1672)

This PR renames `matmulAtInput` into `matmulAtInput2D`, explicitly
showing that it generates 2D inputs. This PR also adds a
`matmulAtInput3DTuring`, which is used to generate the 3D fusion inputs
(for example `[M, 1, K]` and `[1, K, N]`) for matmul. The `MmaTest` for
Turing and Ampere is modified to exclude the `BroadcastOp` and use the
3D version for generating fusion inputs. This is only the initial step
for making `scheduleMatmul` schedule a fusion not containing
`BroadcastOp`, I intentionally keep it small. Other changes will be
added in followup PRs.

Fixes NVIDIA#1628

* io_alias_ const update (NVIDIA#1740)

* Add benchmarks for RoPE. (NVIDIA#1739)

This PR adds two implementations of the RoPE module and benchmarks them
for NVIDIA#1597.

`rope_with_cat_fusion` mimics the Hugging Face implementation.
`rope_without_cat_fusion` implements an idea from @nikitaved to avoid
concatenation. Even though it looks difficult for the compiler to do it
all automatically, it's still useful to keep a record of the idea.

As a side change, I made `fd.define_tensor` to accept empty contiguity.

* Make nvfuser matmul benchmarks HSH instead of HSS (NVIDIA#1712)

This matches the `at::matmul` baselines.

This PR also adds a few more problem sizes, and runs each eagermode
baseline with and without FP16 reduction allowed.

* Reduce number of `MmaTest`s (NVIDIA#1738)

This PR is stacked on top of NVIDIA#1672

Turing/Ampere mma is only TN, so it makes no sense to test other layouts
in `MmaTest`s. These tests are intended to test mma instructions,
`ldmatrix` and `ldmatrix.trans` is tested separately in other unit
tests. Similar for `HopperRS` tests.

* Weekly Benchmarks Input Range (NVIDIA#1708)

* Rename axes= to dims= in frontend (NVIDIA#1741)

Currently we accept `axes=` for some ops like `fd.ops.sum` and `dims=`
for others like `fd.ops.squeeze`.

This is a small attempt to make the frontend arguments more consistent.
This change renames the `axis=` kwarg to `dim=` and the same for `axes=`
-> `dims=`.

I think we're free to set our own convention, but for reference:
- PyTorch uses `dim=` in most places and accepts either a single dim or
multiple using that same argument name, where applicable.
- Numpy uses `axis=` and, like PyTorch, accepts a list where applicable.
- `jax.lax` uses `dimensions=`

* Avoid unused smem workspace for serial grid reductions (NVIDIA#1727)

GridReduction can be lowered to either `gridReduce` or
`serialReductionStep`. `gridReduce` requires a smem workspace in order
to use multiple threads to aggregate partial sums. However,
`serialReductionStep` does not coordinate among threads and has no use
for a workspace. This change simply disables allocating that little bit
of extra shared memory if our only grid reductions are serial, which
currently only happens in split-K GEMM.

This reduces the smem allocated in a simple test from 16896 B to 16384 B
(512 B less, i.e. about 97% of the original). More importantly, this makes the computation in
`mma_utils::generateSharedMemoryEpilogueHeuristics()` more accurate.
Tests are updated to check that this computation is accurate.

The change in `kernel.cpp` is responsible for reducing actual smem usage
for split-K. The changes to `mma_utils` and `test_gpu_tensorcore.cpp`
are needed for adding testing that our expected smem usage matches the
actual usage.

* Issue NVIDIA#1748 (NVIDIA#1749)

Closes Issue NVIDIA#1748.
Apart from `c10::cuda::GetDevice`, no other functionality seems
affected.

* Rename `axes` to `dims` in benchmarks fusion definitions (NVIDIA#1751)

Changes the kwarg `axes` to `dims` following the API change in PR NVIDIA#1741.

* Bump matmul benchmark checkMatch() tolerance (NVIDIA#1747)

This is necessary due to recent switch to HSH

Fixes NVIDIA#1746

* linter

* change guard USE_DISTRIBUTED to NVFUSER_DISTRIBUTED in test/test_multidevice_sharding.cpp

* linting

* linter and cleanup

* remove allocator.h/cpp files

* Device index patch (NVIDIA#1752)

Fixes NVIDIA#1748 

guard c10::cuda::GetDevice API change on TORCH_VERSION

This change ensures that we can build against stable releases `<
2.2.0`, as well as TOT after
pytorch/pytorch#119142

For the 2.3.0 nightly, if someone accidentally checks out a commit before the
patch, the build will still fail.

* fixing multidevice build (NVIDIA#1753)

API change coming from pytorch/pytorch#119421

* patching API GUARD (NVIDIA#1754)

patching API version guard so we'll still be able to build against older
pytorch version.

* Add a visitor for ValGraph (NVIDIA#1713)

Used in the loop promotion analysis. Extracted from NVIDIA#32

* empty commit for triggering CI

---------

Co-authored-by: Liqiang Lu <[email protected]>
Co-authored-by: Jacob Hinkle <[email protected]>
Co-authored-by: Priya Mishra <[email protected]>
Co-authored-by: Jingyue Wu <[email protected]>
Co-authored-by: Tom Fogal <[email protected]>
Co-authored-by: jjsjann123 <[email protected]>
Co-authored-by: Gao, Xiang <[email protected]>
Co-authored-by: Naoya Maruyama <[email protected]>
Co-authored-by: Meghan Cowan <[email protected]>
Co-authored-by: Ryan Spring <[email protected]>