Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix merge error and compile error and bugs in buildInitValue #13

Merged
merged 1 commit into from
May 30, 2024

Conversation

minjang
Copy link
Collaborator

@minjang minjang commented May 30, 2024

My bad. I won't resolve merge conflicts in the github editor directly.

And I fixed/updated some code in buildInitValue. See the inlined comments.

else if (kind == vector::CombiningKind::MINSI)
initVal = rewriter.getIntegerAttr(
elemTy, static_cast<int64_t>(
1UL << (elemTy.getIntOrFloatBitWidth() - 1) - 1));
(1UL << (elemTy.getIntOrFloatBitWidth() - 1)) - 1));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clang compiler complains this:

Operator '<<' has lower precedence than '-'; '-' will be evaluated first (fix available)clang(-Wshift-op-parentheses)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Do you think we also need to change 1UL to 1ULL or use an explicit cast to uint64_t to have all compilers covered?

@@ -149,11 +149,11 @@ struct ReduceOpConversion : public OpConversionPattern<triton::ReduceOp> {
else if (kind == vector::CombiningKind::MAXSI)
initVal = rewriter.getIntegerAttr(
elemTy,
static_cast<int64_t>(1UL << (elemTy.getIntOrFloatBitWidth() - 1)));
static_cast<int64_t>(-(1UL << (elemTy.getIntOrFloatBitWidth() - 1))));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a sort of INT_MIN. But super interestingly, the dump of initVal remained the same? But what do you think? I think this explicit - is clearer for signed int64_t?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the point here to always have negative int64_t passed to the attribute? The dump is similar because with your change you modify only those bits that don't matter for the used type.

@@ -168,10 +168,10 @@ struct ReduceOpConversion : public OpConversionPattern<triton::ReduceOp> {
kind == vector::CombiningKind::MAXNUMF) {
if (elemTy.isF32())
initVal =
rewriter.getF32FloatAttr(-std::numeric_limits<float>::infinity());
rewriter.getF32FloatAttr(std::numeric_limits<float>::lowest());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think lowest is clearer as well. Let me know this is okay for you.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what our semantics should be for corner cases like when all input elements are NaN. But I'd expect -INF to be returned instead of lowest() in case all inputs are -INF.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, looking at more -INF looks good.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking more thoroughly about corner cases, I realized we should use different init values for MAXIMUMF and MAXNUMF. To correctly cover all NaNs in input, we should use NaN as init value for MAXIMUMF (any not NaN would replace it then) and -INF for MAXNUMF (any NaN would replace it then). Similar for MINIMUMF and MINNUMF. What do you think?

Copy link
Collaborator Author

@minjang minjang May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ienkovich Alright, we totally missed it. We can simply get the first element of the source (using vector::ExtractOp?). This would delete the entire buildInitValue .

I confirmed that the GPU also does like that: see ReduceOpConversion::accumulate

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to choose init value so that the operation to combine it with the reduction result is NOP. I'm not sure the compiler would be able to optimize-out element extraction and its combination with the reduction result.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, having a constant init value could result in highly optimized code after instruction combines, where the extreme case would be NOP? Do I correctly understand your idea? Right. Element-extraction is an indirection, so compiler is hard to optimize out. In that sense, having a constant init value is good! (extracting the first element would be a fallback then.)

Another question: should we consider NaN as input? How much is this important? I was thinking NaN was deemed an erroneous state in valid tensors? +/-INF are definitely valid. Hm, NaN can be compared.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are unit tests that check kernel results for NaN inputs. Didn't see such a reduction though.

I'm currently improving the reduction lowering code to cover more cases. Will also fix the init value there.

@minjang minjang marked this pull request as ready for review May 30, 2024 17:02
@minjang minjang requested a review from ptillet as a code owner May 30, 2024 17:02
@minjang minjang force-pushed the fix-merge-and-compile-errors branch from 58356d5 to f2a6e60 Compare May 30, 2024 23:19
@ienkovich ienkovich merged commit aa4bb31 into triton-lang:main May 30, 2024
@minjang minjang deleted the fix-merge-and-compile-errors branch June 13, 2024 05:02
minjang added a commit to minjang/triton-cpu that referenced this pull request Jun 22, 2024
minjang pushed a commit that referenced this pull request Jun 24, 2024
When running
[convert_blocked1d_to_slice0](https://github.com/triton-lang/triton/blob/0ba5f0c3cd029d5c3d1f01b9bf29dac32c27345e/test/Conversion/tritongpu_to_llvm.mlir#L924)
Triton ends up computing a rank of a matrix with 0 columns during linear
layout lowering, which trips up f2reduce, and causes undefined behavior,
detectable through
[UBSAN](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html).

Fix this by returning the rank (0) early in these cases, without calling
f2reduce.

<details><summary>Stack trace</summary>
<p>

```
third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30: runtime error: shift exponent 18446744073709551615 is too large for 64-bit type 'unsigned long long'
    #0 0x556ee2fea3be in inplace_rref_small third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30
    #1 0x556ee2fea3be in f2reduce::inplace_rref_strided(unsigned long*, unsigned long, unsigned long, unsigned long) third_party/triton/third_party/f2reduce/f2reduce.cpp:470:9
    #2 0x556ee2ea70da in getMatrixRank third_party/triton/lib/Tools/LinearLayout.cpp:125:3
    #3 0x556ee2ea70da in mlir::triton::LinearLayout::checkInvariants(bool) third_party/triton/lib/Tools/LinearLayout.cpp:299:7
    #4 0x556ee2ea656d in mlir::triton::LinearLayout::tryCreate(llvm::MapVector<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>, llvm::DenseMap<mlir::StringAttr, unsigned int, llvm::DenseMapInfo<mlir::StringAttr, void>, llvm::detail::DenseMapPair<mlir::StringAttr, unsigned int>>, llvm::SmallVector<std::__u::pair<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>>, 0u>>, llvm::ArrayRef<std::__u::pair<mlir::StringAttr, int>>, bool) third_party/triton/lib/Tools/LinearLayout.cpp:190:41
    #5 0x556ee2eb2150 in mlir::triton::LinearLayout::divideRight(mlir::triton::LinearLayout const&) third_party/triton/lib/Tools/LinearLayout.cpp:654:51
    #6 0x556ee2ee1c39 in mlir::cvtNeedsSharedMemory(mlir::RankedTensorType, mlir::RankedTensorType) third_party/triton/lib/Analysis/Utility.cpp:652:14
    #7 0x556ee2cf38fd in mlir::triton::getRepShapeForCvtLayout(mlir::triton::gpu::ConvertLayoutOp) third_party/triton/lib/Analysis/Allocation.cpp:66:8
    #8 0x556ee2cf3efa in mlir::triton::getScratchConfigForCvtLayout(mlir::triton::gpu::ConvertLayoutOp, unsigned int&, unsigned int&) third_party/triton/lib/Analysis/Allocation.cpp:95:19
    #9 0x556ee2cf6057 in mlir::triton::AllocationAnalysis::getScratchValueSize(mlir::Operation*) third_party/triton/lib/Analysis/Allocation.cpp:272:24
    #10 0x556ee2cf5499 in operator() third_party/triton/lib/Analysis/Allocation.cpp:343:7
    #11 0x556ee2cf5499 in void llvm::function_ref<void (mlir::Operation*)>::callback_fn<mlir::triton::AllocationAnalysis::getValuesAndSizes()::'lambda'(mlir::Operation*)>(long, mlir::Operation*) third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
    #12 0x556edeeee7a9 in operator() third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
    #13 0x556edeeee7a9 in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:174:5
    #14 0x556edeeee87c in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:182:9
    #15 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), mlir::Operation *, void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:313:10
    #16 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Operation.h:794:12
    #17 0x556ee2cf49e7 in mlir::triton::AllocationAnalysis::getValuesAndSizes() third_party/triton/lib/Analysis/Allocation.cpp:341:16
    #18 0x556ee2cf4852 in run third_party/triton/lib/Analysis/Allocation.cpp:182:5
    #19 0x556ee2cf4852 in AllocationAnalysis third_party/triton/lib/Analysis/Allocation.cpp:169:5
    #20 0x556ee2cf4852 in mlir::Allocation::run(llvm::DenseMap<mlir::FunctionOpInterface, mlir::Allocation, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>, llvm::detail::DenseMapPair<mlir::FunctionOpInterface, mlir::Allocation>>&) third_party/triton/lib/Analysis/Allocation.cpp:627:3
    #21 0x556ee1677402 in operator() third_party/triton/include/triton/Analysis/Allocation.h:227:26
    #22 0x556ee1677402 in void mlir::CallGraph<mlir::Allocation>::doWalk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)>(mlir::FunctionOpInterface, llvm::DenseSet<mlir::FunctionOpInterface, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>>&, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)) third_party/triton/include/triton/Analysis/Utility.h:350:7
    #23 0x556ee16756b3 in walk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, (lambda at third_party/triton/include/triton/Analysis/Allocation.h:222:9), (lambda at third_party/triton/include/triton/Analysis/Allocation.h:224:9)> third_party/triton/include/triton/Analysis/Utility.h:242:7
    #24 0x556ee16756b3 in mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp) third_party/triton/include/triton/Analysis/Allocation.h:220:5
    #25 0x556ee2c2bf18 in (anonymous namespace)::AllocateSharedMemory::runOnOperation() third_party/triton/lib/Conversion/TritonGPUToLLVM/AllocateSharedMemory.cpp:26:22
...
UndefinedBehaviorSanitizer: invalid-shift-exponent third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30 
```
</p>
</details>
minjang added a commit that referenced this pull request Jun 24, 2024
Devjiu pushed a commit to Devjiu/triton-cpu that referenced this pull request Aug 13, 2024
int3 pushed a commit that referenced this pull request Aug 29, 2024
minjang added a commit that referenced this pull request Sep 22, 2024
minjang added a commit that referenced this pull request Oct 22, 2024
minjang added a commit that referenced this pull request Oct 24, 2024
Devjiu pushed a commit to Devjiu/triton-cpu that referenced this pull request Nov 13, 2024
Extends contraction lowering to XSMM by rewriting plain GEMM into
a BRGEMM kernel when possible.

The rewrite improves performance of larger K block sizes thanks to
extra reduction dim tiling. Use of BRGEMM kernel also enables online
VNNI packing for BF16.
Devjiu pushed a commit to Devjiu/triton-cpu that referenced this pull request Nov 13, 2024
Extends contraction lowering to XSMM by rewriting plain GEMM into
a BRGEMM kernel when possible.

The rewrite improves performance of larger K block sizes thanks to
extra reduction dim tiling. Use of BRGEMM kernel also enables online
VNNI packing for BF16.
ienkovich pushed a commit to ienkovich/triton-cpu that referenced this pull request Nov 20, 2024
Extends contraction lowering to XSMM by rewriting plain GEMM into
a BRGEMM kernel when possible.

The rewrite improves performance of larger K block sizes thanks to
extra reduction dim tiling. Use of BRGEMM kernel also enables online
VNNI packing for BF16.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants