Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TorchToStablehlo lit test failure with scatter.mlir #1962

Closed
sjain-stanford opened this issue Jan 28, 2024 · 11 comments · Fixed by #1972
Closed

TorchToStablehlo lit test failure with scatter.mlir #1962

sjain-stanford opened this issue Jan 28, 2024 · 11 comments · Fixed by #1972
Assignees

Comments

@sjain-stanford
Copy link
Contributor

sjain-stanford commented Jan 28, 2024

What happened?

Hi ⁠stablehlo, we need some help on the torch-mlir side of things with a lit test error we hit when bumping stablehlo that hits an unexpected error: Expects non-empty reduction block for type inference

PR: llvm/torch-mlir#2821
Failing CI: https://github.com/llvm/torch-mlir/actions/runs/7687361450/job/20947291128?pr=2821

Here's the error:

bazel run @torch-mlir//test/Conversion:TorchToStablehlo/scatter.mlir.test 

...external/torch-mlir/test/Conversion/TorchToStablehlo/scatter.mlir
within split at <stdin>:1 offset :33:8: error: unexpected error: Expects non-empty reduction block for type inference                                                                               
  %0 = torch.aten.scatter.src %arg0, %int0, %arg1, %arg2 : !torch.vtensor<[?,?],si64>, !torch.int, !torch.vtensor<[?,?],si64>, !torch.vtensor<[?,?],si64> -> !torch.vtensor<[?,?],si64>             
       ^                                                                                                             
LLVM ERROR: Failed to infer result type(s).

Seems like this check was added in #1918 but I lack the historic context to know what needs fixing.

Steps to reproduce your issue

  1. Clone https://github.com/sjain-stanford/torch-mlir and checkout sambhav/bazel_fix branch
  2. Fetch git submodules: git submodule update --init --progress
  3. Launch docker: ./utils/bazel/docker/run_docker.sh
  4. Test: bazel run @torch-mlir//test/Conversion:TorchToStablehlo/scatter.mlir.test

Version information

No response

@sdasgup3
Copy link
Member

Hi @sjain-stanford
Thanks for posting the issue!

#1869 allows the block arguments of reduce op to have different element types than that of the input arguments of reduce op and the output type of the reduce op has to equal to those block arguments. As a consequence the output type of reduce op can no longer be inferred from the operand types: The type inference needs to somehow have the information regarding the block-arguments's element type. The check that was added at #1918 is a guard against inferring reduce op type with empty blocks, which is what is happening at https://sourcegraph.com/github.com/llvm/torch-mlir/-/blob/lib/Conversion/TorchToStablehlo/GatherScatter.cpp?L336.

We encountered a similar problem with creating mhlo reduce op with empty reduction block and the solution we proposed was to create a custom builder here and here.

In StableHLO we do not have such builder as we do not see a use-case for that until now with the issue you posted. IMO the solution will be along the similar lines and I will be working on that.

I will keep you posted on updates.

@sdasgup3 sdasgup3 self-assigned this Jan 29, 2024
@sjain-stanford
Copy link
Contributor Author

Thanks @sdasgup3 for looking into this. PLMK if/when you have a patch-set you'd like me to test on the torch-mlir side of things, happy to help.

@sdasgup3
Copy link
Member

sdasgup3 commented Jan 29, 2024

@sjain-stanford

Please find the patch at #1965.

On the torch-mlir side, we need the following patch to make sure that the newly supported builder is invoked.

--- a/lib/Conversion/TorchToStablehlo/GatherScatter.cpp
+++ b/lib/Conversion/TorchToStablehlo/GatherScatter.cpp
@@ -334,7 +334,8 @@ LogicalResult ConvertAtenOp<AtenEmbeddingBagPaddingIdxOp>::matchAndRewrite(
     return failure();
 
   auto stablehloReduceOp = rewriter.create<stablehlo::ReduceOp>(
-      op.getLoc(), gatherOutput, initValue, rewriter.getI64TensorAttr({0}));
+      op.getLoc(), gatherOutput, initValue, rewriter.getDenseI64ArrayAttr({0}),
+      elementTy);
 
   Region &region = stablehloReduceOp.getBody();
   Block &block = region.emplaceBlock();
@@ -666,7 +667,8 @@ LogicalResult ConvertAtenOp<AtenScatterSrcOp>::matchAndRewrite(
       /*indexVectorDim=*/indexVecDim);
 
   auto stablehloScatterOp = rewriter.create<stablehlo::ScatterOp>(
-      loc, input, scatterIndicies, src, scatterDimensionNumbers, false, false);
+      loc, inputType, input, scatterIndicies, src, scatterDimensionNumbers,
+      false, false);
 
   // config update computation function: just return the element from src.
   Block &block = stablehloScatterOp.getUpdateComputation().emplaceBlock();

Please let us know if this works at your end.

@sjain-stanford
Copy link
Contributor Author

@sdasgup3 Thanks for the quick fix. Confirming the test passes locally with your patches 👍🏻 (cherry-picked the stablehlo changes from your fork but will re-apply once it lands on main)

@sdasgup3
Copy link
Member

@sjain-stanford Please let us know if the merged #1965 worked at your end.

@sjain-stanford
Copy link
Contributor Author

@sdasgup3 any idea how this piece of code needs to be refactored?

  DenseIntElementsAttr bcast_attr = DenseIntElementsAttr::get(
      RankedTensorType::get({static_cast<long int>(bcastDims.size())},
                            rewriter.getI64Type()),
      bcastDims);
  auto bcast_op = rewriter.create<stablehlo::BroadcastInDimOp>(
      op->getLoc(), outType, input, bcast_attr);

After bumping to stablehlo@HEAD, I hit some new build errors like so:

external/llvm-project/mlir/include/mlir/IR/Builders.h:508:11: error: no matching member function for call to 'build'                                                                                
    OpTy::build(*this, state, std::forward<Args>(args)...);                                                                                                                                         
    ~~~~~~^~~~~                                                                                                                                                                                     
external/torch-mlir/lib/Conversion/TorchToStablehlo/StablehloLegalizeUtils.cpp:248:28: note: in instantiation of function template specialization 'mlir::OpBuilder::create<mlir::stablehlo::BroadcastInDimOp, mlir::TensorType &, mlir::Value &, mlir::DenseIntElementsAttr &>' requested here                                                                                                          
  auto bcast_op = rewriter.create<stablehlo::BroadcastInDimOp>(                                                                                                                                     
                           ^                                                                                                                                                                        
bazel-out/k8-fastbuild/bin/external/stablehlo/stablehlo/dialect/StablehloOps.h.inc:2438:15: note: candidate function not viable: no known conversion from 'mlir::DenseIntElementsAttr' to '::mlir::DenseI64ArrayAttr' (aka 'DenseArrayAttrImpl<long>') for 5th argument                                                                                                                                 
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Type resultType0, ::mlir::Value operand, ::mlir::DenseI64ArrayAttr broadcast_dimensions);              
              ^                                                                                                                                                                                     
bazel-out/k8-fastbuild/bin/external/stablehlo/stablehlo/dialect/StablehloOps.h.inc:2440:15: note: candidate function not viable: no known conversion from 'mlir::DenseIntElementsAttr' to '::llvm::ArrayRef<int64_t>' (aka 'ArrayRef<long>') for 5th argument                                                                                                                                           
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Type resultType0, ::mlir::Value operand, ::llvm::ArrayRef<int64_t> broadcast_dimensions);              
              ^                                                                                                                                                                                     
bazel-out/k8-fastbuild/bin/external/stablehlo/stablehlo/dialect/StablehloOps.h.inc:2439:15: note: candidate function not viable: no known conversion from 'mlir::DenseIntElementsAttr' to '::mlir::DenseI64ArrayAttr' (aka 'DenseArrayAttrImpl<long>') for 5th argument                                                                                                                                 
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::Value operand, ::mlir::DenseI64ArrayAttr broadcast_dimensions);         
              ^                                                                      

@sdasgup3
Copy link
Member

sdasgup3 commented Jan 31, 2024

@sjain-stanford
Can you try this?

 auto bcast_op = rewriter.create<stablehlo::BroadcastInDimOp>(
      op->getLoc(), outType, input,
      rewriter.getI64TensorAttr(bcastDims));

cc @mlevesquedion

@sjain-stanford
Copy link
Contributor Author

rewriter.getDenseI64ArrayAttr(bcastDims) seems to fix it, and realized the error message was pretty specific/helpful. A lot more conversions needed an update, I think this is all of it: llvm/torch-mlir@a0ad9a8, please take a look to verify it looks good. The build works, I have one lit test that needs an update.

@sjain-stanford
Copy link
Contributor Author

The CMake build now fails with setup_sanitizers. Probably related to #1959 (cc: @fzakaria , @mlevesquedion ). I see the default for STABLEHLO_ENABLE_SANITIZER being OFF, but this code probably needs to be enclosed in an if-endif:

include(SetupSanitizers)
setup_sanitizers()

Error:

  -- Building StableHLO as an external LLVM project
  -- Building with -fPIC
  CMake Error at /_work/torch-mlir/torch-mlir/externals/stablehlo/CMakeLists.txt:147 (include):
    include could not find requested file:
  
      SetupSanitizers
  
  
  CMake Error at /_work/torch-mlir/torch-mlir/externals/stablehlo/CMakeLists.txt:148 (setup_sanitizers):
    Unknown CMake command "setup_sanitizers".
  
  
  -- Configuring incomplete, errors occurred!

@fzakaria
Copy link
Contributor

I'll take a look.

@sjain-stanford
Copy link
Contributor Author

This fixes it: #1972

GleasonK pushed a commit that referenced this issue Jan 31, 2024
Fixes
#1962 (comment)

Context: 
This breaks torch-mlir's CMake build when bumping stablehlo. 

The CMake build
[fails](https://github.com/llvm/torch-mlir/actions/runs/7729517677/job/21072878412?pr=2821)
with setup_sanitizers. Probably related to
#1959 (cc: @fzakaria ,
@mlevesquedion ). I see the default for `STABLEHLO_ENABLE_SANITIZER`
being `OFF`, but this code probably needs to be enclosed in an
`if-endif`:
```
include(SetupSanitizers)
setup_sanitizers()
```
Error:
```
  -- Building StableHLO as an external LLVM project
  -- Building with -fPIC
  CMake Error at /_work/torch-mlir/torch-mlir/externals/stablehlo/CMakeLists.txt:147 (include):
    include could not find requested file:
  
      SetupSanitizers
  
  
  CMake Error at /_work/torch-mlir/torch-mlir/externals/stablehlo/CMakeLists.txt:148 (setup_sanitizers):
    Unknown CMake command "setup_sanitizers".
  
  
  -- Configuring incomplete, errors occurred!
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants