Enable conversion of all_reduce and GSPMD custom_op into TTIR dialect #1351
Conversation
Clang-Tidy found issue(s) with the introduced code (1/6)
Clang-Tidy found issue(s) with the introduced code (2/6)
Clang-Tidy found issue(s) with the introduced code (3/6)
Clang-Tidy found issue(s) with the introduced code (4/6)
Clang-Tidy found issue(s) with the introduced code (5/6)
Clang-Tidy found issue(s) with the introduced code (6/6)
Force-pushed from eb22ca1 to f944c3b
Looks great! Minor syntactic nits
Force-pushed from f944c3b to 6b8d573
Two additional minor nits, otherwise looks good!
def TTIR_MeshShardOp : TTIR_DPSOp<"mesh_shard"> {
  let summary = "Mesh shard operation";
  let description = [{
    MeshShard op
Please make this more descriptive and include an example... Same for all the other ops.
Sure. I will add more detailed descriptions of the ops...
Added detailed descriptions. Let me know if you need further changes.
template <typename srcOpTy>
LogicalResult getReduceType(srcOpTy &srcOp, ReduceType &reduceType) {
  if constexpr (!std::is_same<srcOpTy, mlir::stablehlo::AllReduceOp>::value) {
    return failure();
  }
Would it make sense to just specialize this function for ReduceOp, so you get a compile error instead of a pass error at runtime?
I think there are multiple stablehlo multi-device funcs that could reuse this function in the future. @wooseokTT, feel free to correct me if I'm wrong.
Yes. We are planning to land further CCL ops, including reduce_scatter, that will use this function in the near future. The computation ops will be embedded into the all_reduce/reduce_scatter ops as a computation-type attribute, and they will be gone from the MLIR. So I think the pass error makes sense here.
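To make the tradeoff discussed in this thread concrete, here is a minimal, self-contained sketch of the reviewer's suggestion. Every type in it (LogicalResult, ReduceType, FakeAllReduceOp, FakeDotGeneralOp) is a simplified stand-in, not the real MLIR/StableHLO type: an explicit specialization turns an unsupported op into a compile-time error, whereas the PR's if constexpr guard keeps it a runtime failure() so that future CCL ops such as reduce_scatter can reuse the same template.

#include <type_traits>

// Simplified stand-ins (hypothetical, not the real MLIR types).
enum class ReduceType { Sum, Mean, Max, Min };
struct LogicalResult { bool succeeded; };
inline LogicalResult success() { return {true}; }
inline LogicalResult failure() { return {false}; }
struct FakeAllReduceOp {};   // stand-in for mlir::stablehlo::AllReduceOp
struct FakeDotGeneralOp {};  // stand-in for any unsupported op

// Helper so the static_assert below only fires when the primary template is
// actually instantiated for an unsupported op type.
template <typename T>
struct always_false : std::false_type {};

// Primary template: instantiating it for an unsupported op is a compile-time
// error instead of a runtime failure().
template <typename SrcOpTy>
LogicalResult getReduceType(SrcOpTy &, ReduceType &) {
  static_assert(always_false<SrcOpTy>::value,
                "getReduceType is only implemented for reduce-like ops");
  return failure();
}

// Explicit specialization for the op that is supported today; a future
// reduce_scatter op would get its own specialization.
template <>
LogicalResult getReduceType<FakeAllReduceOp>(FakeAllReduceOp &srcOp,
                                             ReduceType &reduceType) {
  (void)srcOp;                   // the real code inspects the reduction body
  reduceType = ReduceType::Sum;  // e.g. an add reduction maps to Sum
  return success();
}

int main() {
  FakeAllReduceOp allReduce;
  ReduceType kind;
  (void)getReduceType(allReduce, kind);  // compiles and runs
  // FakeDotGeneralOp dot;
  // (void)getReduceType(dot, kind);     // would fail to compile
}

Either choice is reasonable; the sketch only illustrates what the compile-time variant would look like.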
Could you provide a more detailed PR description or perhaps a short document outlining what this PR introduces? Since we’re introducing new concepts into TTIR, it would be really helpful to have a reference that explains the ideas around multi-device functionality and some implementation details. This would make it easier for those who are less familiar with the changes to understand the context and goals.
Force-pushed from 6b8d573 to 28bcbe8
1. TT_Reduce_Type is created to share the computation type with the TTNN dialect.
2. AllReduceOp in TTIR is introduced to accommodate the StableHLO all_reduce op.
3. MeshShardOp in TTIR is introduced to capture GSPMD custom sharding.
4. Realistic test cases are added from JAX/PJRT output.
The current version of importing targets GSPMD input, but our future plans mainly focus on supporting Shardy-based JAX/PJRT output.
Force-pushed from 28bcbe8 to 3b9531e
@mtopalovicTT I updated the PR description with the details. Let me know if you need any further explanation. TT-MLIR is actively evolving, so I believe follow-up PRs will spell out more concrete concepts and details.
@wooseokTT Thanks, the PR description is awesome. This makes things a lot clearer.
As the first step of the multi-device support plan, this PR enables converting MLIR outputs that target the all_reduce op from JAX/OpenXLA(GSPMD)/PJRT. Several follow-up PRs will carry the computation flow from TTIR down to the runtime. The detailed steps are as follows.
(1) Convert MLIRs from JAX/OpenXLA/PJRT to TTIR (this PR)
(2) Pass converted TTIR to TTNN MLIR and Flatbuffer format
(3) Parse TTNN flatbuffer and execute in TT Runtime
Although the current version of the code targets GSPMD-partitioned MLIRs, our future plan mainly aims at supporting Shardy-based JAX/PJRT MLIRs.
In general, a GSPMD-partitioned MLIR has the following computation pattern:
A. Shard inputs for computation across multiple devices
%0 = stablehlo.custom_call @sharding(%arg0) {mhlo.sharding = "{devices=[2,4]<=[8]}"} : (tensor<...>) -> tensor<...>
%1 = stablehlo.custom_call @SPMDFullToShardShape(%0) {mhlo.sharding = "{manual}"} : (tensor<...>) -> tensor<...>
B. Simultaneous compute on multiple devices
%0 = stablehlo.dot_general %arg0, %arg1, contracting_dims = [1] x [0], ... : (tensor<...>, tensor<...>) -> tensor<...>
C. Merge partial computation results using CCL ops
%1 = "stablehlo.all_reduce"(%0) < ... > ( ... ):
%2 = stablehlo.add %arg2, %arg3 : tensor
stablehlo.return %2 : tensor
}) : (tensor<4096x16384xf32>) -> tensor<4096x16384xf32>
D. Concatenate outputs if needed
%5 = stablehlo.custom_call @sharding(%4) {mhlo.sharding = "{manual}"} : (tensor<...>) -> tensor<...>
%6 = stablehlo.custom_call @SPMDShardToFullShape(%5) {mhlo.sharding = "{devices=[2,1,4]<=[8] last_tile_dim_replicate}"} : (tensor<...>) -> tensor<...>
We can already convert B, so this PR adds conversion for parts A, C, and D.
For C, we introduce the TTIR all_reduce op, while for A and D we introduce the new TTIR mesh_shard op.
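As an illustrative sketch of how A, C, and D are told apart during conversion (the enum and function names below are hypothetical; only the custom_call target strings come from the pattern above), the GSPMD boundary custom_calls in A and D can be recognized purely by their target names, after which A and D lower to the new TTIR mesh_shard op and C lowers to the TTIR all_reduce op:

#include <optional>
#include <string_view>

// Hypothetical classification helper; not code from the PR.
enum class GSPMDBoundary {
  FullToShard,  // part A: split the full tensor into per-device shards
  ShardToFull,  // part D: reassemble per-device shards into the full tensor
  Annotation,   // @sharding: carries the mhlo.sharding string
};

// Classify a stablehlo.custom_call by its target name; anything else is not a
// GSPMD sharding boundary and is handled by the existing conversion patterns.
std::optional<GSPMDBoundary> classifyCustomCall(std::string_view target) {
  if (target == "SPMDFullToShardShape")
    return GSPMDBoundary::FullToShard;
  if (target == "SPMDShardToFullShape")
    return GSPMDBoundary::ShardToFull;
  if (target == "sharding")
    return GSPMDBoundary::Annotation;
  return std::nullopt;
}

In the actual conversion, the mhlo.sharding string shown in A and D is what describes how the tensor is split or reassembled across the device mesh.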