
Enable mnist sharding with layout overrides #894

Merged: 7 commits into main from odjuricic/mnist-sharding, Oct 16, 2024

Conversation

odjuricicTT (Contributor):

MNIST sharding works, with a few caveats:

Generated ToLayout ops now have a suffix appended to their location names, in order to keep the names unique and allow overriding op inputs as needed.

Comment on lines 697 to 698
for (int64_t operandIndex = 0, numOperands = op->getNumOperands();
operandIndex < numOperands; ++operandIndex) {
Contributor:

Can't this be for (int64_t operandIndex = 0; operandIndex < op->getNumOperands(); ++operandIndex)?

I guess it's a perf optimization; do we know if it's worth it, though?
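For illustration, a minimal sketch of the two forms being compared (the process() helper is a hypothetical stand-in for the loop body, not the PR's actual code):

// Hoisted form (as in the PR): getNumOperands() is evaluated once,
// before the loop starts.
for (int64_t operandIndex = 0, numOperands = op->getNumOperands();
     operandIndex < numOperands; ++operandIndex) {
  process(op->getOperand(operandIndex));
}

// Simpler form: getNumOperands() is re-evaluated on every iteration.
// This only matters if the call is not trivially cheap or the operand
// count can change inside the loop body; for MLIR ops it is typically
// a cheap accessor.
for (int64_t operandIndex = 0; operandIndex < op->getNumOperands();
     ++operandIndex) {
  process(op->getOperand(operandIndex));
}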

Contributor Author:

Sure.

Comment on lines +14 to +15
// This is a workaround for the lack of program config selection in ttnn.matmul.
// The logic here is temporary and totally incomplete.
Contributor:

Do we have a workaround for emitc as well?

Contributor Author:

This is a temporary workaround just for MNIST, until the fix for this lands in metal. Do we run MNIST through emitc atm?

Contributor:

We want to :)

In general, I don't think it's okay to implement workarounds in only one of the "backends" of the IR. This puts pressure on the backends without the workaround to play catch-up. We're already deep enough into workarounds in ttrt; adding more will only widen the gap with emitc.

We should either:

  1. workaround this at the IR level
  2. wait for metal to fix it and uplift
  3. provide workarounds for both (all) IR "backends"

If there's a strong enough reason (e.g. it's blocking the whole team on an important deliverable, so passing around git diff patches is not sustainable), I guess we could make an exception, but let's do that only with a strong commitment and an ETA for a fix from the metal team. Currently, the linked issue tenstorrent/tt-metal#13204 doesn't seem to have an owner or any traction.

Contributor Author:

As discussed offline, @skrsticTT will be taking a look at the issue on metal this week. Since his team is in the middle of a reorg, there is no ETA or guarantee that this will be fixed soon.

I'm all for waiting for a fix in metal, but this was prioritized as a result of the F2F discussions, so I'm deferring the question of how urgent this is to someone with more context: @nobradovictt.

nobradovictt (Contributor), Oct 14, 2024:

This is not about whether MNIST runs or not; it's about running MNIST fully L1-sharded in order to gain/measure performance and evaluate the optimizer implementation. MNIST is already being run functionally via the generality path, which includes emitc. The workaround was added based on instructions for the runtime component to do so, and it is properly tagged with the corresponding blocking issue so it can be removed. Optimizer development must not be blocked on the current state of other components.

Contributor:

Thanks for the context @nobradovictt.

> properly tagged with corresponding blocking issue to remove it

Can you tag the issue here, please?

Contributor Author:

Here it is: #891

  // TODO(bug #891): ttnn::matmul doesn't choose the correct program config.
  bool setMatmul1DProgramConfig;
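For context, a minimal sketch of how such a flag might be exposed through a process-wide workaround environment (the struct below is illustrative, assuming a singleton along the lines of the runtime's workaround::Env; it is not the PR's actual code):

namespace workaround {
struct Env {
  // TODO(bug #891): ttnn::matmul doesn't choose the correct program
  // config for width-sharded outputs, so the runtime forces a 1D
  // config until the fix lands in metal.
  bool setMatmul1DProgramConfig = true;

  // Returns the process-wide workaround configuration.
  static const Env &get() {
    static Env env;
    return env;
  }
};
} // namespace workaround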

}

return loc;
}
Contributor:

Just for my own understanding: this preserves the original file location, right? It results in that kind of nested loc syntax?

Contributor Author:

Yes, we use a FileLineColLoc nested inside a NameLoc. This preserves the same FileLineColLoc while giving it a new name.
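A minimal sketch of how such a location can be built with the MLIR API (the helper name and suffix scheme are illustrative, not the PR's actual code):

#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Location.h"
#include <string>

// Wraps an op's original location in a NameLoc whose name carries a
// unique suffix, keeping the original FileLineColLoc as the nested
// child location.
mlir::Location makeSuffixedLoc(mlir::MLIRContext *ctx,
                               mlir::Location originalLoc,
                               llvm::StringRef baseName,
                               int64_t operandIndex) {
  std::string name =
      baseName.str() + "_in_" + std::to_string(operandIndex);
  return mlir::NameLoc::get(mlir::StringAttr::get(ctx, name),
                            originalLoc);
}

Printed, this yields the nested syntax, e.g. loc("matmul_in_1"("model.mlir":10:3)).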

%0 = tensor.empty() : tensor<64x96xbf16> loc(#loc2)
// CHECK-DAG: %{{.*}} = "ttnn.to_device"{{.*}} loc(#[[LOC_MATMUL_IN0]])
// CHECK-DAG: %{{.*}} = "ttnn.to_device"{{.*}} -> tensor<128x96xbf16, #[[IN_1_LAYOUT]]> loc(#[[LOC_MATMUL_IN1]])
// CHECK-DAG: %{{.*}} = "ttnn.matmul"{{.*}} loc(#[[LOC_MATMUL]])
Contributor:

Should all these be CHECK-DAG, i.e. can they all be out of order?

Contributor Author:

I couldn't make it work any other way. My current understanding is that because of the CHECK-DAG for matmul locations at the start, any subsequent CHECK statements start scanning from the loc definitions, which are at the end of the file.

If you know of a better way, I'm open to suggestions.

odjuricicTT (Contributor Author):

@AleksKnezevic Can I get a review from runtime codeowners, please?

kmabeeTT (Contributor) left a comment:

Approve runtime, thanks for the discussion and the workaround flag.


if (workaround::Env::get().setMatmul1DProgramConfig &&
outputMemoryConfig.memory_layout ==
::tt::tt_metal::TensorMemoryLayout::WIDTH_SHARDED) {
Contributor:

Can we add the issue number here as well?
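For reference, with the issue number added, the gating above might read roughly like this (a sketch: the MatmulProgramConfig type and the build1DProgramConfig() helper are illustrative placeholders, not the PR's actual code):

// TODO(bug #891): ttnn::matmul doesn't choose the correct program
// config for width-sharded outputs; force a 1D config here until the
// fix lands in metal.
std::optional<MatmulProgramConfig> programConfig;
if (workaround::Env::get().setMatmul1DProgramConfig &&
    outputMemoryConfig.memory_layout ==
        ::tt::tt_metal::TensorMemoryLayout::WIDTH_SHARDED) {
  programConfig = build1DProgramConfig(outputMemoryConfig);
}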

* Generated ToLayout ops now add a suffix to location names
* Add a workaround for ttnn failing to choose the 1D matmul program config

odjuricicTT force-pushed the odjuricic/mnist-sharding branch from 159957a to e37bedd on October 16, 2024 at 09:07
odjuricicTT (Contributor Author):

@tapspatel Can you take a look at the PR please? Your approval is needed before merge.

tapspatel (Contributor):

Runtime changes look good!

odjuricicTT merged commit 21955b4 into main on Oct 16, 2024. 12 checks passed.
azecevicTT added a commit that referenced this pull request on Nov 13, 2024:

The workaround introduced in #894 is no longer needed; the issue was fixed in metal (tenstorrent/tt-metal#13819).

Closes #891

FYI @odjuricicTT