
Enable mnist sharding with layout overrides #894

Merged: 7 commits into main from odjuricic/mnist-sharding, Oct 16, 2024

Conversation

odjuricicTT (Contributor):

MNIST sharding works, with a few caveats:

Generated ToLayout ops now have a suffix appended to their location names, in order to keep the names unique and allow overriding op inputs as needed.

Comment on lines 697 to 698
for (int64_t operandIndex = 0, numOperands = op->getNumOperands();
operandIndex < numOperands; ++operandIndex) {
Contributor:

Can't this be for (int64_t operandIndex = 0; operandIndex < op->getNumOperands(); ++operandIndex)?

I guess it's a perf optimization; do we know if it's worth it, though?
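For illustration, a minimal sketch of the two forms being compared (the process() helper is a hypothetical stand-in for the loop body, not the PR's actual code):

// Hoisted form (as in the PR): getNumOperands() is evaluated once,
// before the loop starts.
for (int64_t operandIndex = 0, numOperands = op->getNumOperands();
     operandIndex < numOperands; ++operandIndex) {
  process(op->getOperand(operandIndex));
}

// Simpler form: getNumOperands() is re-evaluated on every iteration.
// This only matters if the call is not trivially cheap or the operand
// count can change inside the loop body; for MLIR ops it is typically
// a cheap accessor.
for (int64_t operandIndex = 0; operandIndex < op->getNumOperands();
     ++operandIndex) {
  process(op->getOperand(operandIndex));
}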

Contributor Author:

Sure.

Comment on lines +14 to +15
// This is a workaround for the lack of program config selection in ttnn.matmul.
// The logic here is temporary and totally incomplete.
Contributor:

Do we have a workaround for emitc as well?

Contributor Author:

This is a temporary workaround just for MNIST, until the fix for this lands in metal. Do we run MNIST through emitc atm?

Contributor:

We want to :)

In general, I don't think it's okay to implement workarounds in only one of the "backends" of the IR. This puts pressure on the backends without the workaround to play catch-up. We're already deep enough into workarounds in ttrt; adding more will only widen the gap with emitc.

We should either:

  1. workaround this at the IR level
  2. wait for metal to fix it and uplift
  3. provide workarounds for both (all) IR "backends"

If there's a strong enough reason (e.g. it's blocking the whole team on an important deliverable, so passing around git diff patches is not sustainable), I guess we could make an exception, but let's do that only with a strong commitment and an ETA for a fix from the metal team. Currently, the linked issue tenstorrent/tt-metal#13204 doesn't seem to have an owner or any traction.

Contributor Author:

As discussed offline, @skrsticTT will be taking a look at the issue on metal this week. Since his team is in the middle of a reorg, there is no ETA or guarantee that this will be fixed soon.

I'm all for waiting for a fix in metal, but this was prioritized as a result of the F2F discussions, so I'm deferring the question of how urgent this is to someone with more context: @nobradovictt.

nobradovictt (Contributor), Oct 14, 2024:

This is not about whether MNIST runs or not; it's about running MNIST fully L1-sharded in order to gain/measure performance and evaluate the optimizer implementation. MNIST is already being run functionally via the generality path, which includes emitc. The workaround was added based on instructions for the runtime component to do so, and it is properly tagged with the corresponding blocking issue so it can be removed. Optimizer development must not be blocked on the current state of other components.

Contributor:

Thanks for the context @nobradovictt.

> properly tagged with corresponding blocking issue to remove it

Can you tag the issue here, please?

Contributor Author:

Here it is: #891

  // TODO(bug #891): ttnn::matmul doesn't choose the correct program config.
  bool setMatmul1DProgramConfig;
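For context, a minimal sketch of how such a flag might be exposed through a process-wide workaround environment (the struct below is illustrative, assuming a singleton along the lines of the runtime's workaround::Env; it is not the PR's actual code):

namespace workaround {
struct Env {
  // TODO(bug #891): ttnn::matmul doesn't choose the correct program
  // config for width-sharded outputs, so the runtime forces a 1D
  // config until the fix lands in metal.
  bool setMatmul1DProgramConfig = true;

  // Returns the process-wide workaround configuration.
  static const Env &get() {
    static Env env;
    return env;
  }
};
} // namespace workaround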

}

return loc;
}
Contributor:

Just for my own understanding: this preserves the original file location, right? It results in that kind of nested loc syntax?

Contributor Author:

Yes, we use a FileLineColLoc nested inside a NameLoc. This preserves the same FileLineColLoc while giving it a new name.
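A minimal sketch of how such a location can be built with the MLIR API (the helper name and suffix scheme are illustrative, not the PR's actual code):

#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Location.h"
#include <string>

// Wraps an op's original location in a NameLoc whose name carries a
// unique suffix, keeping the original FileLineColLoc as the nested
// child location.
mlir::Location makeSuffixedLoc(mlir::MLIRContext *ctx,
                               mlir::Location originalLoc,
                               llvm::StringRef baseName,
                               int64_t operandIndex) {
  std::string name =
      baseName.str() + "_in_" + std::to_string(operandIndex);
  return mlir::NameLoc::get(mlir::StringAttr::get(ctx, name),
                            originalLoc);
}

Printed, this yields the nested syntax, e.g. loc("matmul_in_1"("model.mlir":10:3)).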

%0 = tensor.empty() : tensor<64x96xbf16> loc(#loc2)
// CHECK-DAG: %{{.*}} = "ttnn.to_device"{{.*}} loc(#[[LOC_MATMUL_IN0]])
// CHECK-DAG: %{{.*}} = "ttnn.to_device"{{.*}} -> tensor<128x96xbf16, #[[IN_1_LAYOUT]]> loc(#[[LOC_MATMUL_IN1]])
// CHECK-DAG: %{{.*}} = "ttnn.matmul"{{.*}} loc(#[[LOC_MATMUL]])
Contributor:

Should all these be CHECK-DAG, i.e. can they all be out of order?

Contributor Author:

I couldn't make it work any other way. My current understanding is that because of the CHECK-DAG for matmul locations at the start, any subsequent CHECK statements start scanning from the loc definitions, which are at the end of the file.

If you know of a better way, I'm open to suggestions.

odjuricicTT (Contributor Author):

@AleksKnezevic Can I get a review from runtime codeowners, please?

kmabeeTT (Contributor) left a comment:

Approve runtime, thanks for the discussion and the workaround flag.


if (workaround::Env::get().setMatmul1DProgramConfig &&
outputMemoryConfig.memory_layout ==
::tt::tt_metal::TensorMemoryLayout::WIDTH_SHARDED) {
Contributor:

Can we add the issue number here as well?
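For reference, with the issue number added, the gating above might read roughly like this (a sketch: the MatmulProgramConfig type and the build1DProgramConfig() helper are illustrative placeholders, not the PR's actual code):

// TODO(bug #891): ttnn::matmul doesn't choose the correct program
// config for width-sharded outputs; force a 1D config here until the
// fix lands in metal.
std::optional<MatmulProgramConfig> programConfig;
if (workaround::Env::get().setMatmul1DProgramConfig &&
    outputMemoryConfig.memory_layout ==
        ::tt::tt_metal::TensorMemoryLayout::WIDTH_SHARDED) {
  programConfig = build1DProgramConfig(outputMemoryConfig);
}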

* Generated ToLayout ops now add a suffix to location names
* Add a workaround for ttnn failing to choose the 1D matmul program config

odjuricicTT force-pushed the odjuricic/mnist-sharding branch from 159957a to e37bedd on October 16, 2024 at 09:07
odjuricicTT (Contributor Author):

@tapspatel Can you take a look at the PR please? Your approval is needed before merge.

tapspatel (Contributor):

Runtime changes look good!

odjuricicTT merged commit 21955b4 into main on Oct 16, 2024. 12 checks passed.
azecevicTT added a commit that referenced this pull request on Nov 13, 2024:

The workaround introduced in #894 is no longer needed; the issue was fixed in metal (tenstorrent/tt-metal#13819).

Closes #891

FYI @odjuricicTT