Enable mnist sharding with layout overrides #894
Conversation
for (int64_t operandIndex = 0, numOperands = op->getNumOperands();
     operandIndex < numOperands; ++operandIndex) {
Can't this be for (int64_t operandIndex = 0; operandIndex < op->getNumOperands(); ++operandIndex)?
I guess it's a perf optimization; do we know if it's worth it, though?
Sure.
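For reference, a minimal sketch of the simplified form suggested above; op is assumed to be an mlir::Operation *, and hoisting getNumOperands() out of the condition is at most a micro-optimization since it is a cheap accessor:

// Simplified loop: re-reads getNumOperands() on every iteration.
for (int64_t operandIndex = 0; operandIndex < op->getNumOperands(); ++operandIndex) {
  // ... process op->getOperand(operandIndex) ...
}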
// This is a workaround for the lack of program config selection in ttnn.matmul.
// The logic here is temporary and totally incomplete.
Do we have a workaround for emitc as well?
This is a temporary workaround just for mnist until the fix for this lands in metal. Do we run mnist through emitc atm?
We want to :)
In general, I don't think it's okay to implement workarounds in one of the "backends" of the IR. This puts pressure on the backends without the workaround to play catch-up. We're already deep enough into workarounds in ttrt; adding more will only widen the gap with emitc.
We should either:
- workaround this on the IR level
- wait for metal to fix and uplift
- provide workarounds for both (all) IR "backends"
If there's a strong enough reason (e.g. it's blocking the whole team for an important deliverable, so passing around git diff patches is not sustainable), I guess we could make an exception, but let's do that only with a strong commitment and an ETA on a fix from metal team. Currently, the linked issue tenstorrent/tt-metal#13204 doesn't seem to have an owner, nor any traction.
As discussed offline, @skrsticTT will be taking a look at the issue on metal this week. Because his team is in the middle of a reorg, there is no ETA or guarantee that this will be fixed soon.
I'm all for waiting for a fix in metal, but this was prioritized as a result of the F2F discussions, so I'm deferring the question of how urgent this is to someone with more context: @nobradovictt.
This is not about whether MNIST runs at all; it is about running MNIST fully L1-sharded to measure performance and evaluate the optimizer implementation. MNIST is already running functionally via the generality path, which includes emitc. The workaround was added per instructions for the runtime component and is properly tagged with the corresponding blocking issue that tracks its removal. Optimizer development must not be blocked on the current state of other components.
Thanks for the context @nobradovictt
properly tagged with corresponding blocking issue to remove it
Can you tag the issue here please?
Here it is: #891
// TODO(bug #891): ttnn::matmul doesn't choose the correct program config.
bool setMatmul1DProgramConfig;
}
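For context, a hedged sketch of the kind of workaround-environment singleton a flag like this typically lives in; everything here except the setMatmul1DProgramConfig field is an assumption about the surrounding runtime code:

// Hypothetical shape of the runtime workaround environment; the real
// workaround::Env in the runtime may be constructed differently.
namespace workaround {
struct Env {
  static const Env &get() {
    static const Env instance;
    return instance;
  }
  // TODO(bug #891): ttnn::matmul doesn't choose the correct program config.
  bool setMatmul1DProgramConfig = true;
};
} // namespace workaround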
  return loc;
}
Just for my own understanding, this preserves the original file location, right? It results in that kind of nested loc syntax?
Yes, we use FileLineColLoc nested inside a NameLoc. This preserves the same FileLineColLoc with a new name.
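For illustration, a minimal sketch of building such a nested location with the MLIR API; the helper name and the name string are illustrative:

// Wrap an existing location (e.g. a FileLineColLoc) in a NameLoc,
// preserving the original location as the NameLoc's child.
mlir::Location wrapWithName(mlir::Location loc, llvm::StringRef name) {
  mlir::MLIRContext *ctx = loc.getContext();
  return mlir::NameLoc::get(mlir::StringAttr::get(ctx, name), loc);
}

Printed, this produces the nested syntax mentioned above, e.g. loc("matmul_in0"("model.mlir":12:3)).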
%0 = tensor.empty() : tensor<64x96xbf16> loc(#loc2)
// CHECK-DAG: %{{.*}} = "ttnn.to_device"{{.*}} loc(#[[LOC_MATMUL_IN0]])
// CHECK-DAG: %{{.*}} = "ttnn.to_device"{{.*}} -> tensor<128x96xbf16, #[[IN_1_LAYOUT]]> loc(#[[LOC_MATMUL_IN1]])
// CHECK-DAG: %{{.*}} = "ttnn.matmul"{{.*}} loc(#[[LOC_MATMUL]])
Should all of these be CHECK-DAG, i.e. can they all match out of order?
I couldn't make it work any other way. My current understanding is that because of the CHECK-DAG for matmul locations at the start, any subsequent CHECK statements start scanning from the loc definitions, which are at the end of the file.
If you know of a better way, I'm open to suggestions.
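For illustration, the FileCheck semantics at play here (the patterns below are illustrative): a block of consecutive CHECK-DAG directives may match in any order, and the next plain CHECK resumes scanning after the furthest CHECK-DAG match.

// These two may match in either order anywhere after the previous match point.
// CHECK-DAG: "ttnn.to_device"
// CHECK-DAG: loc(#[[LOC_MATMUL]])
// A plain CHECK placed after the group scans only text past the furthest of
// the matches above; if a CHECK-DAG matched a loc definition at the end of
// the file, ops earlier in the file become unreachable for plain CHECKs.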
@AleksKnezevic Can I get a review from runtime codeowners please?
Approve runtime, thanks for the discussion and the workaround flag.
if (workaround::Env::get().setMatmul1DProgramConfig &&
    outputMemoryConfig.memory_layout ==
        ::tt::tt_metal::TensorMemoryLayout::WIDTH_SHARDED) {
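For reference, a hedged sketch of how this guarded branch might be completed; matmulArgs and buildMatmul1DProgramConfig are hypothetical names, not the actual runtime code:

// Hypothetical body: when the output is width-sharded and the workaround
// flag is set, force a 1D matmul program config instead of letting
// ttnn::matmul pick one (bug #891).
if (workaround::Env::get().setMatmul1DProgramConfig &&
    outputMemoryConfig.memory_layout ==
        ::tt::tt_metal::TensorMemoryLayout::WIDTH_SHARDED) {
  // buildMatmul1DProgramConfig() would derive per-core block sizes from the
  // sharded output memory config; its exact shape depends on the ttnn API.
  matmulArgs.programConfig = buildMatmul1DProgramConfig(outputMemoryConfig);
}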
Can we add the issue number here as well?
* Generated ToLayout ops now add a suffix to the location name
* Add workaround for ttnn failing to choose the 1D matmul program config
Force-pushed from 159957a to e37bedd
@tapspatel Can you take a look at the PR please? Your approval is needed before merge.
runtime changes look good!
Workaround introduced in #894 is not needed anymore. The issue was fixed in metal: tenstorrent/tt-metal#13819. Closes #891. FYI @odjuricicTT
MNIST sharding works with a few caveats:
Generated ToLayout ops now have a suffix on their location names, to keep the names unique and allow overriding op inputs as needed.
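For illustration, a hypothetical example of what the suffixed locations could look like in printed IR; the exact suffix scheme is an assumption, only the NameLoc-over-FileLineColLoc nesting is confirmed above:

#loc1 = loc("model.mlir":10:5)
// Hypothetical suffixed names for the ToLayout ops feeding a matmul:
#loc2 = loc("matmul_in0"(#loc1))
#loc3 = loc("matmul_in1"(#loc1))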