[Bug Report] Binary op with scalar is very slow #13644
Comments
@dmakoviichuk-tt Can you provide details on how you collected the performance numbers?
@dmakoviichuk-tt, my assumption is that you swapped out the ttnn function with a direct PyTorch function that runs on host and saw the overall perf difference. If a small tensor is operated on many times during training, I can see how CPU branch prediction would be blazingly fast compared to pushing the tensor on and off device. Did you measure it by the overall training time?
@dmakoviichuk-tt, what was the size of the tensor?
@umadevimcw with a timer.
@eyonland it is obviously really bad and slow code for such a simple operation.
Please be respectful to your colleagues. Right now it looks like you are trying to avoid fixing this obvious issue!
Sorry for the misunderstanding here. I was trying to figure out how you measured it originally. We absolutely should be passing the scalar as a runtime arg and never ever create a tensor. My time has been stretched thin on this issue, as well as on rebuilding eltwise ops to properly handle broadcasting, given that bcast does not adequately do this and the use of repeat is absolutely terrible given that we make multiple calls.
Moving to P1.
Describe the bug
Every time we call a binary op with a scalar, we create a tensor from that scalar and then also call ttnn::repeat.
We are using this in the optimizer step for each layer: https://github.com/tenstorrent/TT-Tron/blob/main/sources/ttml/optimizers/sgd.cpp.
SGD performance is 10 times slower than the PyTorch CPU version.
To Reproduce
Just run any binary op with a tensor and a scalar.
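For reference, a minimal timing sketch of the repro, assuming the C++ ttnn::add overload that accepts a float scalar operand (device and tensor setup are omitted, and the header path is an assumption):

```cpp
// Timing sketch only: device/tensor creation omitted; header path is an assumption.
// #include "ttnn/operations/eltwise/binary/binary.hpp"
#include <chrono>
#include <iostream>

// Runs N tensor-scalar adds and reports host wall-clock time; today each call
// materializes a tensor from the scalar and goes through ttnn::repeat.
void benchmark_scalar_add(ttnn::Tensor tensor, int iterations = 100) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        tensor = ttnn::add(tensor, 1.0f);  // binary op with a scalar operand
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    std::cout << iterations << " scalar adds: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()
              << " ms\n";
}
```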
Expected behavior
The scalar parameter should be passed as a runtime arg to the program.
We should never create a new tensor on the CPU for every call.
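For illustration, a host-side sketch of that approach, assuming TT-Metal's SetRuntimeArgs and a kernel that bit-casts the packed word back to float; the argument layout and helper names here are hypothetical:

```cpp
// Hypothetical sketch: pack the scalar into the kernel's runtime args instead of
// building a tensor. Assumes tt::tt_metal::SetRuntimeArgs and C++20 std::bit_cast.
#include <bit>
#include <cstdint>
#include <vector>

void set_scalar_as_runtime_arg(
    tt::tt_metal::Program& program,
    tt::tt_metal::KernelHandle kernel,
    const CoreCoord& core,
    uint32_t src_buffer_addr,   // hypothetical existing args
    uint32_t dst_buffer_addr,
    float scalar) {
    // The scalar travels as a bit-cast uint32_t alongside the other runtime args;
    // no host tensor, no ttnn::repeat, no extra device allocation.
    std::vector<uint32_t> runtime_args = {
        src_buffer_addr,
        dst_buffer_addr,
        std::bit_cast<uint32_t>(scalar),
    };
    tt::tt_metal::SetRuntimeArgs(program, kernel, core, runtime_args);
    // Kernel side (sketch): read arg 2 and bit-cast it back to float.
}
```

Since only the runtime args change when the scalar changes, the compiled program could presumably be cached and reused across optimizer steps.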
Additional context
@eyonland I assigned this ticket to you as the elementwise owner.
My expectation is that you can drive it to the LLK team and make sure they and your team can add the needed changes at both the metal and ttnn levels. If you cannot do it for some reason, please let me know and I'll find a new owner.
This significantly reduces the performance of our training code.