multiple ttnn operations hang when run many times in loop on WH cards #4702
Comments
Hi @jliangTT, @eyonland, @nemanjagrujic. @Aswinmcw and I worked on this issue and found the following issues.
We have documented them with screenshots here: TTNN hang Issue Analysis.pdf. We need your input to proceed further.
Next step:
Hi @jliangTT,
We would like your input to proceed further on this issue.
On the latest main, all the tests fail with the error below. Debugging is in progress.
Hi @nemanjagrujic, @eyonland, @KalaivaniMCW. During migration we added only TILE layout support for the unary ops below, so the tests fail with a runtime error.
During migration we likewise added only TILE layout support for the binary ops below, so those tests also fail with a runtime error.
I need to start a discussion with @KalaivaniMCW on adding ROW_MAJOR support to resolve the above failures.
@ruthreshx Yes, at the time row major worked, and this hang was only observed for row major.
Given that we no longer support auto-formatting and expect the user to call to_layout, can we verify that this operation does not hang when run many times with tile layout? If it does not, please close this issue.
@eyonland I can test, but I don't expect it to hang.
Hi @nemanjagrujic,
@ruthreshx Mul did not hang. I need to test the others as well.
@ruthreshx Sub did not hang overnight. Next night I'll try another op.
@nemanjagrujic I have updated the layouts of all the tests in PR #14318. Can you please test and confirm whether the tests still hang?
@nemanjagrujic I have tested tanh, exp, and add; they work fine, and I observed no hang.
@umadevimcw Yes. No hangs with TILE. This was surely a ROW_MAJOR issue.
The ttnn.mul, ttnn.tanh, ttnn.add, ttnn.sub, ttnn.exp, and ttnn.gelu operations hang when run many times in a loop. The hang happens only on Wormhole cards (for now).
To Reproduce
Steps to reproduce the behavior:
Check out the syrmia/ttnn-sweeps branch (soon to be merged into main). Run one of the unit tests
tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_mul_hang.py
or tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_tanh_hang.py
or tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_add_hang.py
or tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_sub_hang.py
or tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_exp_hang.py
or tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_gelu_hang.py
using this command (for instance):
Expected behavior
The unit test runs 3 setups of the operation in a loop.
After a non-deterministic period of running ops (15-20 minutes), the program hangs and stops running.
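Because the hang appears only after a non-deterministic amount of time, it can help to wrap the loop in a watchdog that reports a stall instead of blocking forever. The following is a minimal sketch in plain Python, not the code from the tests above: run_with_watchdog and its parameters are hypothetical names, and op stands in for whatever callable invokes the ttnn operation. It also assumes the blocked call releases the Python GIL; if it does not, a separate process would be needed instead of a thread.

```python
import threading
import time


def run_with_watchdog(op, iterations, stall_timeout_s):
    """Call op() up to `iterations` times in a background thread.

    Returns (completed_calls, hung). `hung` is True when no call
    finished within `stall_timeout_s` seconds.
    """
    progress = {"done": 0, "last_beat": time.monotonic()}
    finished = threading.Event()

    def worker():
        for _ in range(iterations):
            op()  # hypothetical stand-in for the op under test
            progress["done"] += 1
            progress["last_beat"] = time.monotonic()
        finished.set()

    # Daemon thread: a stuck op must not keep the interpreter alive.
    threading.Thread(target=worker, daemon=True).start()

    # Poll a few times per timeout window until the loop finishes.
    while not finished.wait(timeout=stall_timeout_s / 4):
        if time.monotonic() - progress["last_beat"] > stall_timeout_s:
            return progress["done"], True  # no progress: treat as a hang
    return progress["done"], False
```

Under these assumptions, a call such as run_with_watchdog(run_op, 100000, 60.0), where run_op is a hypothetical wrapper around one setup of the operation, would return the number of completed iterations and whether the loop stalled, which makes it easier to compare ROW_MAJOR and TILE runs.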