[N300] TTNN Unit Test Failures: Op PCC failures #6991
Expanding this list to include:
@cfjchu Except for acosh and logit, all other tests pass on recent main. To verify this, I removed the skip statements and ran the tests; the images are attached below for your reference. The acosh and logit ops fail because NaN cannot be stored on WHB0, as discussed here. @cfjchu Can you check and close this issue? @jliangTT Your comments, please.
@umadevimcw, is there any way to enable the acosh and logit tests for non-NaN scenarios?
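One way to read the question above: restrict the test inputs to each function's real domain so that the reference output never contains NaN. acosh is real only for x >= 1, and logit only for 0 < x < 1. The following is a minimal sketch in plain Python, not the ttnn test harness; the helper names `sample_acosh_inputs` and `sample_logit_inputs`, the ranges, and the `eps` margin are all hypothetical choices made for illustration.

```python
import math
import random


def sample_acosh_inputs(n, low=1.0, high=10.0, seed=0):
    """Draw n values in [low, high]; acosh is real-valued only for x >= 1."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]


def sample_logit_inputs(n, eps=1e-3, seed=0):
    """Draw n values in [eps, 1 - eps]; logit is finite only for 0 < p < 1."""
    rng = random.Random(seed)
    return [rng.uniform(eps, 1.0 - eps) for _ in range(n)]


def logit(p):
    """Reference logit: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Reference outputs computed on in-domain inputs only, so no NaN can appear.
acosh_out = [math.acosh(x) for x in sample_acosh_inputs(1024)]
logit_out = [logit(p) for p in sample_logit_inputs(1024)]
```

Whether the device tests can take input ranges like these depends on how the existing test files generate their tensors, which this thread does not show.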
Can you please submit a PR so we can verify this on our CI runners?
@jliangTT and @cfjchu We can do this. I will raise the PR with the updated test files, trigger CI runs, and share the links here.
@eyonland @bbradelTT In a recent test run, I executed the following files: test_min.py, test_mean.py, test_reduction.py, and test_sum.py, and all the tests passed. However, when I triggered the CI pipeline, it failed again. To troubleshoot, I ran the same command used in CI locally:
This command caused the tests to fail. After the failure, I noticed that test_mean.py failed even when run individually. After I reset the card and reran the individual tests, they passed. Running test_mean.py or test_min.py after test_maxpool2d.py causes test failures in CI. Both the maxpool and mean tests are reduction-based, so I suspect that the registers are not being cleared properly, which could be affecting the results on WHB0, though I'm not entirely certain. To mitigate this and avoid CI failures, I renamed these files and updated PR #14669 so that these tests are moved to a different group. In this PR there are no CI failures.
@bbradelTT We can reproduce the error by running
@umadevimcw In other words, there is an interaction between maxpool2d and mean that spans across tests and all of the setup and teardown (opening/closing the device, etc.), correct?
* #6991: Update test skips * #6991: Rename to update the position of the test files --------- Co-authored-by: umadevimcw <[email protected]>
@bbradelTT, this problem, where maxpool leaves the device in a state in which the next op (in this case the reduce op) does not work correctly, sounds a lot like the WH bug mentioned in #13569. @yan-zaretskiy can give more details, but it sounded like a register is not being properly cleared.
Minimum example: test_pool.py
generated/watcher/kernel_names.txt
Output:
@ncvetkovicTT I'll try to isolate what is causing the problem. Feels like something in the inits.
I tried making changes to ttnn/cpp/ttnn/operations/pool/generic/device/kernels/compute/max_pool_multi_core.cpp. After adding
the PCC is still bad on the second call. After commenting out
the PCC is good on the second call.
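For context on the "good" and "bad" PCC values above: PCC is taken here to mean the Pearson correlation coefficient between a reference output and the device output, with values near 1.0 indicating agreement. The following is a minimal plain-Python sketch of that metric, not the project's own comparison helper; the function name `pcc` and the handling of constant inputs are assumptions made for illustration.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(expected)
    if n == 0 or n != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_e = sum(expected) / n
    mean_a = sum(actual) / n

    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)

    # Assumption: for constant inputs the correlation is undefined, so fall
    # back to exact equality (1.0 if identical, 0.0 otherwise).
    if var_e == 0.0 or var_a == 0.0:
        return 1.0 if list(expected) == list(actual) else 0.0

    return cov / math.sqrt(var_e * var_a)
```

A check of this kind compares the op's output against a reference on each call, which is how a first call can pass while a second call on the same device fails.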
Related to #15824
Narrowed it down further: Commented out tt_metal/include/compute_kernel_api/pack_untilize.h
and reduce worked.
@bbradelTT Is this still an open issue? If not, can you please mark it closed? Thanks.
@prajaramanTT It is still open and is being tracked together with #15824.
Eltwise Unary failures:
I will be disabling this test while I enable the N300 unit test suite. FYI @jliangTT @xanderchin