[N300] TTNN Unit Test Failures: Op PCC failures #6991

Open
Tracked by #13795
cfjchu opened this issue Apr 2, 2024 · 26 comments

@cfjchu
Collaborator

cfjchu commented Apr 2, 2024

Eltwise Unary failures:

FAILED tests/ttnn/unit_tests/operations/test_unary.py::test_gelu[w=128-h=64] - AssertionError: 0.9996086108852655
FAILED tests/ttnn/unit_tests/operations/test_unary.py::test_asinh[w=128-h=64] - AssertionError: 0.999800979188091
FAILED tests/ttnn/unit_tests/operations/test_unary.py::test_acosh[w=128-h=64] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_unary.py::test_logit[w=128-h=64-scalar=2] - AssertionError: 0.0

I will be disabling these tests while I enable the N300 unit test suite. FYI @jliangTT @xanderchin
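For readers new to these logs: the number after `AssertionError` appears to be the Pearson correlation coefficient (PCC) between the expected PyTorch result and the device result. The sketch below illustrates that metric in plain Python. It is not the repository's `assert_with_pcc`; the helper name `pearson_cc` and the choice to report degenerate or NaN-contaminated inputs as 0.0 are assumptions made for illustration.

```python
import math


def pearson_cc(expected, actual):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(expected) != len(actual) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    denom = math.sqrt(var_e * var_a)
    # Assumption: constant or NaN-contaminated inputs are reported as 0.0.
    # The real helper may treat these cases differently.
    if denom == 0.0 or math.isnan(denom):
        return 0.0
    return cov / denom


expected = [1.0, 2.0, 3.0, 4.0]
pcc_same = pearson_cc(expected, expected)
pcc_reversed = pearson_cc(expected, [4.0, 3.0, 2.0, 1.0])
pcc_nan = pearson_cc(expected, [float("nan")] * 4)
```

Under this reading, values such as 0.9996 indicate a small numerical drift, while 0.0 points to output that is unusable for correlation (for example NaN).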

@cfjchu
Collaborator Author

cfjchu commented Apr 2, 2024

Expanding this list to include:

FAILED tests/ttnn/unit_tests/operations/test_mean.py::test_mean[dim=-2-w=32-h=32-batch_size=1] - AssertionError: 0.7129726228467889
FAILED tests/ttnn/unit_tests/operations/test_mean.py::test_mean[dim=-2-w=32-h=32-batch_size=16] - AssertionError: 0.7205312620743286
FAILED tests/ttnn/unit_tests/operations/test_mean.py::test_mean[dim=-2-w=32-h=64-batch_size=1] - AssertionError: 0.8355362743165149
FAILED tests/ttnn/unit_tests/operations/test_mean.py::test_mean[dim=-2-w=32-h=64-batch_size=16] - AssertionError: 0.7062494470199938
FAILED tests/ttnn/unit_tests/operations/test_mean.py::test_mean[dim=-2-w=64-h=32-batch_size=1] - AssertionError: 0.7373753494055079
FAILED tests/ttnn/unit_tests/operations/test_mean.py::test_mean[dim=-2-w=64-h=32-batch_size=16] - AssertionError: 0.7118005357145872
FAILED tests/ttnn/unit_tests/operations/test_mean.py::test_mean[dim=-2-w=64-h=64-batch_size=1] - AssertionError: 0.7463230337588651
FAILED tests/ttnn/unit_tests/operations/test_mean.py::test_mean[dim=-2-w=64-h=64-batch_size=16] - AssertionError: 0.7176044785834667
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min[dim=-2-w=32-h=32-batch_size=1] - AssertionError: 0.37873407146292576
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min[dim=-2-w=32-h=32-batch_size=16] - AssertionError: 0.023437531800448726
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min[dim=-2-w=32-h=64-batch_size=1] - AssertionError: 0.24575360587110964
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min[dim=-2-w=32-h=64-batch_size=16] - AssertionError: 0.008568280199380923
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min[dim=-2-w=64-h=32-batch_size=1] - AssertionError: 0.19078116935535616
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min[dim=-2-w=64-h=32-batch_size=16] - AssertionError: 0.061311480784068034
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min[dim=-2-w=64-h=64-batch_size=1] - AssertionError: 0.12420826822825638
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min[dim=-2-w=64-h=64-batch_size=16] - AssertionError: 0.0442237595166964
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min_global[w=32-h=32-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min_global[w=32-h=32-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min_global[w=32-h=64-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min_global[w=32-h=64-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min_global[w=64-h=32-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min_global[w=64-h=32-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min_global[w=64-h=64-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_min.py::test_min_global[w=64-h=64-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_std[dim=-2-w=32-h=32-batch_size=1] - AssertionError: -0.01592792482127716
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_std[dim=-2-w=32-h=32-batch_size=16] - AssertionError: 0.11758006507833155
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_std[dim=-2-w=32-h=64-batch_size=1] - AssertionError: 0.0879071839955282
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_std[dim=-2-w=32-h=64-batch_size=16] - AssertionError: 0.11797115729529092
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_std[dim=-2-w=64-h=32-batch_size=1] - AssertionError: 0.1015263161681264
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_std[dim=-2-w=64-h=32-batch_size=16] - AssertionError: 0.13608048986141633
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_std[dim=-2-w=64-h=64-batch_size=1] - AssertionError: -0.008369961834068705
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_std[dim=-2-w=64-h=64-batch_size=16] - AssertionError: 0.06822360579594006
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_var[dim=-2-w=32-h=32-batch_size=1] - AssertionError: 0.1436355302489401
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_var[dim=-2-w=32-h=32-batch_size=16] - AssertionError: 0.245221336097406
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_var[dim=-2-w=32-h=64-batch_size=1] - AssertionError: 0.18272780832260738
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_var[dim=-2-w=32-h=64-batch_size=16] - AssertionError: 0.20706627347577075
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_var[dim=-2-w=64-h=32-batch_size=1] - AssertionError: 0.20329528485682932
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_var[dim=-2-w=64-h=32-batch_size=16] - AssertionError: 0.25142007288517854
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_var[dim=-2-w=64-h=64-batch_size=1] - AssertionError: 0.09586102301832924
FAILED tests/ttnn/unit_tests/operations/test_reduction.py::test_var[dim=-2-w=64-h=64-batch_size=16] - AssertionError: 0.15608310215631055
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=-2-w=32-h=32-batch_size=1] - AssertionError: 0.7132472785346919
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=-2-w=32-h=32-batch_size=16] - AssertionError: 0.7206739371426938
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=-2-w=32-h=64-batch_size=1] - AssertionError: 0.8352204242661911
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=-2-w=32-h=64-batch_size=16] - AssertionError: 0.7064621099855871
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=-2-w=64-h=32-batch_size=1] - AssertionError: 0.7371152272048267
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=-2-w=64-h=32-batch_size=16] - AssertionError: 0.7118459985631931
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=-2-w=64-h=64-batch_size=1] - AssertionError: 0.7459746034644239
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=-2-w=64-h=64-batch_size=16] - AssertionError: 0.7175584942718809
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=(2, 1)-w=32-h=32-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=(2, 1)-w=32-h=32-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=(2, 1)-w=32-h=64-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=(2, 1)-w=32-h=64-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=(2, 1)-w=64-h=32-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=(2, 1)-w=64-h=32-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=(2, 1)-w=64-h=64-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum[dim=(2, 1)-w=64-h=64-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum_global[w=32-h=32-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum_global[w=32-h=32-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum_global[w=32-h=64-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum_global[w=32-h=64-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum_global[w=64-h=32-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum_global[w=64-h=32-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum_global[w=64-h=64-batch_size=1] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_sum.py::test_sum_global[w=64-h=64-batch_size=16] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_unary.py::test_gelu[w=128-h=64] - AssertionError: 0.9996086108852716
FAILED tests/ttnn/unit_tests/operations/test_unary.py::test_asinh[w=128-h=64] - AssertionError: 0.9998009791880944
FAILED tests/ttnn/unit_tests/operations/test_unary.py::test_acosh[w=128-h=64] - AssertionError: 0.0
FAILED tests/ttnn/unit_tests/operations/test_unary.py::test_logit[w=128-h=64-scalar=2] - AssertionError: 0.0

@umadevimcw
Contributor

umadevimcw commented Apr 9, 2024

@cfjchu Except for acosh and logit, all of the tests pass on recent main. To check, I removed the skip statements and reran the tests. Screenshots are attached below for reference.

[Screenshots (3), dated 2024-04-09]

The acosh and logit ops fail because NaN cannot be stored on WHB0, as discussed here:

image

#4409 (comment)

@cfjchu Can you check and close this issue?

@jliangTT Your comments please

@jliangTT

@umadevimcw, is there any way to enable acosh and logit to test for non-NaN scenarios?

@cfjchu
Collaborator Author

cfjchu commented Apr 11, 2024

Can you please submit a PR so we can verify this on our CI runners?

@umadevimcw
Contributor

@umadevimcw, is there any way to enable acosh and logit to test for non-NaN scenarios?

@jliangTT and @cfjchu we can do this. I will raise the PR with the updated test files, trigger CI runs and share the links here
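One way the non-NaN scenarios might be covered is to draw test inputs only from each function's real-valued domain: acosh is real for x >= 1, and logit(x) = log(x / (1 - x)) is finite for x strictly inside (0, 1). The sketch below shows the idea with the Python standard library only. The helper names, ranges, and seeds are hypothetical, and the test's `scalar` parameter for logit is not modeled; the actual tests would build torch tensors instead.

```python
import math
import random


def sample_acosh_inputs(n, high=10.0, seed=0):
    """Sample n values in [1, high], where acosh is real-valued."""
    rng = random.Random(seed)
    return [rng.uniform(1.0, high) for _ in range(n)]


def sample_logit_inputs(n, eps=1e-3, seed=0):
    """Sample n values in [eps, 1 - eps], inside (0, 1) where logit is finite."""
    rng = random.Random(seed)
    return [rng.uniform(eps, 1.0 - eps) for _ in range(n)]


def logit(x):
    """Reference logit for 0 < x < 1."""
    return math.log(x / (1.0 - x))


# Reference values computed on in-domain inputs contain no NaN or infinity.
acosh_ref = [math.acosh(x) for x in sample_acosh_inputs(8)]
logit_ref = [logit(x) for x in sample_logit_inputs(8)]
```

Restricting the input range this way keeps the PCC comparison meaningful, since neither the reference nor the device output needs to represent NaN.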

@umadevimcw
Contributor

umadevimcw commented Nov 5, 2024

@eyonland @bbradelTT In a recent test run, I executed the following files: test_min.py, test_mean.py, test_reduction.py, and test_sum.py, and all the tests passed. However, when I triggered the CI pipeline, it failed again. To troubleshoot, I ran the same command used in CI locally:

pytest tests/ttnn/unit_tests -xv --splits 6 --group 2 -m "not disable_fast_runtime_mode"

This command caused the tests to fail. After the failure, I noticed that even when running test_mean.py individually, it was failing. To resolve this, I reset the card and reran the individual tests, and they passed.

Running test_mean.py or test_min.py after test_maxpool2d.py causes test failures in CI. Both the Maxpool and Mean tests are reduction-based, so I suspect that the registers might not be cleared properly, which could be impacting the results in WHB0, though I'm not entirely certain.

To mitigate this issue and avoid CI failures, I renamed these files and updated PR #14669 so that these tests move to a different group. In this PR there are no CI failures.

@umadevimcw
Contributor

@bbradelTT We can reproduce the error by running

pytest tests/ttnn/unit_tests/operations/test_maxpool2d.py first and then

pytest tests/ttnn/unit_tests/operations/test_mean.py
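The two-step reproduction above can be scripted so that the order is always the same. The following is a minimal sketch; the function names `build_repro_commands` and `run_in_order` are hypothetical, and the `runner` parameter exists only so the logic can be exercised without hardware. The file paths are the ones quoted in this thread.

```python
import subprocess

MAXPOOL_TEST = "tests/ttnn/unit_tests/operations/test_maxpool2d.py"
MEAN_TEST = "tests/ttnn/unit_tests/operations/test_mean.py"


def build_repro_commands(first=MAXPOOL_TEST, second=MEAN_TEST):
    """Return the two pytest invocations in the order they must run."""
    return [["pytest", first], ["pytest", second]]


def run_in_order(commands, runner=subprocess.run):
    """Run each command in sequence and collect (command, return code)."""
    results = []
    for cmd in commands:
        completed = runner(cmd)
        results.append((cmd, completed.returncode))
    return results
```

Calling `run_in_order(build_repro_commands())` from the repository root should run maxpool first and mean second; a non-zero return code for the second command would match the failure described here.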

@bbradelTT
Contributor

@bbradelTT We can reproduce the error by running

pytest tests/ttnn/unit_tests/operations/test_maxpool2d.py first and then

pytest tests/ttnn/unit_tests/operations/test_mean.py

@umadevimcw in other words, there is an interaction between maxpool2d and mean that spans across tests and all of the setup and teardown (opening/closing the device, etc.), correct?

umadevimcw pushed a commit that referenced this issue Nov 6, 2024
umadevimcw pushed a commit that referenced this issue Nov 6, 2024
umadevimcw added a commit that referenced this issue Nov 6, 2024
* #6991: Update test skips

* #6991: Rename to update the position of the test files

---------

Co-authored-by: umadevimcw <[email protected]>
@eyonland
Contributor

eyonland commented Nov 7, 2024

@bbradelTT, this problem, where maxpool leaves the device in a state such that the next op (in this case the reduce op) does not work correctly, sounds a lot like the WH bug mentioned in #13569. @yan-zaretskiy can give more details on this, but it sounded like a register is not being properly cleared.

@umadevimcw
Contributor

@bbradelTT We can reproduce the error by running

pytest tests/ttnn/unit_tests/operations/test_maxpool2d.py first and then

pytest tests/ttnn/unit_tests/operations/test_mean.py

test_mean.py is renamed to test_reduction_mean.py

@bbradelTT bbradelTT added the P1 label Nov 7, 2024
ct-clmsn pushed a commit to ct-clmsn/tt-metal that referenced this issue Nov 12, 2024
* tenstorrent#6991: Update test skips

* tenstorrent#6991: Rename to update the position of the test files

---------

Co-authored-by: umadevimcw <[email protected]>
@bbradelTT
Contributor

bbradelTT commented Dec 23, 2024

Minimum example:
Run tt-smi reset, then run the following via python test_pool.py

test_pool.py

import torch
import ttnn
from models.utility_functions import torch_random
from tests.ttnn.utils_for_testing import assert_with_pcc


def test_maxpool(device, input_shape, kernel_size, stride, padding, dilation):
    torch_input = torch.rand(input_shape, dtype=torch.bfloat16)
    batch_size, in_c, in_h, in_w = input_shape

    input_tensor = torch.permute(torch_input, (0, 2, 3, 1))
    input_tensor = torch.reshape(input_tensor, (1, 1, -1, in_c))
    input_tensor = ttnn.from_torch(input_tensor, layout=ttnn.ROW_MAJOR_LAYOUT, device=device)
    output_tensor = ttnn.max_pool2d(
        input_tensor,
        batch_size,
        in_h,
        in_w,
        in_c,
        kernel_size,
        stride,
        padding,
        dilation,
    )

    expected_output = torch.nn.functional.max_pool2d(torch_input, kernel_size, stride, padding)

    output_tensor = ttnn.to_torch(output_tensor)
    _, out_c, out_h, out_w = expected_output.shape
    output_tensor = torch.reshape(output_tensor, (batch_size, out_h, out_w, out_c))
    output_tensor = torch.permute(output_tensor, (0, 3, 1, 2))

    # COMMENTED OUT FOR INIT DEBUGGING assert torch.allclose(output_tensor, expected_output), "mismatch" 


def test_global_avg_pool(device, input_shape):
    torch_input_tensor = torch.randn(input_shape, dtype=torch.bfloat16)
    torch_input_tensor = torch.ones(input_shape, dtype=torch.bfloat16)
    torch_output_tensor = torch.nn.functional.adaptive_avg_pool2d(torch_input_tensor, (1, 1))

    input_tensor = torch.permute(torch_input_tensor, (0, 2, 3, 1))
    input_tensor = ttnn.from_torch(input_tensor, layout=ttnn.TILE_LAYOUT, device=device)
    output_tensor = ttnn.global_avg_pool2d(input_tensor)
    #print(f'{input_tensor}\n{output_tensor}\ntot{torch_output_tensor}')
    output_tensor = ttnn.to_torch(output_tensor)
    output_tensor = torch.permute(output_tensor, (0, 3, 1, 2))

    assert_with_pcc(torch_output_tensor, output_tensor)

def test_mean(device, batch_size, h, w, dim):
    torch.manual_seed(0)

    torch_input_tensor = torch_random((batch_size, h, w), -1, 1, dtype=torch.bfloat16)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=dim, keepdim=True, dtype=torch.bfloat16)

    input_tensor = ttnn.from_torch(torch_input_tensor, layout=ttnn.TILE_LAYOUT, device=device)

    output_tensor = ttnn.mean(input_tensor, dim=dim)
    output_tensor = ttnn.to_torch(output_tensor)
    assert_with_pcc(torch_output_tensor, output_tensor)

if __name__ == "__main__":
    # Open the device before the try block so `device` is always bound in `finally`.
    device = ttnn.open_device(device_id=0, l1_small_size=4096)
    try:
        test_maxpool(device, (1, 192, 56, 56), (2, 2), (2, 2), (0, 0), (1, 1))
        #test_global_avg_pool(device, (1, 64, 1, 32))
        test_mean(device, 1, 32, 32, -2)
    finally:
        ttnn.close_device(device)

generated/watcher/kernel_names.txt

0: blank
1: tt_metal/impl/dispatch/kernels/cq_prefetch.cpp
2: tt_metal/impl/dispatch/kernels/cq_dispatch.cpp
3: tt_metal/impl/dispatch/kernels/cq_dispatch_slave.cpp
4: ttnn/cpp/ttnn/operations/reduction/generic/device/kernels/dataflow/reader_unary_transpose_wh_interleaved_input_cols_partitioned.cpp
5: ttnn/cpp/ttnn/operations/eltwise/unary/device/kernels/dataflow/writer_unary_interleaved_start_id.cpp
6: ttnn/cpp/ttnn/operations/reduction/generic/device/kernels/compute/reduce_h.cpp
7: ttnn/cpp/ttnn/operations/data_movement/sharded/device/kernels/dataflow/reader_unary_stick_layout_sharded_blocks_interleaved_start_id.cpp
8: ttnn/cpp/ttnn/operations/data_movement/sharded/device/kernels/dataflow/writer_unary_sharded.cpp
9: ttnn/cpp/ttnn/operations/data_movement/untilize_with_halo_v2/device/kernels/dataflow/halo_gather.cpp
10: ttnn/cpp/ttnn/operations/data_movement/untilize_with_halo_v2/device/kernels/dataflow/halo_gather.cpp
11: ttnn/cpp/ttnn/operations/pool/generic/device/kernels/dataflow/reader_max_pool_2d_multi_core_sharded_with_halo_v2.cpp
12: ttnn/cpp/ttnn/operations/pool/generic/device/kernels/dataflow/reader_max_pool_2d_multi_core_sharded_with_halo_v2.cpp
13: ttnn/cpp/ttnn/operations/pool/generic/device/kernels/compute/max_pool_multi_core.cpp
14: ttnn/cpp/ttnn/operations/reduction/generic/device/kernels/dataflow/reader_unary_transpose_wh_interleaved_input_cols_partitioned.cpp
15: ttnn/cpp/ttnn/operations/eltwise/unary/device/kernels/dataflow/writer_unary_interleaved_start_id.cpp
16: ttnn/cpp/ttnn/operations/reduction/generic/device/kernels/compute/reduce_h.cpp

Output:

Running First Mean
Got here. PCC is okay.
Running maxpool
Running Second Mean
Traceback (most recent call last):
  File "run_pool.py", line 58, in <module>
    test_mean(device, 1, 32, 32, -2)
  File "run_pool.py", line 46, in test_mean
    assert_with_pcc(torch_output_tensor, output_tensor)
  File "/proj_sw/user_dev/bbradel/tt-metal/tests/ttnn/utils_for_testing.py", line 57, in assert_with_pcc
    assert pcc_passed, construct_pcc_assert_message(pcc_message, expected_pytorch_result, actual_pytorch_result)
AssertionError: 0.712975805693972
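Note that the output shows three steps (mean, maxpool, mean), while the script as posted calls only maxpool and then mean. The sketch below models that three-step ordering check without hardware. `run_sequence`, `fake_device`, `fake_mean`, and `fake_maxpool` are hypothetical stand-ins, not ttnn APIs; the fake ops only mimic "one op leaves state behind that a later op is sensitive to".

```python
def run_sequence(steps):
    """Run (label, callable) steps in order and record pass or fail per step."""
    outcomes = []
    for label, step in steps:
        print(f"Running {label}")
        try:
            step()
        except AssertionError as exc:
            outcomes.append((label, f"fail: {exc}"))
        else:
            outcomes.append((label, "pass"))
    return outcomes


# Stand-in for hardware: a flag that one op sets and another op depends on.
fake_device = {"stale_config": False}


def fake_mean():
    # Fails only if an earlier op left state behind.
    assert not fake_device["stale_config"], "stale config"


def fake_maxpool():
    # Passes on its own but leaves state behind.
    fake_device["stale_config"] = True


outcomes = run_sequence(
    [
        ("First Mean", fake_mean),
        ("maxpool", fake_maxpool),
        ("Second Mean", fake_mean),
    ]
)
```

Running the same op before and after the suspect op, within one device session, is what separates "mean is broken" from "mean is broken only after maxpool".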

@bbradelTT
Contributor

@ncvetkovicTT I'll try to isolate what is causing the problem. Feels like something in the inits.

@bbradelTT
Contributor

I tried making changes to ttnn/cpp/ttnn/operations/pool/generic/device/kernels/compute/max_pool_multi_core.cpp

After adding

    tilize_uninit(in_cb_id,out_cb_id);
    pack_untilize_uninit(out_cb_id);

the PCC is still bad on the second call.

After commenting out

    pack_untilize_dst_init_short<in_ntiles_c>(
        out_cb_id, num_out_rows, num_faces_in_tile); /* pack 1 row (1x16 or 1x32) */

the PCC is good on the second call.

@bbradelTT
Contributor

Related to #15824

@bbradelTT
Contributor

Narrowed it down further:

Commented out

tt_metal/include/compute_kernel_api/pack_untilize.h

    PACK((llk_pack_untilize_init<block_ct_dim, full_ct_dim, diagonal, narrow_row, row_num_datums>(
        ocb, face_r_dim, num_faces)));

and reduce worked.
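The narrowing-down done by hand above (comment out one call, rebuild, rerun) follows a one-at-a-time ablation pattern, sketched below. `ablate_one_at_a_time` and `fake_check` are hypothetical; only `pack_untilize_dst_init_short` is a name from this thread, and the other entries in `INIT_CALLS` are placeholders. In practice `check_passes` would mean "rebuild the kernel with only these calls enabled and rerun the mean test".

```python
def ablate_one_at_a_time(calls, check_passes):
    """Disable each call in turn; return those whose removal makes the check pass."""
    culprits = []
    for disabled in calls:
        enabled = [c for c in calls if c != disabled]
        if check_passes(enabled):
            culprits.append(disabled)
    return culprits


# Only pack_untilize_dst_init_short comes from the thread; the rest are placeholders.
INIT_CALLS = [
    "hypothetical_init_a",
    "pack_untilize_dst_init_short",
    "hypothetical_init_b",
]


def fake_check(enabled):
    # Stand-in for a rebuild plus rerun; mirrors the result reported in this thread.
    return "pack_untilize_dst_init_short" not in enabled


culprits = ablate_one_at_a_time(INIT_CALLS, fake_check)
```

This only identifies which call is involved; it does not explain why the state it sets persists into the next op.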

@prajaramanTT

@bbradelTT Is this still an open issue? If not, can you please mark it as closed? Thanks.

@ncvetkovicTT
Contributor

@prajaramanTT It is still open, being tracked together with #15824
