
[Feature Request] Improvement Needed for Unit Tests #6633

Open · 4 of 5 tasks · Tracked by #6445
hschoi4448 opened this issue Mar 21, 2024 · 17 comments
Assignees
Labels
bug (Something isn't working), master, MCW, moreh, moreh contribution, op_cat: eltwise, P1

Comments

@hschoi4448
Contributor

hschoi4448 commented Mar 21, 2024

Is your feature request related to a problem? Please describe.

I recently reviewed the backward ops and found several bugs.

  1. [Bug Report] invalid ldexp backward result #6533
  2. [Bug Report] invalid hardsigmoid backward result #6534
  3. [Bug Report] invalid asin_bw backward result #6536
  4. [Bug Report] invalid acosh backward result #6583
  5. [Bug Report] invalid relu6 backward result #6589
  6. [Bug Report] invalid softplus backward result #6598

I believe there are two main reasons why there were so many bugs in backward ops:

  1. The compare_result function often returns a pass even when the correct value and the TT result differ significantly. As a result, many bugs go unnoticed.
    ex) [Bug Report] invalid softplus backward result #6598
    (screenshot omitted)

  2. The input data used in unit tests does not always reflect the characteristics of the op under test.
    For instance, in the case of relu6, the gradient formula varies depending on whether the input falls within the range 0 to 6. To test all intervals effectively, the input data should include values around 0, 6, and nearby points, such as [-1, 0, 3, 6, 7].
    Currently, however, input data is generated with torch.randn, which produces values mostly in [-1, 1], neglecting the region around 6 and its surrounding intervals.

ex) (screenshot omitted)
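To make the relu6 point above concrete, here is a small PyTorch check (illustrative only, not repository code) showing that the gradient is 1 only strictly inside (0, 6), so randn-generated inputs almost never exercise the branch near 6:

```python
import torch

# Inputs chosen around the relu6 breakpoints; torch.randn would rarely produce 7.
x = torch.tensor([-1.0, 3.0, 7.0], requires_grad=True)
y = torch.nn.functional.relu6(x)
y.backward(torch.ones_like(x))
print(x.grad)  # 0.0 outside (0, 6), 1.0 inside
```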

I didn't run all unit tests during the review, and only checked the suspicious parts, so I believe there are actually more bugs.
Improving unit tests seems to be a high priority to address recurring issues and find hidden bugs.

Describe the solution you'd like

  1. Regarding the first issue, I'm not sure what the best solution would be.
  2. For the second issue, a method is needed to input specific values related to the characteristics of the op as test data.
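One possible shape for that method, sketched here with hypothetical names (data_gen_with_values is not the repository's API): fill the test tensor by cycling through op-specific boundary values instead of sampling torch.randn.

```python
import torch

def data_gen_with_values(input_shapes, values, required_grad=False):
    # Cycle the given boundary values (e.g. [-1, 0, 3, 6, 7] for relu6)
    # until the requested shape is filled, instead of using torch.randn.
    n = torch.Size(input_shapes).numel()
    vals = torch.tensor(values, dtype=torch.float32)
    repeats = (n + len(values) - 1) // len(values)
    pt_tensor = vals.repeat(repeats)[:n].reshape(input_shapes)
    return pt_tensor.requires_grad_(required_grad)

t = data_gen_with_values([1, 1, 32, 32], [-1.0, 0.0, 3.0, 6.0, 7.0])
```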


@razorback3
Contributor

razorback3 commented Mar 21, 2024

I think compare_result should be changed to require both the PCC check and the allclose check to pass.
(Right now, tests pass when either one of the checks passes.)
Moreover, the user should be able to set atol and rtol for compare_result.
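A minimal sketch of that change, assuming a standalone compare function (the name compare_results_strict and the default thresholds are illustrative, not the repository's actual compare_result):

```python
import torch

def compare_results_strict(tt_result, golden, pcc_threshold=0.99, atol=1e-2, rtol=1e-2):
    # Pass only when BOTH checks pass: the PCC check AND an allclose check
    # with caller-supplied atol/rtol (today, either one alone suffices).
    a = tt_result.flatten().float()
    b = golden.flatten().float()
    pcc = torch.corrcoef(torch.stack([a, b]))[0, 1].item()  # Pearson correlation
    return pcc >= pcc_threshold and torch.allclose(a, b, rtol=rtol, atol=atol)

ok = compare_results_strict(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([1.0, 2.0, 3.0]))
bad = compare_results_strict(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([1.0, 2.0, 30.0]))
```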

@jliangTT

Yes, @razorback3. I will follow up and get back to you with a response. This is really not good.

@jvasilje jvasilje added P0 bug Something isn't working labels Mar 21, 2024
@jliangTT jliangTT added op_cat: eltwise and removed feature-request External feature request labels Mar 21, 2024
umadevimcw pushed a commit that referenced this issue Mar 22, 2024
umadevimcw added a commit that referenced this issue Mar 22, 2024
umadevimcw added a commit that referenced this issue Mar 22, 2024
@umadevimcw
Contributor

@razorback3 @jliangTT @jvasilje We have updated the datagen and comparison functions in PR #6679 for debugging.
In parallel, we are updating the test files to match that PR. Once the PR is merged to main, we will submit the changes for all the ops in a separate PR.

@umadevimcw
Contributor

umadevimcw commented Mar 22, 2024

@razorback3 @jliangTT @jvasilje @hschoi4448 We are also observing the following scenario (not sure whether this is expected behavior):

  • if the input is filled with the constant 1.0, we get nan as expected
  • if the input is randomly filled with the same value 1.0, we get a large number instead (as shown in the image), which causes a PCC failure

To handle this we have to add a separate condition in the logic. We are investigating this kind of issue as well.

Scenario 1: (screenshot omitted)

Scenario 2: (screenshot omitted)

A few ops also depend on this issue: #6676

@jliangTT

jliangTT commented Apr 2, 2024

Status:

I think we can downgrade this to p1.

@jliangTT jliangTT added P1 and removed P0 labels Apr 2, 2024
@razorback3
Contributor

@jliangTT
Would you grant @hschoi4448 and me access to the Google doc?

[email protected]
[email protected]

@jliangTT

please see this doc - https://docs.google.com/spreadsheets/d/1VV-EwGJn1EgBN3jX3tg4TcX_yDkm5HAO/edit#gid=66577367

(sorry, I had to make a copy because I could not get around the access control)

@jliangTT

Will close this one for now; we can track the issue there. Please re-open if you have any concerns.

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in External Requests and Reports Apr 11, 2024
@razorback3
Contributor

OK. @hschoi4448 will double-check the result when he comes back from his vacation.

@hschoi4448
Contributor Author

#6583 still has a problem.

  1. Test code
# SPDX-FileCopyrightText: © 2023 Tenstorrent Inc.
# SPDX-License-Identifier: Apache-2.0

import torch
import pytest
import tt_lib
from tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs import compare_results

# Local override of data_gen_pt_tt: fills the tensor with a constant value
# instead of random data, so the acosh_bw domain edge (|x| < 1) is exercised.
def data_gen_pt_tt(input_shapes, device, required_grad=False, val=1):
    pt_tensor = (torch.ones(input_shapes, requires_grad=required_grad) * val).bfloat16()
    tt_tensor = (
        tt_lib.tensor.Tensor(pt_tensor, tt_lib.tensor.DataType.BFLOAT16).to(tt_lib.tensor.Layout.TILE).to(device)
    )
    return pt_tensor, tt_tensor

@pytest.mark.parametrize(
    "input_shapes",
    (
        (torch.Size([1, 1, 32, 32])),
    ),
)
def test_bw_acosh(input_shapes, device):
    in_data, input_tensor = data_gen_pt_tt(input_shapes, device, True, val=0.5)
    grad_data, grad_tensor = data_gen_pt_tt(input_shapes, device, False, val=1)

    print("input_tensor", input_tensor)
    print("grad_tensor", grad_tensor)
    
    pyt_y = torch.acosh(in_data)

    tt_output_tensor_on_device = tt_lib.tensor.acosh_bw(grad_tensor, input_tensor)

    in_data.retain_grad()

    pyt_y.backward(gradient=grad_data)

    golden_tensor = [in_data.grad]

    comp_pass = compare_results(tt_output_tensor_on_device, golden_tensor)
    
    print("tt_output_tensor_on_device", tt_output_tensor_on_device)
    print("golden_tensor", golden_tensor)
    assert comp_pass
  2. Output
input_tensor ttnn.Tensor([[[[ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
               [ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
               ...,
               [ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
               [ 0.50000,  0.50000,  ...,  0.50000,  0.50000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
grad_tensor ttnn.Tensor([[[[ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
2024-04-16 01:45:38.796 | ERROR    | tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs:get_pcc:32 - One tensor is all nan, the other is not.
2024-04-16 01:45:38.796 | ERROR    | tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs:get_pcc:32 - One tensor is all nan, the other is not.
2024-04-16 01:45:38.797 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:62 - False
2024-04-16 01:45:38.797 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:63 - False
2024-04-16 01:45:38.797 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:64 - Max ATOL Delta: nan, Max RTOL Delta: nan, PCC: 0.0, PCC check failed
tt_output_tensor_on_device [ttnn.Tensor([[[[inf     , inf     ,  ..., inf     , inf     ],
               [inf     , inf     ,  ..., inf     , inf     ],
               ...,
               [inf     , inf     ,  ..., inf     , inf     ],
               [inf     , inf     ,  ..., inf     , inf     ]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)]
golden_tensor [tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]]]], dtype=torch.bfloat16)]
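One way the comparison could handle outputs like the one above, where the device returns inf but the golden tensor is all nan (a sketch under assumed semantics, not the repository's comparison_funcs): require the nan and inf patterns to match exactly, then run allclose only over the finite positions.

```python
import torch

def compare_with_nonfinite(tt, golden, atol=1e-2, rtol=1e-2):
    # nan positions must match nan, and inf positions must match inf (same sign)
    if not torch.equal(torch.isnan(tt), torch.isnan(golden)):
        return False
    inf_mask = torch.isinf(golden)
    if not torch.equal(torch.isinf(tt), inf_mask):
        return False
    if inf_mask.any() and not torch.equal(tt[inf_mask], golden[inf_mask]):
        return False
    # Patterns match, so tt is finite wherever golden is finite; compare those.
    finite = torch.isfinite(golden)
    return torch.allclose(tt[finite], golden[finite], rtol=rtol, atol=atol)

nan, inf = float("nan"), float("inf")
# inf from the device vs nan in the golden tensor should fail, not pass
mismatch = compare_with_nonfinite(torch.tensor([inf, 1.0]), torch.tensor([nan, 1.0]))
match = compare_with_nonfinite(torch.tensor([nan, 1.0]), torch.tensor([nan, 1.0]))
```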

@razorback3 razorback3 reopened this Apr 16, 2024
@umadevimcw
Contributor

umadevimcw commented Apr 16, 2024

@hschoi4448 All of the issues tagged above have hardware and performance limitations where handling/storing nan/inf is the problem:

  • on WHB0, storing nan is not possible
  • storing nan/inf in a ckernel op causes a performance problem

If you check each issue, we have added our observations, and for a few ops we have raised a PR with a fix.

The PR is not merged yet, pending approval from the code owner.

Please find @rtawfik01's comment below:

These checks also affect performance, are the users alright with a performance hit?

Also, for the tan op we cannot support any range beyond -1.45 to 1.45. Beyond that we would have to do range reduction with modulo operations, which is not available.

@hschoi4448
Contributor Author


Understood. If it's a hardware issue with limitations on performance and functionality, it seems that it's not something I can decide on, so I'll pass it on to my team. @razorback3

@zzigler-tt

@eyonland @umadevimcw Can you please advise when this item will be unblocked, and provide a realistic remediation timeline? Thank you.

@prajaramanTT FYI
