
[Feature Request] Improvement Needed for Unit Tests #6633

Open · 4 of 5 tasks · Tracked by #6445
hschoi4448 opened this issue Mar 21, 2024 · 17 comments
Assignees
Labels
bug (Something isn't working), master, MCW, moreh, moreh contribution, op_cat: eltwise, P1

Comments

@hschoi4448
Contributor

hschoi4448 commented Mar 21, 2024

Is your feature request related to a problem? Please describe.

I recently reviewed the backward ops and found several bugs.

  1. [Bug Report] invalid ldexp backward result #6533
  2. [Bug Report] invalid hardsigmoid backward result #6534
  3. [Bug Report] invalid asin_bw backward result #6536
  4. [Bug Report] invalid acosh backward result #6583
  5. [Bug Report] invalid relu6 backward result #6589
  6. [Bug Report] invalid softplus backward result #6598

I believe there are two main reasons why there were so many bugs in backward ops:

  1. The compare_result function often returns a pass even when the correct value and the TT result differ significantly. As a result, many bugs go unnoticed.
    ex) [Bug Report] invalid softplus backward result #6598
    (screenshot omitted)

  2. The input data used in unit tests does not always reflect the characteristics of the op under test.
    For instance, in the case of relu6, the gradient formula varies depending on whether the input falls within the range 0 to 6. To test all intervals effectively, the input data should include values around 0, 6, and nearby points, such as [-1, 0, 3, 6, 7].
    Currently, however, input data is generated with torch.randn, which produces values mostly in [-1, 1], neglecting the region around 6 and its surrounding intervals.

ex) (screenshot omitted)
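To make the relu6 point above concrete, here is a small PyTorch check (illustrative only, not repository code) showing that the gradient is 1 only strictly inside (0, 6), so randn-generated inputs almost never exercise the branch near 6:

```python
import torch

# Inputs chosen around the relu6 breakpoints; torch.randn would rarely produce 7.
x = torch.tensor([-1.0, 3.0, 7.0], requires_grad=True)
y = torch.nn.functional.relu6(x)
y.backward(torch.ones_like(x))
print(x.grad)  # 0.0 outside (0, 6), 1.0 inside
```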

I didn't run all unit tests during the review, and only checked the suspicious parts, so I believe there are actually more bugs.
Improving unit tests seems to be a high priority to address recurring issues and find hidden bugs.

Describe the solution you'd like

  1. Regarding the first issue, I'm not sure what the best solution would be.
  2. For the second issue, a method is needed to input specific values related to the characteristics of the op as test data.
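One possible shape for that method, sketched here with hypothetical names (data_gen_with_values is not the repository's API): fill the test tensor by cycling through op-specific boundary values instead of sampling torch.randn.

```python
import torch

def data_gen_with_values(input_shapes, values, required_grad=False):
    # Cycle the given boundary values (e.g. [-1, 0, 3, 6, 7] for relu6)
    # until the requested shape is filled, instead of using torch.randn.
    n = torch.Size(input_shapes).numel()
    vals = torch.tensor(values, dtype=torch.float32)
    repeats = (n + len(values) - 1) // len(values)
    pt_tensor = vals.repeat(repeats)[:n].reshape(input_shapes)
    return pt_tensor.requires_grad_(required_grad)

t = data_gen_with_values([1, 1, 32, 32], [-1.0, 0.0, 3.0, 6.0, 7.0])
```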


@razorback3
Contributor

razorback3 commented Mar 21, 2024

I think compare_result should be changed to require both the PCC check and the allclose check to pass.
(Right now, tests pass when either one of the checks passes.)
Moreover, the user should be able to set atol and rtol for compare_result.
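A minimal sketch of that change, assuming a standalone compare function (the name compare_results_strict and the default thresholds are illustrative, not the repository's actual compare_result):

```python
import torch

def compare_results_strict(tt_result, golden, pcc_threshold=0.99, atol=1e-2, rtol=1e-2):
    # Pass only when BOTH checks pass: the PCC check AND an allclose check
    # with caller-supplied atol/rtol (today, either one alone suffices).
    a = tt_result.flatten().float()
    b = golden.flatten().float()
    pcc = torch.corrcoef(torch.stack([a, b]))[0, 1].item()  # Pearson correlation
    return pcc >= pcc_threshold and torch.allclose(a, b, rtol=rtol, atol=atol)

ok = compare_results_strict(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([1.0, 2.0, 3.0]))
bad = compare_results_strict(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([1.0, 2.0, 30.0]))
```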

@jliangTT

Yes, @razorback3. I will follow up and get back to you with a response. This is really not good.

@jvasilje jvasilje added P0 bug Something isn't working labels Mar 21, 2024
@jliangTT jliangTT added op_cat: eltwise and removed feature-request External feature request labels Mar 21, 2024
umadevimcw pushed a commit that referenced this issue Mar 22, 2024
umadevimcw added a commit that referenced this issue Mar 22, 2024
umadevimcw added a commit that referenced this issue Mar 22, 2024
@umadevimcw
Contributor

@razorback3 @jliangTT @jvasilje We have updated the datagen and comparison functions in PR #6679 for debugging.
In parallel, we are updating the test files to match that PR. Once the PR is merged to main, we will submit the changes for all the ops in a separate PR.

@umadevimcw
Contributor

umadevimcw commented Mar 22, 2024

@razorback3 @jliangTT @jvasilje @hschoi4448 We are also observing the following scenario (not sure whether this is expected behavior):

  • if the input is filled with the constant 1.0, we get nan as expected
  • if the input is randomly filled with the same value 1.0, we get a large number instead (as shown in the image), which causes a PCC failure

To handle this we have to add a separate condition in the logic. We are investigating this kind of issue as well.

Scenario 1: (screenshot omitted)

Scenario 2: (screenshot omitted)

A few ops also depend on this issue: #6676

@jliangTT

jliangTT commented Apr 2, 2024

Status:

I think we can downgrade this to p1.

@jliangTT jliangTT added P1 and removed P0 labels Apr 2, 2024
@razorback3
Contributor

@jliangTT
Would you grant @hschoi4448 and me access to the Google doc?

[email protected]
[email protected]

@jliangTT

please see this doc - https://docs.google.com/spreadsheets/d/1VV-EwGJn1EgBN3jX3tg4TcX_yDkm5HAO/edit#gid=66577367

(sorry, I had to make a copy because I could not get around the access control)

@jliangTT

Will close this one for now; we can track the issue there. Please re-open if you have any concerns.

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in External Requests and Reports Apr 11, 2024
@razorback3
Contributor

OK. @hschoi4448 will double-check the result when he comes back from his vacation.

@hschoi4448
Contributor Author

#6583 still has a problem.

  1. Test code
# SPDX-FileCopyrightText: © 2023 Tenstorrent Inc.
# SPDX-License-Identifier: Apache-2.0

import torch
import pytest
import tt_lib
from tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs import compare_results

# Local override of data_gen_pt_tt: fills the tensor with a constant value
# instead of random data, so the acosh_bw domain edge (|x| < 1) is exercised.
def data_gen_pt_tt(input_shapes, device, required_grad=False, val=1):
    pt_tensor = (torch.ones(input_shapes, requires_grad=required_grad) * val).bfloat16()
    tt_tensor = (
        tt_lib.tensor.Tensor(pt_tensor, tt_lib.tensor.DataType.BFLOAT16).to(tt_lib.tensor.Layout.TILE).to(device)
    )
    return pt_tensor, tt_tensor

@pytest.mark.parametrize(
    "input_shapes",
    (
        (torch.Size([1, 1, 32, 32])),
    ),
)
def test_bw_acosh(input_shapes, device):
    in_data, input_tensor = data_gen_pt_tt(input_shapes, device, True, val=0.5)
    grad_data, grad_tensor = data_gen_pt_tt(input_shapes, device, False, val=1)

    print("input_tensor", input_tensor)
    print("grad_tensor", grad_tensor)
    
    pyt_y = torch.acosh(in_data)

    tt_output_tensor_on_device = tt_lib.tensor.acosh_bw(grad_tensor, input_tensor)

    in_data.retain_grad()

    pyt_y.backward(gradient=grad_data)

    golden_tensor = [in_data.grad]

    comp_pass = compare_results(tt_output_tensor_on_device, golden_tensor)
    
    print("tt_output_tensor_on_device", tt_output_tensor_on_device)
    print("golden_tensor", golden_tensor)
    assert comp_pass
  2. Output
input_tensor ttnn.Tensor([[[[ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
               [ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
               ...,
               [ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
               [ 0.50000,  0.50000,  ...,  0.50000,  0.50000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
grad_tensor ttnn.Tensor([[[[ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
2024-04-16 01:45:38.796 | ERROR    | tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs:get_pcc:32 - One tensor is all nan, the other is not.
2024-04-16 01:45:38.796 | ERROR    | tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs:get_pcc:32 - One tensor is all nan, the other is not.
2024-04-16 01:45:38.797 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:62 - False
2024-04-16 01:45:38.797 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:63 - False
2024-04-16 01:45:38.797 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:64 - Max ATOL Delta: nan, Max RTOL Delta: nan, PCC: 0.0, PCC check failed
tt_output_tensor_on_device [ttnn.Tensor([[[[inf     , inf     ,  ..., inf     , inf     ],
               [inf     , inf     ,  ..., inf     , inf     ],
               ...,
               [inf     , inf     ,  ..., inf     , inf     ],
               [inf     , inf     ,  ..., inf     , inf     ]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)]
golden_tensor [tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]]]], dtype=torch.bfloat16)]
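One way the comparison could handle outputs like the one above, where the device returns inf but the golden tensor is all nan (a sketch under assumed semantics, not the repository's comparison_funcs): require the nan and inf patterns to match exactly, then run allclose only over the finite positions.

```python
import torch

def compare_with_nonfinite(tt, golden, atol=1e-2, rtol=1e-2):
    # nan positions must match nan, and inf positions must match inf (same sign)
    if not torch.equal(torch.isnan(tt), torch.isnan(golden)):
        return False
    inf_mask = torch.isinf(golden)
    if not torch.equal(torch.isinf(tt), inf_mask):
        return False
    if inf_mask.any() and not torch.equal(tt[inf_mask], golden[inf_mask]):
        return False
    # Patterns match, so tt is finite wherever golden is finite; compare those.
    finite = torch.isfinite(golden)
    return torch.allclose(tt[finite], golden[finite], rtol=rtol, atol=atol)

nan, inf = float("nan"), float("inf")
# inf from the device vs nan in the golden tensor should fail, not pass
mismatch = compare_with_nonfinite(torch.tensor([inf, 1.0]), torch.tensor([nan, 1.0]))
match = compare_with_nonfinite(torch.tensor([nan, 1.0]), torch.tensor([nan, 1.0]))
```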

@razorback3 razorback3 reopened this Apr 16, 2024
@umadevimcw
Contributor

umadevimcw commented Apr 16, 2024

@hschoi4448 All of the issues tagged above have hardware and performance limitations where handling/storing nan/inf is the problem:

  • on WHB0, storing nan is not possible
  • storing nan/inf in a ckernel op causes a performance problem

If you check each issue, we have added our observations, and for a few ops we have raised a PR with a fix.

The PR is not merged yet, pending approval from the code owner.

Please find @rtawfik01's comment below:

These checks also affect performance, are the users alright with a performance hit?

Also, for the tan op we cannot support any range beyond -1.45 to 1.45. Beyond that we would have to do range reduction with modulo operations, which is not available.

@hschoi4448
Contributor Author


Understood. If it's a hardware issue with limitations on performance and functionality, it seems that it's not something I can decide on, so I'll pass it on to my team. @razorback3

@zzigler-tt

@eyonland @umadevimcw Can you please advise when this item will be unblocked, and provide a realistic remediation timeline? Thank you.

@prajaramanTT FYI
