Addressing nan/inf related to bugs #11243

umadevimcw · 2024-08-09T07:38:33Z

Ticket

Problem description

Addressing the ops that have nan/inf-related issues.

What's changed

Based on our discussion, we replace the nan/inf values in the torch output with finite numbers so that it matches the TT output.
To do this, I used the torch.nan_to_num function.
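The replacement step described above can be sketched without torch. The helper below is a stdlib stand-in that mirrors the documented behaviour of torch.nan_to_num (NaN becomes 0.0 by default, and +inf/-inf become the largest/smallest finite values unless overridden); it is not the code from this PR. `DEVICE_INF` reuses the `INF_WHB0` constant from the diff, while `DEVICE_NAN` is a placeholder chosen only for illustration.

```python
import math
import sys


def nan_to_num(values, nan=0.0, posinf=None, neginf=None):
    """Stdlib stand-in for torch.nan_to_num on a plain list of floats."""
    if posinf is None:
        posinf = sys.float_info.max   # largest finite double
    if neginf is None:
        neginf = -sys.float_info.max  # smallest finite double
    out = []
    for v in values:
        if math.isnan(v):
            out.append(nan)
        elif v == math.inf:
            out.append(posinf)
        elif v == -math.inf:
            out.append(neginf)
        else:
            out.append(v)
    return out


DEVICE_INF = 1.7014e38  # INF_WHB0 as it appears in the diff
DEVICE_NAN = 7.0e19     # placeholder only, not a value defined by the PR

golden = [1.0, math.inf, math.nan]          # what the reference would return
device_out = [1.0, DEVICE_INF, DEVICE_NAN]  # what the device is assumed to return

# Map the special values in the reference onto the device's finite numbers,
# after which an ordinary element-wise comparison can be used.
cleaned = nan_to_num(golden, nan=DEVICE_NAN, posinf=DEVICE_INF, neginf=-DEVICE_INF)
print(cleaned == device_out)
```

With real tensors the equivalent call would be `torch.nan_to_num(golden, nan=..., posinf=..., neginf=...)`, followed by the usual tolerance-based comparison.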

Checklist

Post commit CI passes

tests/ttnn/unit_tests/operations/test_math.py

razorback3 · 2024-10-11T08:17:46Z

tt_metal/impl/device/device.hpp

+static constexpr float  NAN_BH = NAN_WHB0;
+
+static constexpr float  INF_GS = 1.6948e38;
+static constexpr float  INF_WHB0 = 1.7014e+38;


How did you find this number?
I mean INF_WHB0.

I set every element of two bfloat16 tiles, A and B, to the max value and added the two tiles.
What I get is 0x7FB0 in every element of the output tile, which equals the PyTorch output and this.

I tested this in Wormhole B0.

Also, if I print INF_WHB0 in hexadecimal, I get 7EFFFF8B.
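The hexadecimal check above can be reproduced on the host with the standard library. This is a minimal sketch that assumes the constant is stored as an IEEE 754 binary32 value, as the `float` type in the diff suggests; `float32_bits` is a helper name introduced here.

```python
import struct

INF_WHB0 = 1.7014e+38  # value from the device.hpp diff


def float32_bits(x):
    """Return the IEEE 754 binary32 bit pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


bits = float32_bits(INF_WHB0)
exponent = (bits >> 23) & 0xFF  # 8-bit biased exponent field
mantissa = bits & 0x7FFFFF      # 23-bit fraction field

# In binary32, inf and NaN both have an all-ones exponent field (0xFF).
is_special = exponent == 0xFF

print(f"{bits:08X}")                  # 7EFFFF8B
print(f"{exponent:02X}", is_special)  # FD False
```

The exponent field is 0xFD rather than 0xFF, so under IEEE 754 rules the constant is an ordinary finite number, not an infinity encoding.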

@umadevimcw Any comment about this?

Hi,

When I ran an operation that results in NaN/inf, I got this value.

https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/Handling_Special_Value/special_values.md#representation
I get the same values as noted on this page,
but the numbers you added to the file differ.
I don't know how the numbers in the file (tt_metal/impl/device/device.hpp) were generated.

@ttmtrajkovic Can you give any advice?

@umadevimcw
Could you tell me how I can reproduce the output you got?

@razorback3 I ran the tests mentioned in issues #6720 and #6722 (as noted in the description of this PR) and observed the values that I added to device.hpp. I didn't make any changes; I only ran the tests mentioned in those issues.

@razorback3,
The NaN values in device.hpp are incorrect. I am not sure where they came from, but it's possible that they were added before I defined them in the tech report.
I will find the owner of this file and get it updated. In the meantime, please update your local copy to match the values I've specified.

@umadevimcw
I think you misunderstood the output value of each test (issue).
#6720 says that the NPU returns "169476569462576773795235400185743933440.00000" rather than "Inf".
#6722 says that the NPU returns "70039981404865953792.00000" rather than "NaN".

However, that does not mean those values are how the NPU represents NaN/Inf.
The NaN/Inf occurred in the NPU and was then changed into some other number over multiple computation steps.
As this page shows, there are some cases where NaN/Inf is not propagated as intended.
So, if you write another op implementation that can produce NaN/Inf but fails to propagate it correctly, it would output values other than those stated in the device.hpp file.
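A toy model may make this argument concrete. It is purely illustrative: `HYPOTHETICAL_INF`, `overflowing_op`, and `follow_up_op` are made up for this sketch and do not describe how the device actually behaves. The point is only that once a special value is stored as a finite number and a later step treats it as ordinary data, the final result is no longer that encoding.

```python
import math

HYPOTHETICAL_INF = 1.0e38  # made-up finite stand-in for "inf", not a real device constant


def overflowing_op():
    """Toy op whose true result is +inf but which emits the finite stand-in."""
    return HYPOTHETICAL_INF


def follow_up_op(y):
    """Toy later step that treats its input as an ordinary number."""
    return y * 0.5 + 1.0


observed = follow_up_op(overflowing_op())  # what a test would see on the toy device
reference = follow_up_op(math.inf)         # what an IEEE 754 host computes

print(observed == HYPOTHETICAL_INF)  # False: the stand-in did not survive the second step
print(math.isinf(reference))         # True: a real inf propagates
```

Under this model, reading the final test output back as "the device's encoding of inf" would give the wrong constant, which is the concern raised above.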

To conclude, as ttmtrajkovic stated, the correct representations of NaN/Inf on the NPU are the ones summarized here, which differ from your commit. We have to think about other ways to correctly handle the failing test cases.

Please share your thoughts. Thanks.

@razorback3
#8945 (comment)

As discussed here, the developer/user has to handle the storage of NaN/inf because of hardware limitations.
Hence we used nan_to_num to compare NaN against numbers. As mentioned in earlier comments, the special-value doc was updated recently (it landed in the repo after this PR), so updating the parameters with the correct values is the appropriate fix. I will do that in a separate PR.

eyonland reviewed Aug 12, 2024

tests/ttnn/unit_tests/operations/test_math.py (outdated)

umadevimcw force-pushed the umadevimcw/nan_inf_issue branch from 2eda24a to 9943468 Compare August 26, 2024 10:44

umadevimcw marked this pull request as ready for review August 26, 2024 11:00

umadevimcw requested review from arakhmati, patrickroberts, yan-zaretskiy, cfjchu, xanderchin, TT-BrianLiu, ayerofieiev-tt, dmakoviichuk-tt, razorback3 and dongjin-na as code owners August 26, 2024 11:00

umadevimcw temporarily deployed to dev August 26, 2024 13:08 — with GitHub Actions Inactive

umadevimcw temporarily deployed to dev August 26, 2024 13:18 — with GitHub Actions Inactive

umadevimcw temporarily deployed to dev August 26, 2024 13:21 — with GitHub Actions Inactive

umadevimcw temporarily deployed to production August 26, 2024 13:48 — with GitHub Actions Inactive

umadevimcw added 3 commits September 5, 2024 06:04

#0: Add nan/inf values in device file

efd7f45

#6991: Fix acosh and logit nan issue

a7e2ba1

#6991: Remove cskip test

c6cd9de

umadevimcw force-pushed the umadevimcw/nan_inf_issue branch from 9540dee to c6cd9de Compare September 5, 2024 06:04

umadevimcw temporarily deployed to dev September 5, 2024 06:06 — with GitHub Actions Inactive

umadevimcw temporarily deployed to dev September 5, 2024 06:18 — with GitHub Actions Inactive

umadevimcw temporarily deployed to dev September 5, 2024 06:21 — with GitHub Actions Inactive

umadevimcw had a problem deploying to dev September 5, 2024 06:21 — with GitHub Actions Failure

umadevimcw temporarily deployed to production September 5, 2024 06:27 — with GitHub Actions Inactive

razorback3 approved these changes Sep 5, 2024

umadevimcw temporarily deployed to dev September 5, 2024 07:16 — with GitHub Actions Inactive

umadevimcw merged commit f69ab8f into main Sep 5, 2024
107 checks passed

umadevimcw deleted the umadevimcw/nan_inf_issue branch September 5, 2024 07:26

razorback3 reviewed Oct 11, 2024

This was referenced Oct 22, 2024

Handle Inf, Nan at Kernel level for blocked ops #14077

Open

#7488: Update nan inf values #14332

Open
