Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addressing nan/inf related to bugs #11243

Merged
merged 6 commits into from
Sep 5, 2024
Merged

Conversation

umadevimcw
Copy link
Contributor

@umadevimcw umadevimcw commented Aug 9, 2024

Ticket

#6720
#6722
#6723
#6991

Problem description

  • Addressing the issues for the ops that have nan/inf-related issues

What's changed

Based on our discussion we will be replacing the nan / inf output from the torch with the numbers to match the TT output
For this, I have used a torch.nan_to_num function

Checklist

@umadevimcw umadevimcw force-pushed the umadevimcw/nan_inf_issue branch from 2eda24a to 9943468 Compare August 26, 2024 10:44
@umadevimcw umadevimcw marked this pull request as ready for review August 26, 2024 11:00
@umadevimcw umadevimcw force-pushed the umadevimcw/nan_inf_issue branch from 9540dee to c6cd9de Compare September 5, 2024 06:04
@umadevimcw umadevimcw merged commit f69ab8f into main Sep 5, 2024
107 checks passed
@umadevimcw umadevimcw deleted the umadevimcw/nan_inf_issue branch September 5, 2024 07:26
static constexpr float NAN_BH = NAN_WHB0;

static constexpr float INF_GS = 1.6948e38;
static constexpr float INF_WHB0 = 1.7014e+38;
Copy link
Contributor

@razorback3 razorback3 Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you find this number?
I mean INF_WHB0.

I set every element of both bfloat16 tiles A and B to the max value and add two tiles.
Then, what I get is 0x7FB0 in every element of the output tile which equals to the PyTorch output and this.

I tested this in Wormhole B0.

Copy link
Contributor

@razorback3 razorback3 Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if I print INF_WHB0 in hexadecimal, I get 7EFFFF8B.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@umadevimcw Any comment about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

When I tried to do operation that results in Nan/inf i got this value

Copy link
Contributor

@razorback3 razorback3 Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/Handling_Special_Value/special_values.md#representation
I get the same values noted in this page.
But the numbers you added in the file differs.
I don't know how the above numbers in the file (tt_metal/impl/device/device.hpp) can be generated.

@ttmtrajkovic Can you give any advice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@umadevimcw
Would you tell me how can I reproduce the output you got?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@razorback3 I ran the test mentioned in this issue #6720 #6722 (as mentioned in the description on this PR) and observed the values updated in the device.hpp file. I didn't do any changes i ran the tests mentioned in the issues

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@razorback3,
The NaN values from device.hpp are incorrect, I am not sure where did they come from but it’s possible that they got added before I got to define them through the tech report.
I will find the owner of this file and update this, but in the meantime, please update locally to match values I’ve specified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@umadevimcw
I think you misunderstood the output value from each test (issue).
In #6720, it says that the NPU results "169476569462576773795235400185743933440.00000" rather than "Inf".
In #6722, it says that the NPU results "70039981404865953792.00000" rather than "NaN".

However, it does not mean those values are NaN/Infs in the NPU.
This is because NaN/Infs occurred in the NPU and then changed to some other numbers during multiple computation steps.
If you see this page, there are some cases where NaN/Infs are not propagated as intended.
So, if you make another op implementation that can produce NaN/Infs but fails to propagate NaN/Infs correctly, it would output other values than those that are stated in the device.hpp file.

To conclude, as ttmtrajkovic stated, the correct representation of NaN/Infs in NPU are as summarized in here which differs from your commit. We have to think about other ways to correctly handle the failing test cases.

Please share your thoughts. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@razorback3
#8945 (comment)

As discussed here developer/user have to handle the storage of Nan /inf due to hardware limitations.
Hence we have used nan_to_num to compare Nan to numbers. As mentioned in earlier comments special value doc are updated recently ( got into repo after this PR) updating param with correct values is appropriate one. Will update it in separate PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants