-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Addressing nan/inf related to bugs #11243
Conversation
2eda24a
to
9943468
Compare
9540dee
to
c6cd9de
Compare
static constexpr float NAN_BH = NAN_WHB0; | ||
|
||
static constexpr float INF_GS = 1.6948e38; | ||
static constexpr float INF_WHB0 = 1.7014e+38; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you find this number?
I mean INF_WHB0.
I set every element of both bfloat16 tiles A and B to the max value and add two tiles.
Then, what I get is 0x7FB0 in every element of the output tile which equals to the PyTorch output and this.
I tested this in Wormhole B0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if I print INF_WHB0 in hexadecimal, I get 7EFFFF8B.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@umadevimcw Any comment about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
When I tried to do operation that results in Nan/inf i got this value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/Handling_Special_Value/special_values.md#representation
I get the same values noted in this page.
But the numbers you added in the file differs.
I don't know how the above numbers in the file (tt_metal/impl/device/device.hpp) can be generated.
@ttmtrajkovic Can you give any advice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@umadevimcw
Would you tell me how can I reproduce the output you got?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@razorback3 I ran the test mentioned in this issue #6720 #6722 (as mentioned in the description on this PR) and observed the values updated in the device.hpp file. I didn't do any changes i ran the tests mentioned in the issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@razorback3,
The NaN values from device.hpp are incorrect, I am not sure where did they come from but it’s possible that they got added before I got to define them through the tech report.
I will find the owner of this file and update this, but in the meantime, please update locally to match values I’ve specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@umadevimcw
I think you misunderstood the output value from each test (issue).
In #6720, it says that the NPU results "169476569462576773795235400185743933440.00000" rather than "Inf".
In #6722, it says that the NPU results "70039981404865953792.00000" rather than "NaN".
However, it does not mean those values are NaN/Infs in the NPU.
This is because NaN/Infs occurred in the NPU and then changed to some other numbers during multiple computation steps.
If you see this page, there are some cases where NaN/Infs are not propagated as intended.
So, if you make another op implementation that can produce NaN/Infs but fails to propagate NaN/Infs correctly, it would output other values than those that are stated in the device.hpp file.
To conclude, as ttmtrajkovic stated, the correct representation of NaN/Infs in NPU are as summarized in here which differs from your commit. We have to think about other ways to correctly handle the failing test cases.
Please share your thoughts. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed here developer/user have to handle the storage of Nan /inf due to hardware limitations.
Hence we have used nan_to_num to compare Nan to numbers. As mentioned in earlier comments special value doc are updated recently ( got into repo after this PR) updating param with correct values is appropriate one. Will update it in separate PR
Ticket
#6720
#6722
#6723
#6991
Problem description
What's changed
Based on our discussion we will be replacing the nan / inf output from the torch with the numbers to match the TT output
For this, I have used a
torch.nan_to_num
functionChecklist