
Non-deterministic wrong result from tensordot #79

Closed
manopapad opened this issue Aug 19, 2021 · 4 comments
Labels: bug (Something isn't working)

@manopapad
Contributor

This program (derived from tests/tensordot.py):

import legate.numpy as lg
import numpy as np

a = lg.random.rand(3, 5, 4).astype(np.float16)
b = lg.random.rand(4, 5, 3).astype(np.float16)

a = lg.random.rand(3, 5, 4).astype(np.float16)
b = lg.random.rand(5, 4, 3).astype(np.float16)
cn = np.tensordot(a, b)
print('cn', flush=True)
print(cn, flush=True)
c = lg.tensordot(a, b)
print('c', flush=True)
print(c, flush=True)

assert np.allclose(cn, c)

when run as follows:

LEGATE_TEST=1 legate 79.py -lg:numpy:test --cpus 4

fails about 20% of the time, with:

cn
[[4.07  4.83  5.01 ]
 [4.2   4.562 5.863]
 [4.344 4.52  3.914]]
c
[[4.07  4.83  5.01 ]
 [4.2   4.562 5.863]
 [4.344 4.52  3.916]]
[0 - 700005133000]    0.946367 {6}{python}: python exception occurred within task:
Traceback (most recent call last):
  File "/Users/mpapadakis/legate.core/install/lib/python3.8/site-packages/legion_top.py", line 410, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "/Users/mpapadakis/legate.core/install/lib/python3.8/site-packages/legion_top.py", line 234, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "79.py", line 16, in <module>
    assert np.allclose(cn, c)
AssertionError
@manopapad manopapad added the bug Something isn't working label Aug 19, 2021
@marcinz marcinz self-assigned this Aug 25, 2021
@magnatelee
Contributor

I feel discrepancies like this are inevitable with float16, because region reductions are non-deterministic. Do we know whether this issue reproduces on a single CPU? If not, I wouldn't worry too much about it.
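For illustration (plain NumPy, nothing Legate-specific), float16 addition is not associative, so the rounded result of a reduction can depend on how the partial sums are grouped, which is exactly what changes when the work is split across processors:

import numpy as np

# float16 has a 10-bit mantissa, so consecutive representable integers above
# 2048 are spaced 2 apart; the grouping decides whether the small addends survive.
a = np.float16(2048.0)
b = np.float16(1.0)
c = np.float16(1.0)
print((a + b) + c)   # 2048.0 -- each 1.0 is rounded away in turn
print(a + (b + c))   # 2050.0 -- the two 1.0s combine before being absorbed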

@lightsighter
Contributor

I think @manopapad and I have convinced ourselves that this is a precision issue. np.allclose has a fixed tolerance regardless of the underlying types. The resulting value on a failing run is 1 ulp away from the "correct" value, so it's highly likely that something got rounded differently depending on the ordering. The solution should be to relax the tolerance for np.allclose for 16-bit floating point types, in accordance with their reduced precision.
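A minimal sketch of that argument in plain NumPy, using the values from the output above:

import numpy as np

# np.allclose defaults to rtol=1e-5, atol=1e-8, which assumes float64-level precision.
val = np.float16(3.914)
print(np.spacing(val))                        # ~0.00195: one ulp of float16 near 4
print(np.finfo(np.float16).eps)               # ~0.000977: float16 machine epsilon
print(np.allclose(3.914, 3.916))              # False: the 1-ulp gap exceeds the default tolerance
print(np.allclose(3.914, 3.916, rtol=1e-3))   # True: tolerance matched to float16 precision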

@manopapad
Contributor Author

@marcinz It sounds like 0.1% relative accuracy is a more reasonable expectation when dealing with 16-bit floating point types. Could you relax the np.allclose tolerance on test runs that use float16, so we don't get spurious failures in CI?
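One possible shape for that change, sketched here as a hypothetical helper (the actual fix that landed on branch-21.10 may look different):

import numpy as np

def allclose_for_dtype(a, b):
    # Hypothetical helper: loosen the relative tolerance to ~0.1% when either
    # operand is float16, and fall back to NumPy's defaults otherwise.
    a, b = np.asarray(a), np.asarray(b)
    if a.dtype == np.float16 or b.dtype == np.float16:
        return np.allclose(a, b, rtol=1e-3)
    return np.allclose(a, b)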

manopapad added a commit that referenced this issue Dec 15, 2021
@manopapad
Contributor Author

I have ported @marcinz's fix from branch-21.10

fduguet-nv pushed a commit to fduguet-nv/cunumeric that referenced this issue Mar 29, 2022