
Non-deterministic wrong result from tensordot #79

Closed
manopapad opened this issue Aug 19, 2021 · 4 comments
Labels: bug (Something isn't working)

@manopapad
Contributor

This program (derived from tests/tensordot.py):

import legate.numpy as lg
import numpy as np

a = lg.random.rand(3, 5, 4).astype(np.float16)
b = lg.random.rand(4, 5, 3).astype(np.float16)

a = lg.random.rand(3, 5, 4).astype(np.float16)
b = lg.random.rand(5, 4, 3).astype(np.float16)
cn = np.tensordot(a, b)
print('cn', flush=True)
print(cn, flush=True)
c = lg.tensordot(a, b)
print('c', flush=True)
print(c, flush=True)

assert np.allclose(cn, c)

when run as follows:

LEGATE_TEST=1 legate 79.py -lg:numpy:test --cpus 4

fails about 20% of the time, with:

cn
[[4.07  4.83  5.01 ]
 [4.2   4.562 5.863]
 [4.344 4.52  3.914]]
c
[[4.07  4.83  5.01 ]
 [4.2   4.562 5.863]
 [4.344 4.52  3.916]]
[0 - 700005133000]    0.946367 {6}{python}: python exception occurred within task:
Traceback (most recent call last):
  File "/Users/mpapadakis/legate.core/install/lib/python3.8/site-packages/legion_top.py", line 410, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "/Users/mpapadakis/legate.core/install/lib/python3.8/site-packages/legion_top.py", line 234, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "79.py", line 16, in <module>
    assert np.allclose(cn, c)
AssertionError
@manopapad manopapad added the bug Something isn't working label Aug 19, 2021
@marcinz marcinz self-assigned this Aug 25, 2021
@magnatelee
Contributor

I feel discrepancies like this are inevitable with float16, because region reductions are non-deterministic. Do we know whether this issue reproduces on a single CPU? If not, I wouldn't worry too much about it.
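For illustration (plain NumPy, nothing Legate-specific), float16 addition is not associative, so the rounded result of a reduction can depend on how the partial sums are grouped, which is exactly what changes when the work is split across processors:

import numpy as np

# float16 has a 10-bit mantissa, so consecutive representable integers above
# 2048 are spaced 2 apart; the grouping decides whether the small addends survive.
a = np.float16(2048.0)
b = np.float16(1.0)
c = np.float16(1.0)
print((a + b) + c)   # 2048.0 -- each 1.0 is rounded away in turn
print(a + (b + c))   # 2050.0 -- the two 1.0s combine before being absorbed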

@lightsighter
Contributor

I think @manopapad and I have convinced ourselves that this is a precision issue. np.allclose has a fixed tolerance regardless of the underlying types. The resulting value on a failing run is 1 ulp away from the "correct" value, so it's highly likely that something got rounded differently depending on the ordering. The solution should be to relax the tolerance for np.allclose for 16-bit floating point types, in accordance with their reduced precision.
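A minimal sketch of that argument in plain NumPy, using the values from the output above:

import numpy as np

# np.allclose defaults to rtol=1e-5, atol=1e-8, which assumes float64-level precision.
val = np.float16(3.914)
print(np.spacing(val))                        # ~0.00195: one ulp of float16 near 4
print(np.finfo(np.float16).eps)               # ~0.000977: float16 machine epsilon
print(np.allclose(3.914, 3.916))              # False: the 1-ulp gap exceeds the default tolerance
print(np.allclose(3.914, 3.916, rtol=1e-3))   # True: tolerance matched to float16 precision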

@manopapad
Contributor Author

@marcinz It sounds like 0.1% relative accuracy is a more reasonable expectation when dealing with 16-bit floating point types. Could you relax the np.allclose tolerance on test runs that use float16, so we don't get spurious failures in CI?
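One possible shape for that change, sketched here as a hypothetical helper (the actual fix that landed on branch-21.10 may look different):

import numpy as np

def allclose_for_dtype(a, b):
    # Hypothetical helper: loosen the relative tolerance to ~0.1% when either
    # operand is float16, and fall back to NumPy's defaults otherwise.
    a, b = np.asarray(a), np.asarray(b)
    if a.dtype == np.float16 or b.dtype == np.float16:
        return np.allclose(a, b, rtol=1e-3)
    return np.allclose(a, b)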

manopapad added a commit that referenced this issue Dec 15, 2021
@manopapad
Contributor Author

I have ported @marcinz's fix from branch-21.10

fduguet-nv pushed a commit to fduguet-nv/cunumeric that referenced this issue Mar 29, 2022