Test failures with GPUs #276

bwohlberg · 2022-04-13T16:15:11Z

Two unit tests on main are failing when run on GPUs. One of them should be easy to resolve via the usual strategy of changing a comparison tolerance, but the other one looks a bit more serious.

=================================== FAILURES ===================================
__________________________ test_minimize[CG-float32] ___________________________

dtype = <class 'jax.numpy.float32'>, method = 'CG'

    @pytest.mark.parametrize("dtype", [snp.float32, snp.complex64])
    @pytest.mark.parametrize("method", ["CG", "L-BFGS-B"])
    def test_minimize(dtype, method):
	from scipy.linalg import block_diag

	B, M, N = (4, 3, 2)

	# Models a 12x8 block-diagonal matrix with 4x3 blocks
	A, key = random.randn((B, M, N), dtype=dtype)
	x, key = random.randn((B, N), dtype=dtype)
	y = snp.sum(A * x[:, None], axis=2)  # contract along the N axis

	# result by directly inverting the dense matrix
	A_mat = block_diag(*A)
>       expected = np.linalg.pinv(A_mat) @ y.ravel()

scico/test/test_solver.py:172:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
<__array_function__ internals>:5: in pinv
    ???
/miniconda3/envs/py39gpu/lib/python3.9/site-packages/numpy/linalg/linalg.py:2002: in pinv
    u, s, vt = svd(a, full_matrices=False, hermitian=hermitian)
<__array_function__ internals>:5: in svd
    ???
/miniconda3/envs/py39gpu/lib/python3.9/site-packages/numpy/linalg/linalg.py:1660: in svd
    u, s, vh = gufunc(a, signature=signature, extobj=extobj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

err = 'invalid value', flag = 8

    def _raise_linalgerror_svd_nonconvergence(err, flag):
>       raise LinAlgError("SVD did not converge")
E       numpy.linalg.LinAlgError: SVD did not converge

/miniconda3/envs/py39gpu/lib/python3.9/site-packages/numpy/linalg/linalg.py:97: LinAlgError
_________________________ test_binary_op[float32-add] __________________________

testobj = <test_linop.LinearOperatorTestObj object at 0x7faa2049e160>
operator = <built-in function add>

    @pytest.mark.parametrize("operator", [op.add, op.sub])
    def test_binary_op(testobj, operator):
	# Our AbsMatOp class does not override the __add__, etc
	# so AbsMatOp + AbsMatOp -> LinearOperator
	# So to verify results, we evaluate the new LinearOperator on a random input

	comp_mat = operator(testobj.A, testobj.B)  # composite matrix
	comp_op = operator(testobj.Ao, testobj.Bo)  # composite linop

	assert isinstance(comp_op, linop.LinearOperator)  # Ensure we don't get a Map
	assert comp_op.input_dtype == testobj.A.dtype
>       np.testing.assert_allclose(comp_mat @ testobj.x, comp_op @ testobj.x, rtol=5e-5)
E       AssertionError:
E       Not equal to tolerance rtol=5e-05, atol=0
E
E       Mismatched elements: 1 / 8 (12.5%)
E       Max absolute difference: 9.536743e-07
E       Max relative difference: 7.0016424e-05
E        x: array([-2.417045,  1.69006 ,  2.91617 ,  0.009365,  3.242247,  9.085916,
E               7.729687, -7.39012 ], dtype=float32)
E        y: array([-2.417045,  1.69006 ,  2.916171,  0.009364,  3.242248,  9.085917,
E               7.729687, -7.39012 ], dtype=float32)

scico/test/linop/test_linop.py:108: AssertionError
=========================== short test summary info ============================
FAILED scico/test/test_solver.py::test_minimize[CG-float32] - numpy.linalg.Li...
FAILED scico/test/linop/test_linop.py::test_binary_op[float32-add] - Assertio...
====== 2 failed, 3071 passed, 3 skipped, 11 xfailed in 742.76s (0:12:22) =======
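
The test_binary_op failure is the tolerance case: the reported maximum relative difference (about 7.0e-05) is only slightly above rtol=5e-05, and with atol=0 the smallest-magnitude element dominates the check. The following is a minimal sketch, using the rounded values printed in the log (so the relative difference computed here is not exactly the one reported), of how a looser rtol or a small atol would let the comparison pass. The values 2e-4 and 1e-6 are illustrative assumptions, not the tolerance chosen in the eventual fix.

```python
import numpy as np

# Rounded values copied from the failure report above (float32).
x = np.array(
    [-2.417045, 1.69006, 2.91617, 0.009365, 3.242247, 9.085916, 7.729687, -7.39012],
    dtype=np.float32,
)
y = np.array(
    [-2.417045, 1.69006, 2.916171, 0.009364, 3.242248, 9.085917, 7.729687, -7.39012],
    dtype=np.float32,
)

# Element-wise relative difference, measured against y as assert_allclose does.
rel_diff = np.abs(x - y) / np.abs(y)
worst = int(np.argmax(rel_diff))  # index of the element that breaks the check

# The check is |x - y| <= atol + rtol * |y| for every element.
strict_ok = np.allclose(x, y, rtol=5e-5, atol=0)         # tolerance used by the test
loose_rtol_ok = np.allclose(x, y, rtol=2e-4, atol=0)     # assumed looser rtol
small_atol_ok = np.allclose(x, y, rtol=5e-5, atol=1e-6)  # assumed small atol

print(worst, x[worst], y[worst], rel_diff[worst])
print(strict_ok, loose_rtol_ok, small_atol_ok)
```

The offending element is the one closest to zero, where an absolute error of roughly one float32 rounding step on the larger entries translates into a comparatively large relative error. A small atol targets exactly that situation without loosening the check on the larger entries.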


Michael-T-McCann · 2022-07-28T18:28:41Z

I only get the second (easier to fix) failure when I run these, with jaxlib == 0.3.5+cuda11.cudnn82 and jax == 0.3.5. @bwohlberg can you help me reproduce this?

bwohlberg · 2022-08-01T16:12:46Z

I was able to replicate the error with Python 3.9.12, jaxlib 0.3.5 (0.3.5+cuda11.cudnn82) and jax 0.36. It does not occur when running only pytest -x scico/test/test_solver.py, so the failure must depend on state left behind by tests that run before the failing one.
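
One way to narrow down this kind of order dependence is to bisect the set of test files that run before the failing one. The sketch below is a hypothetical helper, not part of SCICO: `run_fails` and `find_culprit` are names introduced here for illustration. It assumes the failure is deterministic, that a single earlier file is responsible, and that the earlier files pass on their own (otherwise `-x` would stop on their failure instead).

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence


def run_fails(earlier: Sequence[str], target: str) -> bool:
    """Run pytest on the `earlier` files followed by `target`.

    Returns True if the run reports a failure. Assumes the earlier
    files pass by themselves, so any failure comes from `target`.
    """
    cmd = [sys.executable, "-m", "pytest", "-x", "-q", *earlier, target]
    return subprocess.run(cmd).returncode != 0


def find_culprit(
    candidates: Sequence[str], fails: Callable[[Sequence[str]], bool]
) -> Optional[str]:
    """Bisect `candidates` down to one file whose presence triggers the failure.

    `fails(subset)` must return True when running `subset` before the
    target test reproduces the failure. Assumes exactly one culprit.
    """
    remaining = list(candidates)
    if not fails(remaining):
        return None  # the failure does not reproduce with these files
    while len(remaining) > 1:
        mid = len(remaining) // 2
        first_half = remaining[:mid]
        # Keep whichever half still reproduces the failure.
        remaining = first_half if fails(first_half) else remaining[mid:]
    return remaining[0]
```

In use, `fails` would be something like `lambda subset: run_fails(subset, "scico/test/test_solver.py")`, with `candidates` being the test files that pytest collects before test_solver.py in the full run. If more than one earlier file contributes, the single-culprit assumption breaks and the result should be treated only as a starting point.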

bwohlberg added the bug and tests labels Apr 13, 2022

bwohlberg added this to the Release 0.0.3 milestone Jul 1, 2022

Michael-T-McCann self-assigned this Jul 28, 2022

Michael-T-McCann mentioned this issue Jul 28, 2022

Bump test tolerance #318

Merged

bwohlberg pushed a commit that referenced this issue Aug 1, 2022

Resolve #276

aab93c1

bwohlberg mentioned this issue Aug 2, 2022

Miscellaneous changes #321

Merged

Michael-T-McCann closed this as completed in 45ea985 Aug 3, 2022
