
CuDNN test fails with parallel pytest-xdist #9128

Closed
bzamecnik opened this issue Jan 19, 2018 · 0 comments
bzamecnik commented Jan 19, 2018

The following fails. In pytest.ini we specify two worker processes (-n 2), so the tests run in parallel.

py.test -s tests/keras/layers/cudnn_recurrent_test.py
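The pytest.ini referred to above is not shown in the issue; the relevant fragment presumably looks something like this (a sketch, with the `[pytest]` section name and `addopts` key assumed):

```ini
# Hypothetical pytest.ini fragment: -n 2 tells pytest-xdist to
# distribute tests across two worker processes.
[pytest]
addopts = -n 2
```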

Some observed error messages from TensorFlow:

  • UnknownError: Fail to find the dnn implementation.
  • InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(32, 2), b.shape=(2, 2), m=32, n=2, k=2
  • e = InternalError(), message = 'GPU sync failed', m = None

If we disable pytest-xdist (-n 0) or use just a single worker (-n 1), it works OK:

py.test -n 1 tests/keras/layers/cudnn_recurrent_test.py

Note that the CuDNN tests require a GPU (@pytest.mark.skipif) and are not run on Travis CI, so this problem only appears with manual test invocation.
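The skipif guard mentioned above might look like the following sketch. The `gpu_available()` helper is hypothetical; the actual Keras test file uses its own condition, but the pattern of wrapping `pytest.mark.skipif` in a reusable marker is the same:

```python
import pytest

# Hypothetical GPU check: probe TensorFlow's device list for a GPU.
# (The real Keras tests use a similar skipif condition to skip the
# CuDNN tests on machines without a GPU, e.g. Travis CI.)
def gpu_available():
    try:
        from tensorflow.python.client import device_lib
        return any(d.device_type == "GPU"
                   for d in device_lib.list_local_devices())
    except ImportError:
        # No TensorFlow installed -> no usable GPU.
        return False

requires_gpu = pytest.mark.skipif(not gpu_available(),
                                  reason="CuDNN tests require a GPU")

@requires_gpu
def test_cudnn_lstm():
    # Placeholder body; the real tests build CuDNNLSTM/CuDNNGRU layers.
    pass
```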

A workaround is to run this test file in a single process (with -n 1, as above), which should be documented somewhere.

A better solution would be to enforce serial execution for the tests in this file, but so far pytest-xdist doesn't seem to support that directly.
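As an aside, later versions of pytest-xdist (2.5+, well after this issue was filed) did add a way to pin tests to one worker: the `xdist_group` marker combined with running under `--dist loadgroup`. All tests sharing a group name are sent to the same worker, so they run serially relative to each other. A sketch, with hypothetical test names:

```python
import pytest

# With pytest-xdist >= 2.5 and "py.test -n 2 --dist loadgroup", every
# test marked with the same xdist_group name runs on a single worker,
# serializing the CuDNN tests while the rest of the suite stays parallel.
@pytest.mark.xdist_group(name="cudnn")
def test_cudnn_gru():
    pass

@pytest.mark.xdist_group(name="cudnn")
def test_cudnn_lstm():
    pass
```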

System info:

  • latest Keras master: 950e5d0
  • tensorflow-gpu==1.4.1, CUDA 8.0, CuDNN 6.0