SD test failure #3826

Closed
fweik opened this issue Jul 27, 2020 · 16 comments · Fixed by #3987

Comments

@fweik
Contributor

fweik commented Jul 27, 2020

The Stokesian Dynamics GPU test fails on my machine with

======================================================================
FAIL: test_default_ft (__main__.StokesianDynamicsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/ssd/fweik/espresso/cmake-build-release/testsuite/python/stokesian_dynamics_gpu.py", line 42, in test_default_ft
    self.falling_spheres(1.0, 1.0, 1.0, 'ft', sd_short=True)
  File "/ssd/fweik/espresso/cmake-build-release/testsuite/python/stokesian_dynamics.py", line 122, in falling_spheres
    self.assertLess(idx, intsteps, msg='Insufficient sampling')
AssertionError: 1300 not less than 1300 : Insufficient sampling

======================================================================
FAIL: test_default_fts (__main__.StokesianDynamicsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/ssd/fweik/espresso/cmake-build-release/testsuite/python/stokesian_dynamics_gpu.py", line 39, in test_default_fts
    self.falling_spheres(1.0, 1.0, 1.0, 'fts', sd_short=True)
  File "/ssd/fweik/espresso/cmake-build-release/testsuite/python/stokesian_dynamics.py", line 122, in falling_spheres
    self.assertLess(idx, intsteps, msg='Insufficient sampling')
AssertionError: 1300 not less than 1300 : Insufficient sampling

System Info

/tikhome/fweik/Base/opt/clion-2020.1/bin/cmake/linux/bin/cmake -DCMAKE_BUILD_TYPE=Release -DWITH_CCACHE=ON -DWITH_CUDA=ON -G "CodeBlocks - Unix Makefiles" /ssd/fweik/espresso
-- CMake version: 3.16.5
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found ccache /usr/bin/ccache
-- Config file: /ssd/fweik/espresso/cmake-build-release/myconfig.hpp
-- Performing Test result__PRETTY_FUNCTION__
-- Performing Test result__PRETTY_FUNCTION__ - Success
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr (found suitable version "9.1", minimum required is "9.0") 
-- Found CudaCompilerNVCC: /usr/bin/nvcc (found version "9.1.85") 
-- Found PythonInterp: /usr/bin/python3 (found suitable version "3.6.9", minimum required is "3.5") 
-- Found Cython: /usr/bin/python3;-m;cython (found suitable version "0.26.1", minimum required is "0.26") 
-- Found PythonHeaders: /usr/include/python3.6m  
-- Found NumPy: /usr/lib/python3/dist-packages/numpy/core/include (found version "1.13.3") 
-- Found FFTW3: /usr/lib/x86_64-linux-gnu/libfftw3.so  
-- HDF5: Using hdf5 compiler wrapper to determine C configuration
-- Found HDF5: /usr/lib/x86_64-linux-gnu/hdf5/openmpi/libhdf5.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so (found suitable version "1.10.0.1", minimum required is "1.8") found components: C 
-- Found Git: /usr/bin/git (found version "2.17.1") 
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") 
-- Found GSL: /usr/include (found version "2.4") 
-- Looking for sgemm_
-- Looking for sgemm_ - not found
-- Looking for sgemm_
-- Looking for sgemm_ - found
-- Found BLAS: /usr/lib/x86_64-linux-gnu/libopenblas.so  
-- Looking for cheev_
-- Looking for cheev_ - found
-- A library with LAPACK API found.
-- A library with LAPACK API found.
-- Found Boost: /usr/include (found suitable version "1.65.1", minimum required is "1.65")  
-- Found MPI_C: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so (found suitable version "3.1", minimum required is "3.0") 
-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found suitable version "3.1", minimum required is "3.0") 
-- Found MPI: TRUE (found suitable version "3.1", minimum required is "3.0")  
-- Found Boost: /usr/include (found suitable version "1.65.1", minimum required is "1.65.0") found components: mpi serialization filesystem system unit_test_framework 
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.13") found components: doxygen dot 
-- Found Sphinx: /usr/bin/python3;-m;sphinx (found suitable version "1.6.7", minimum required is "1.6.6") 
-- The following features have been enabled:

 * HDF5 (required version >= 1.8), parallel

-- The following OPTIONAL packages have been found:

 * FFTW3
 * HDF5 (required version >= 1.8), parallel
 * PkgConfig
 * GSL
 * Doxygen
 * Sphinx (required version >= 1.6.6)
 * Git

-- The following REQUIRED packages have been found:

 * CUDA (required version >= 9.0)
 * CUDACompilerNVCC (required version >= 9.0)
 * PythonInterp (required version >= 3.5)
 * Cython (required version >= 0.26)
 * PythonHeaders
 * NumPy
 * BLAS
 * Threads
 * LAPACK
 * MPI (required version >= 3.0)
 * Boost (required version >= 1.65.0)

-- Configuring done
-- Generating done
-- Build files have been written to: /ssd/fweik/espresso/cmake-build-release

[Finished]

The GPU is an NVIDIA GeForce RTX 2080.

@jngrad
Member

jngrad commented Jul 27, 2020

Already mentioned in #3445 (comment), I forgot to open a ticket. IIRC it works by tweaking the CUDA include paths, but can't remember the exact procedure. It probably involved my custom CUDA library with the patched thrust version. SD on GPU is experimental until this is sorted out.

STOKESIAN_DYNAMICS_GPU requires EXPERIMENTAL_FEATURES

@fweik
Contributor Author

fweik commented Jul 27, 2020

If it is experimental it should be opt-in and not turned on by default in maxset.

@fweik
Contributor Author

fweik commented Jul 28, 2020

Also related, why is STOKESIAN_DYNAMICS configured via a cmake option, and not via myconfig?

@jngrad
Member

jngrad commented Jul 28, 2020

  • maxset enables EXPERIMENTAL_FEATURES
  • if @hmenke doesn't find a fix by the time we release 4.2.0, we will have to reconsider including the SD GPU code in 4.2.0
  • CMake doesn't read the myconfig.hpp file to avoid introducing coupling; CMake only writes external features to the file (didn't we agree on that solution in the June 8th core dev meeting?)

@fweik
Contributor Author

fweik commented Jul 28, 2020

I can't see why this should be treated differently. I think a better solution would be to have an external feature for the library,
and regular espresso features, configured in the usual way, that require the external features. Otherwise there is no way of
disabling the feature without meddling with the build system?

@mkuron
Member

mkuron commented Jul 28, 2020

> why is STOKESIAN_DYNAMICS configured via a cmake option, and not via myconfig?

@jngrad decided to have WITH_STOKESIAN_DYNAMICS instead of WITH_LAPACK and WITH_BLAS. The latter would have been more consistent with e.g. WITH_HDF5 and WITH_CUDA, but SD is the only feature that needs LAPACK/BLAS.

@hmenke
Member

hmenke commented Jul 28, 2020

git blame gives 7078940

@mkuron
Member

mkuron commented Jul 28, 2020

> git blame gives 7078940

The question is why the result differs between a GTX 1080 and an RTX 2080. Is it just slightly different numerical precision with an error bound that is too tight, or is it an actual bug?

@jngrad
Member

jngrad commented Jul 28, 2020

This commit is quite long, and some of the changes (e.g. the time step) were reverted later. I don't think exploring this commit further would help. On my workstation (GeForce RTX 2080), I get the GPU SD tests to pass with CUDA 10.0 patched with thrust 1.9.5, using either nvcc or clang9 as the compiler. The test fails with the default installation of CUDA 10.0, which ships thrust 1.9.3.

@RudolfWeeber
Contributor

My take on the offline discussion at the ESPResSo meeting:
We can only live with the failing test very temporarily, since it is confusing and distracting to people doing unrelated work on Espresso.
Unless a solution is found quickly, we will have to take out GPU support from the development branch.

Personally, I think this has to converge by mid-August. In the weeks around the Espresso school, new users will potentially download the code and run the test suite. By that time, the issue needs to be gone one way or the other.

@hmenke
Member

hmenke commented Jul 29, 2020

Well, I initially suggested pinning the Thrust version by including it as a submodule which would have also saved many hours that were wasted on the thrust_wrapper shenanigans, but that was immediately shot down because of “muh dependencies”.

@mkuron
Member

mkuron commented Jul 29, 2020

> I get the GPU SD tests to pass with CUDA 10.0 patched with thrust 1.9.5, using either nvcc or clang9 as the compiler. The test fails with the default installation of CUDA 10.0, which ships thrust 1.9.3.

Good find. Should be easy enough to bisect NVIDIA/thrust@1.9.3...1.9.5 to find the cause.

@mkuron
Member

mkuron commented Jul 29, 2020

> pinning the Thrust version by including it as a submodule

That wouldn't have worked as Thrust and CUDA versions are not arbitrarily upward and downward compatible.

> bisect NVIDIA/thrust@1.9.3...1.9.5

CUDA 9.1.85 with Thrust 1.9.2, 1.9.3, 1.9.5 and 1.9.7 exhibits the error.
CUDA 10.0.130 with Thrust 1.9.2, 1.9.3 and 1.9.5 does not.
CUDA 10.0.130 even works with the Thrust version it ships with (which I guess is some pre-release version of 1.9.3).
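
To confirm which Thrust version a given toolkit really ships, one option is to read THRUST_VERSION from its thrust/version.h, which encodes the version as major*100000 + minor*100 + subminor. A minimal sketch follows; the function names and the example include path are assumptions for illustration, not something used in this thread:

```shell
#!/bin/bash
# Decode Thrust's integer THRUST_VERSION (major*100000 + minor*100 + subminor)
# into the usual dotted form.
decode_thrust_version() {
  local v=$1
  echo "$((v / 100000)).$((v / 100 % 1000)).$((v % 100))"
}

# Read THRUST_VERSION from the headers below an include directory.
# The directory is passed in by the caller; no path is assumed here.
thrust_version_of() {
  local header="$1/thrust/version.h"
  local raw
  raw=$(awk '$1 == "#define" && $2 == "THRUST_VERSION" { print $3 }' "$header") || return 1
  [ -n "$raw" ] || return 1
  decode_thrust_version "$raw"
}

# Example call (the path is an assumption; adapt it to the local installation):
#   thrust_version_of /usr/local/cuda-10.0/include
```

Running it against both the toolkit's own include directory and the patched Thrust checkout would show which headers a build actually picks up.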

Here is my git-bisect-compatible build script:

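#!/bin/bash
# Exit codes follow the `git bisect run` convention: 125 skips the commit,
# values above 127 (here 192) abort the bisection.

# Start from an empty build directory
cd ~/Documents/espresso/build || exit 192
rm -rf * || exit 192
# Configure against CUDA 10.0 and add the Thrust checkout to the include path
cmake .. \
  -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.0 -DCUDA_NVCC_EXECUTABLE=/usr/local/cuda-10.0/bin/nvcc \
  -DWITH_CUDA=ON -DCMAKE_CXX_FLAGS="-I/tikhome/mkuron/Documents/espresso/thrust" \
  -DCUDA_NVCC_FLAGS="-I/tikhome/mkuron/Documents/espresso/thrust" || exit 192
cp ../maintainer/configs/maxset.hpp myconfig.hpp || exit 192
cmake . || exit 192
# A commit that does not build cannot be judged, so skip it
make -j 24 VERBOSE=1 || exit 125
make python_test_data || exit 192

# The exit status of the test is the verdict for this commit
./pypresso ../testsuite/python/stokesian_dynamics_gpu.py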
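
For readers unfamiliar with the exit codes in the script above: `git bisect run` treats status 0 as good, 125 as "skip this commit", any other value from 1 to 127 as bad, and anything else as a request to abort. A small sketch of that mapping; the helper name classify_bisect_exit and the script path in the usage comment are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: report how `git bisect run` interprets an exit status.
classify_bisect_exit() {
  local code=$1
  if [ "$code" -eq 0 ]; then
    echo good
  elif [ "$code" -eq 125 ]; then
    echo skip
  elif [ "$code" -ge 1 ] && [ "$code" -le 127 ]; then
    echo bad
  else
    echo abort
  fi
}

# Assumed usage, run inside the repository being bisected:
#   git bisect start
#   git bisect bad  <known-bad-rev>
#   git bisect good <known-good-rev>
#   git bisect run /path/to/build_and_test.sh
```

With this mapping, a failed configure step (192) stops the bisection, a failed build (125) skips the commit, and only the test result itself marks a commit good or bad.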

The way I see it, it's not related to Thrust version, but CUDA version. Disable building Stokesian Dynamics on CUDA less than 10 and we're done.

@jngrad
Member

jngrad commented Jul 29, 2020

> The test fails with the default installation of CUDA 10.0, which ships thrust 1.9.3.

I need to amend this statement: there is no error with CUDA 10.0. I misread my CMake log output. The test fails with CUDA 9, either 9.0 from the ICP or CUDA 9.1.85 from the docker image, when the hardware is a GeForce RTX 2080.

> The way I see it, it's not related to Thrust version, but CUDA version.

Indeed, thrust is likely not at fault here. However, it depends also on the hardware. The test runs fine in CI on CUDA 9.

> Disable building Stokesian Dynamics on CUDA less than 10 and we're done.

This requires making the CMake logic for STOKESIAN_DYNAMICS and STOKESIAN_DYNAMICS_GPU even more convoluted than it already is. I don't think we can selectively disable SD GPU if CUDA is 9, because we need to be able to generate a warning if the user explicitly requested -DWITH_STOKESIAN_DYNAMICS=ON. I have already invested enough time in designing the current CMake logic for SD and got it approved by the core team. I won't be able to invest time in designing a new one.

Disabling SD GPU for CUDA 9 is also problematic, because we have committed ourselves to supporting CUDA 9 in bugfix releases until October 2021 (target release date for espresso 4.3). We already disabled SD GPU on ROCm in CI because it doesn't run, and disabled the diffusion test on all GPU jobs due to timeouts. Other GPU features of espresso do not have this special treatment.

@mkuron
Member

mkuron commented Jul 29, 2020

> The test runs fine in CI on CUDA 9.

Ah, I forgot about that. So in summary, it only fails when you use CUDA 9 with a Turing GPU. According to Nvidia's documentation, newer GPUs support code compiled with older CUDA versions (Turing was released after CUDA 9.1), but here they clearly don't. We've actually had a similar case before (#1412 (comment)), but the trick employed back then (-gencode=arch=compute_70,code=sm_70) does not help here.

I've tried comparing the generated PTX, but that's futile because so much changed between versions.

/usr/local/cuda-10.0/bin/nvcc _deps/stokesian_dynamics-src/src/sd_gpu.cu --ptx -o ptx10.txt -Dsd_gpu_EXPORTS -DSD_USE_THRUST  -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++14 -O3 -I/usr/include -I_deps/stokesian_dynamics-src/include
/usr/bin/nvcc _deps/stokesian_dynamics-src/src/sd_gpu.cu --ptx -o ptx9.txt -Dsd_gpu_EXPORTS -DSD_USE_THRUST  -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++14 -O3 -I/usr/include -I_deps/stokesian_dynamics-src/include
diff -u ptx9.txt ptx10.txt 
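
The failing combination summarized above (CUDA older than 10 together with a Turing GPU, i.e. compute capability 7.5) could be detected with a check along these lines. This is only a sketch: the function name and its output strings are made up for illustration, and the thread has confirmed the failure only on a GeForce RTX 2080, so extending it to all Turing cards is an assumption.

```shell
#!/bin/bash
# Hypothetical check: does a CUDA version / compute capability pair match the
# failing combination reported in this thread?
# Arguments: CUDA version (e.g. 9.1) and compute capability (e.g. 7.5).
sd_gpu_combo_status() {
  local cuda_major=${1%%.*}   # "9.1" -> "9"
  local cc_major=${2%%.*}     # "7.5" -> "7"
  local cc_minor=${2#*.}      # "7.5" -> "5"
  # Turing GPUs report compute capability 7.5.
  if [ "$cuda_major" -lt 10 ] && [ "$cc_major" -eq 7 ] && [ "$cc_minor" -eq 5 ]; then
    echo known-bad
  else
    echo no-known-issue
  fi
}
```

For example, `sd_gpu_combo_status 9.1 7.5` corresponds to the RTX 2080 setup from the original report, while `sd_gpu_combo_status 9.1 6.1` corresponds to a GTX 1080.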

kodiakhq bot added a commit that referenced this issue Jul 30, 2020
Disable Stokesian Dynamics on GPU until the build system issue is sorted out and the GPU code passes CI on all platforms, as discussed in the [2020-07-28 ESPResSo meeting](https://github.com/espressomd/espresso/wiki/Espresso-meeting-2020-07-28) to avoid failing espresso builds (#3836) and failing python tests (#3826).
@KaiSzuttor
Member

why is all the SD dependency management happening on our side? This should be handled by the cmake of the external library.
