[Test Failure] GPU Test failures across different CUDA versions #14502
Comments
Hey, this is the MXNet Label Bot.
@mxnet-label-bot add [CUDA, Docker, Bug]
Perhaps the reason CI isn't reporting any issues is that both the cuDNN version we set in the environment and the cuDNN version set by the image are lower than what the test requires. As it stands, the test only verifies that an error is raised, so I don't know whether (m)any of the cuDNN functions are actually being exercised right now. Bugs may be slipping through.
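For context, here is a minimal sketch of the gating pattern described above, assuming the test reads the CUDNN_VERSION environment variable that the nvidia/cuda base images export. The helper names (installed_cudnn, run_cudnn_op), the version threshold, and the choice of an LSTM-with-projection as the cuDNN-backed op are illustrative assumptions, not the MXNet test suite's actual code:

```python
# Hedged sketch of the gating pattern described above; names such as
# REQUIRED_CUDNN and run_cudnn_op are illustrative, not MXNet test-suite API.
import os
import unittest

import mxnet as mx

REQUIRED_CUDNN = 7200  # assume the feature under test needs cuDNN >= 7.2.0


def installed_cudnn():
    """Parse the CUDNN_VERSION env var exported by the nvidia/cuda base images
    (e.g. '7.4.2.24') into an integer such as 7402 for easy comparison."""
    major, minor, patch = (os.environ.get('CUDNN_VERSION', '0.0.0').split('.') + ['0'])[:3]
    return int(major) * 1000 + int(minor) * 100 + int(patch)


def run_cudnn_op():
    """Run a cuDNN-backed operation on the GPU; an LSTM with a projection
    layer is used here because it needs a relatively recent cuDNN."""
    layer = mx.gluon.rnn.LSTM(hidden_size=10, projection_size=5)
    layer.initialize(ctx=mx.gpu(0))
    layer(mx.nd.ones((3, 2, 4), ctx=mx.gpu(0)))
    mx.nd.waitall()


class TestCudnnOp(unittest.TestCase):
    def test_op(self):
        if installed_cudnn() < REQUIRED_CUDNN:
            # Old cuDNN in the container: all we can check is that MXNet
            # raises, so the real cuDNN code path never runs in CI.
            with self.assertRaises(mx.base.MXNetError):
                run_cudnn_op()
        else:
            # New enough cuDNN: the functional path is actually exercised.
            run_cudnn_op()


if __name__ == '__main__':
    unittest.main()
```

If the container ships an older cuDNN than the threshold, only the error branch ever executes, which matches the concern above that bugs could slip through unnoticed.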
Created a PR (#14513) updating CI to use CUDA v10.0 and to use the CUDNN_VERSION environment variable value set by the base container, to see if I can catch this issue in CI.
I've figured out why the test_lstmp test was failing and have submitted a PR (#14529) to fix that issue. I have also reached out to @DickJC123 to see if his CUDA expertise can help with the remaining errors. Here are some Jenkins logs:
Closing this issue in favor of #14652 |
Description
I am testing the MXNet library compiled for the Python distribution against different versions of CUDA.
I'm getting strange failures on all CUDA versions. The tests are run on a g3.8xlarge instance, within a Docker container based on nvidia/cuda:XXX-cudnn7-devel-ubuntu16.04 (where XXX is the particular CUDA version).
I have not yet tried to reproduce it separately, outside of Docker, on a GPU machine using the current pip package for 1.4.0 or the nightly build; it could be that it's only an issue on master.
I find it strange that the PRs aren't breaking, since they seem to be based on the same Docker image I'm using and run on the same instance type.
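For reference, a minimal sanity check of the sort one might run inside such a container before the full test suite (this snippet is not from the issue; it just confirms MXNet sees the GPU and that a simple cuDNN-backed convolution completes without raising):

```python
# Hedged sketch: a quick GPU/cuDNN sanity check inside the container.
import mxnet as mx

print('MXNet version :', mx.__version__)
print('GPUs detected :', mx.context.num_gpus())

# A small convolution exercises a cuDNN code path; waitall() forces the
# asynchronous GPU work to finish so any CUDA/cuDNN error surfaces here.
x = mx.nd.random.uniform(shape=(1, 3, 32, 32), ctx=mx.gpu(0))
conv = mx.gluon.nn.Conv2D(channels=8, kernel_size=3)
conv.initialize(ctx=mx.gpu(0))
y = conv(x)
mx.nd.waitall()
print('Conv output   :', y.shape)
```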