This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

CUDA 10 w/ cuDNN 7.5 Support #14652

Closed
perdasilva opened this issue Apr 9, 2019 · 15 comments

Comments

@perdasilva
Contributor

Description

Currently, the CI tests fail when running mxnet on top of CUDA 10 and cuDNN 7.5 as demonstrated in this PR.

The tests pass when using CUDA 10 and cuDNN 7.3.1.20, as demonstrated in this PR.

Environment info (Required)

g3.8xlarge with CUDA 10 and nvidia driver 410.73 installed.
The code is running inside the CI GPU container based on nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04.

Error Message:

Usually: src/operator/./cudnn_rnn-inl.h:759: Check failed: e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH

Here are some example logs:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/
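
CUDNN_STATUS_ARCH_MISMATCH indicates that the cuDNN routine being called requires a feature the current GPU architecture does not provide; the Tesla M60s in a g3.8xlarge are Maxwell parts (compute capability 5.2). As a quick way to confirm the compute capability of the devices a job ran on, here is a minimal sketch (assuming libcudart.so from the CUDA toolkit is on the loader path, e.g. LD_LIBRARY_PATH includes /usr/local/cuda/lib64; the attribute IDs 75/76 are cudaDevAttrComputeCapabilityMajor/Minor):

python3 - <<'EOF'
import ctypes

# Query the CUDA runtime directly so no extra Python packages are needed
cudart = ctypes.CDLL('libcudart.so')
count = ctypes.c_int()
cudart.cudaGetDeviceCount(ctypes.byref(count))
for dev in range(count.value):
    major, minor = ctypes.c_int(), ctypes.c_int()
    cudart.cudaDeviceGetAttribute(ctypes.byref(major), 75, dev)  # cudaDevAttrComputeCapabilityMajor
    cudart.cudaDeviceGetAttribute(ctypes.byref(minor), 76, dev)  # cudaDevAttrComputeCapabilityMinor
    print('GPU %d: compute capability %d.%d' % (dev, major.value, minor.value))
EOF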

Steps to reproduce

# Launch g3.8xlarge instance with ubuntu 16.04

# ==-_-==-_-== Environment Setup ==-_-==-_-==

sudo apt update
sudo apt-get install -y \
    apt-transport-https \
    build-essential \
    ca-certificates \
    curl \
    git \
    libatlas-base-dev \
    libcurl4-openssl-dev \
    libjemalloc-dev \
    libhdf5-dev \
    liblapack-dev \
    libopenblas-dev \
    libopencv-dev \
    libturbojpeg \
    libzmq3-dev \
    ninja-build \
    software-properties-common \
    sudo \
    unzip \
    wget

sudo apt-get install -y python-dev python3-dev virtualenv wget

# the version of pip shipped with Ubuntu may be too old, so install a recent version here
wget -nv https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
sudo python2 get-pip.py

pip2 install --user nose cpplint==1.3.0 pylint==1.9.3 'numpy<=1.15.2,>=1.8.2' nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3
pip3 install --user nose cpplint==1.3.0 pylint==2.1.1 'numpy<=1.15.2,>=1.8.2' nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3

# ==-_-==-_-== CUDA Installation ==-_-==-_-==

wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
chmod +x cuda_10.0.130_410.48_linux && sudo ./cuda_10.0.130_410.48_linux

# Installation excerpt:
# Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
# (y)es/(n)o/(q)uit: y
# 
# Do you want to install the OpenGL libraries?
# (y)es/(n)o/(q)uit [ default is yes ]:
#
# Do you want to run nvidia-xconfig?
# This will update the system X configuration file so that the NVIDIA X driver
# is used. The pre-existing X configuration file will be backed up.
# This option should not be used on systems that require a custom
# X configuration, such as systems with multiple GPU vendors.
# (y)es/(n)o/(q)uit [ default is no ]:
# 
# Install the CUDA 10.0 Toolkit?
# (y)es/(n)o/(q)uit: y
#
# Enter Toolkit Location
# [ default is /usr/local/cuda-10.0 ]:
#
# Do you want to install a symbolic link at /usr/local/cuda?
# (y)es/(n)o/(q)uit: y
#
# Install the CUDA 10.0 Samples?
# (y)es/(n)o/(q)uit: n

# Set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}

# Check installation
nvidia-smi

# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
# |===============================+======================+======================|
# |   0  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
# | N/A   31C    P0    43W / 150W |      0MiB /  7618MiB |      0%      Default |
# +-------------------------------+----------------------+----------------------+
# |   1  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
# | N/A   34C    P0    41W / 150W |      0MiB /  7618MiB |     99%      Default |
# +-------------------------------+----------------------+----------------------+
#
# +-----------------------------------------------------------------------------+
# | Processes:                                                       GPU Memory |
# |  GPU       PID   Type   Process name                             Usage      |
# |=============================================================================|
# |  No running processes found                                                 |
# +-----------------------------------------------------------------------------+

# ==-_-==-_-== Setup cuDNN ==-_-==-_-==

# https://developer.nvidia.com/rdp/cudnn-download
# Register with NVIDIA and download cudnn-10.0-linux-x64-v7.5.0.56.tgz
# scp it to your instance
# https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
tar -xvzf cudnn-10.0-linux-x64-v7.5.0.56.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
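
# (optional) sanity check - the copied header should report cuDNN 7.5.0,
# i.e. CUDNN_MAJOR 7, CUDNN_MINOR 5, CUDNN_PATCHLEVEL 0
grep -A 2 '#define CUDNN_MAJOR' /usr/local/cuda/include/cudnn.h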

# ==-_-==-_-== Clone MXNet Repo. ==-_-==-_-==
mkdir -p repositories/apache && cd repositories/apache
git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet

# ==-_-==-_-== Compile MXNet ==-_-==-_-==
make \
        DEV=1                                     \
        ENABLE_TESTCOVERAGE=1                     \
        USE_BLAS=openblas                         \
        USE_MKLDNN=0                              \
        USE_CUDA=1                                \
        USE_CUDA_PATH=/usr/local/cuda             \
        USE_CUDNN=1                               \
        USE_CPP_PACKAGE=0                         \
        USE_DIST_KVSTORE=1                        \
        USE_SIGNAL_HANDLER=1                      \
        -j$(nproc)
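
# (optional) sanity check - confirm the freshly built libmxnet loads and can run a GPU op
export PYTHONPATH=./python/
python3 -c "import mxnet as mx; print(mx.__version__); print(mx.nd.ones((2, 2), ctx=mx.gpu(0)).asnumpy())"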

# ==-_-==-_-== Run failing test ==-_-==-_-==
export PYTHONPATH=./python/                                                                                        
nosetests-3.4 --verbose tests/python/gpu/test_gluon_gpu.py:test_rnn_layers_fp16

# Error excerpt:
# ======================================================================
# ERROR: test_gluon_gpu.test_rnn_layers_fp16
# ----------------------------------------------------------------------
# Traceback (most recent call last):
#   File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
#     self.test(*self.arg)
#   File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
#     return func(*arg, **kw)
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/common.py", line 110, in test_new
#     orig_test(*args, **kwargs)
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 545, in test_rnn_layers_fp16
#     run_rnn_layers('float16', 'float32', mx.gpu())
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 479, in run_rnn_layers
#     check_rnn_layer_forward(gluon.rnn.RNN(10, 2, dtype=dtype), mx.nd.ones((8, 3, 20), dtype=dtype), ctx=ctx)
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/tests/python/gpu/../unittest/test_gluon_rnn.py", line 451, in check_rnn_layer_forward
#     np_out = out.asnumpy()
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1995, in asnumpy
#     ctypes.c_size_t(data.size)))
#   File "/home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
#     raise MXNetError(py_str(_LIB.MXGetLastError()))
# mxnet.base.MXNetError: [07:41:30] src/operator/./cudnn_rnn-inl.h:759: Check failed: e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH
# 
# Stack trace returned 10 entries:
# [bt] (0) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1c7) [0x7fe8ec2eebd7]
# [bt] (1) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7fe8ec2ef082]
# [bt] (2) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNRNNOp<mshadow::half::half_t>::Init(mshadow::Stream<mshadow::gpu>*, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x333c) [0x7fe8f36f8afc]
# [bt] (3) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNRNNOp<mshadow::half::half_t>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x1501) [0x7fe8f3700c61]
# [bt] (4) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x48b) [0x7fe8ef82dd5b]
# [bt] (5) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::LegacyOpForward(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x18) [0x7fe8ef820838]
# [bt] (6) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&), void (*)(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)>::_M_invoke(std::_Any_data const&, mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x20) [0x7fe8ef5d9250]
# [bt] (7) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x2e8) [0x7fe8ef8d7e88]
# [bt] (8) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x25) [0x7fe8ef8d8215]
# [bt] (9) /home/ubuntu/repositories/apache/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x5f9056e) [0x7fe8f02a656e]
# 
# 
# -------------------- >> begin captured logging << --------------------
# common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1716277661 to reproduce.
# --------------------- >> end captured logging << ---------------------
# 
# ----------------------------------------------------------------------
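
# For quicker iteration than the full nose run, the failing call can be distilled to a
# few lines based on check_rnn_layer_forward in the traceback above (a sketch only, not
# part of the test suite):
python3 - <<'EOF'
import mxnet as mx
from mxnet import gluon

# fp16 RNN forward on the GPU - the call path that hits CUDNN_STATUS_ARCH_MISMATCH above
layer = gluon.rnn.RNN(10, 2, dtype='float16')
layer.initialize(ctx=mx.gpu(0))
out = layer(mx.nd.ones((8, 3, 20), dtype='float16', ctx=mx.gpu(0)))
print(out.asnumpy().shape)
EOF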
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Cuda

@perdasilva
Contributor Author

My suggestion would be to merge the PR with cuDNN v7.3.1.20 (which would at least ensure that mxnet works with cuDNN up to this version); then whoever tackles the v7.5 issue can just update the CI image to use the latest version of cuDNN.

@sl1pkn07
Contributor

sl1pkn07 commented Apr 9, 2019

Just so you know, the latest version of CUDA is 10.1.105_418.39 and the latest cuDNN is 10.1-linux-x64-v7.5.0.56.

Why not use those versions?

Greetings

@perdasilva
Contributor Author

Hey,

This requires a bit more work on the AMI side, and I'm also not convinced that it will solve the problem.
Once we get on cuDNN 7.5, we can look at updating the AMIs to CUDA 10.1 and then bumping the CI images.

Cheers

@stu1130
Contributor

stu1130 commented Apr 9, 2019

Hey @perdasilva
I can tackle the update from v7.3.1.20 to v7.5 and then bump up to CUDA 10.1, if you decide to merge the PR first.
Thanks

@vrakesh
Contributor

vrakesh commented Apr 9, 2019

@mxnet-label-bot add [CUDA, CI]

@perdasilva
Contributor Author

@stu1130 thank you. It seems that the NVIDIA drivers on the Linux nodes have been bumped to 418 because of the TensorRT issues. This means we should be able to use CUDA 10.1 =) (let me know if it doesn't work)

@stu1130
Contributor

stu1130 commented Apr 29, 2019

@perdasilva any updates? Thanks

@perdasilva
Contributor Author

@stu1130 I'm currently on leave until Thursday. I totally missed that you wanted me to merge the other PR first; I'm sorry about that. I will do it as soon as I'm back. I'll also see about bumping CI to 10.1 while I'm at it - then that's done.

@stu1130
Contributor

stu1130 commented Apr 30, 2019

@perdasilva no rush! Thanks a lot for this awesome job!!!

@perdasilva
Contributor Author

@stu1130 There's no cuDNN 7.3 package for CUDA 10.1, so I won't be able to update CI to 10.1 in my PR.
I've just done a rebase and I'm putting it through CI =D I'll let you know once it's through.

@perdasilva
Contributor Author

@stu1130 it's been merged! Feel free to take it away and let me know if I can help you =)

@stu1130
Contributor

stu1130 commented May 2, 2019

@perdasilva Awesome Thanks a lot!!!

@stu1130
Contributor

stu1130 commented May 9, 2019

Here is what I found:

  1. The unit test fails on cuDNN 7.5.0 and 7.5.1 but works perfectly on 7.4.2 and 7.3.1. Using a Tesla V100 resolves the problem, i.e. it works fine on cuDNN 7.3.1, 7.5.0, and 7.5.1.
  2. The function that causes the error is cudnnGetRNNWorkspaceSize, here:
    https://github.com/apache/incubator-mxnet/blob/874fb89cd33b0e4affd7f3fb1b4ae4e09f25ef84/src/operator/rnn-inl.h#L1369
  3. I tried CUDA 10.1 with the latest CUDA driver (418.67); it still doesn't work.

@perdasilva
Contributor Author

This has since been fixed ^^ thanks to @stu1130
