This repository has been archived by the owner on Jan 7, 2025. It is now read-only.

Digits 5.1 DetectNet training error for CUDA 8.0 on Ubuntu 14.04 backed by GTX-1080 #1186

Closed
xhuvom opened this issue Oct 19, 2016 · 6 comments

@xhuvom

xhuvom commented Oct 19, 2016

I have installed DIGITS (v5.1-dev) as described and am running a local devserver on Ubuntu 14.04 with a GTX 1080. Caffe (version 0.15.14) builds fine with the CUDA compilation tools (release 8.0, V8.0.44) and cuDNN (ver. 5.1.5), and the NVIDIA driver (version 367.44) is installed properly. However, training a DetectNet model stops abruptly with the following error:

    Initialized at 10:49:11 PM (1 second)
    Running at 10:49:12 PM (38 seconds)
    Error at 10:49:51 PM
    (Total - 40 seconds)

ERROR: Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE

This network produces output mAP
This network produces output precision
This network produces output recall
Network initialization done.
Solver scaffolding done.
Starting Optimization
Solving
Learning Rate Policy: step
Iteration 0, Testing net (#0)
Ignoring source layer train_data
Ignoring source layer train_label
Ignoring source layer train_transform
Data layer prefetch queue empty
Data layer prefetch queue empty
Test net output #0: loss_bbox = 18.0172 (* 2 = 36.0345 loss)
Test net output #1: loss_coverage = 163.255 (* 1 = 163.255 loss)
Test net output #2: mAP = 0
Test net output #3: precision = 0
Test net output #4: recall = 0
Check failed: status == CURAND_STATUS_

My Python version is 2.7.6 and my Caffe version is 0.15.14. The Caffe Makefile.config is set as follows:

    USE_CUDNN := 1
    PYTHON_LIB := /usr/lib

I went back to Ubuntu 14.04 because there is no solid official documentation for Ubuntu 16.04 and CUDA 8.0 Pascal support is unavailable. How can I run a proper training job in DIGITS on my machine? Should I go back to CUDA 8.0 RC, or try something else? Any suggestions would be appreciated. Thanks in advance.

@xhuvom
Author

xhuvom commented Oct 20, 2016

@Gnurou @tmjbradley @swarren @jholewinski @cubanismo @3XX0 @flx42 @lukeyeager @all_

@flx42
Member

flx42 commented Oct 20, 2016

Please don't tag random developers from the NVIDIA organization; it won't get you an answer any faster.

@lukeyeager
Member

Can you please give me the following information:

  1. Which packages have you installed?
# for example, this is what comes installed on the nvidia/digits docker image
$ dpkg -l | egrep 'digits|caffe|libcudnn|libnccl|cudart|nvidia'
ii  caffe-nv                           0.15.13-1+cuda7.5                       amd64        Fast open framework for Deep Learning
ii  caffe-nv-tools                     0.15.13-1+cuda7.5                       amd64        Fast open framework for Deep Learning (Tools)
ii  cuda-cudart-7-5                    7.5-18                                  amd64        CUDA Runtime native Libraries
ii  digits                             4.0.0-1                                 amd64        NVIDIA DIGITS webserver
ii  libcaffe-nv0                       0.15.13-1+cuda7.5                       amd64        Fast open framework for Deep Learning (Libs)
ii  libcudnn5                          5.1.3-1+cuda7.5                         amd64        cuDNN runtime libraries
ii  libnccl1                           1.2.3-1+cuda7.5                         amd64        NVIDIA Collectives Communication Library (NCCL) Runtime
ii  python-caffe-nv                    0.15.13-1+cuda7.5                       amd64        Fast open framework for Deep Learning (Python)
  2. What does nvidia-smi show?
$ nvidia-smi
Thu Oct 20 09:46:46 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 0000:01:00.0      On |                  Off |
| 38%   36C    P8    17W / 150W |    547MiB /  8120MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:02:00.0     Off |                  N/A |
| 22%   33C    P8    16W / 250W |    116MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1650    G   /usr/lib/xorg/Xorg                             298MiB |
|    0      2664    G   compiz                                         242MiB |
|    0      4312    G   /usr/lib/firefox/firefox                         2MiB |
|    1     31049    C   /usr/bin/python                                112MiB |
+-----------------------------------------------------------------------------+
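
The package query above is mainly useful for spotting packages built against different CUDA releases. As a rough sketch only (not part of DIGITS or Caffe), the same check can be automated by parsing `dpkg -l` output. The function names `cuda_version_of` and `cuda_versions_in_use` are hypothetical, and the patterns only cover the naming styles that appear in this thread (`cuda-cudart-8-0`, `libcudart7.5`, and versions ending in `+cuda7.5`); packages such as `nvidia-cuda-toolkit`, whose name carries no CUDA version, are not recognized.

```python
import re

# Name styles seen in this thread that embed a CUDA release.
NAME_PATTERNS = [
    re.compile(r"^cuda-[a-z-]+-(\d+)-(\d+)$"),  # e.g. cuda-cudart-8-0
    re.compile(r"^libcudart(\d+)\.(\d+)$"),     # e.g. libcudart7.5
]
# Version suffix used by the NVIDIA-built packages, e.g. 0.15.13-1+cuda7.5
VERSION_PATTERN = re.compile(r"\+cuda(\d+)\.(\d+)")


def cuda_version_of(name, version):
    """Return 'major.minor' if the package reveals a CUDA release, else None."""
    for pattern in NAME_PATTERNS:
        match = pattern.match(name)
        if match:
            return "%s.%s" % match.groups()
    match = VERSION_PATTERN.search(version)
    return "%s.%s" % match.groups() if match else None


def cuda_versions_in_use(dpkg_output):
    """Group installed package names by the CUDA release they point to."""
    found = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Only fully installed packages (status "ii") are considered.
        if len(fields) < 3 or fields[0] != "ii":
            continue
        name = fields[1].split(":")[0]  # drop an ":amd64" style suffix
        release = cuda_version_of(name, fields[2])
        if release is not None:
            found.setdefault(release, []).append(name)
    return found
```

Feeding it the text printed by the `dpkg -l | egrep ...` command and getting back more than one key (for example both `7.5` and `8.0`) would suggest a mixed installation worth cleaning up before rebuilding Caffe.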

@xhuvom
Author

xhuvom commented Oct 20, 2016

Running that query actually solved the problem: my setup was a mix of CUDA 7.5 and CUDA 8.0 installations. Purging the CUDA 8.0 setup fixed it, and training now works like a charm.
My current query returns:
1.

ii  cuda-cudart-8-0                                       8.0.44-1                                            amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-8-0                                   8.0.44-1                                            amd64        CUDA Runtime native dev links, headers
ii  nvidia-367                                            367.57-0ubuntu0~gpu14.04.1                          amd64        NVIDIA binary driver - version 367.57
ii  nvidia-367-dev                                        367.57-0ubuntu0~gpu14.04.1                          amd64        NVIDIA binary Xorg driver development files
ii  nvidia-modprobe                                       367.48-0ubuntu1                                     amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-367                                 367.57-0ubuntu0~gpu14.04.1                          amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                          0.6.2                                               amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                       370.28-0ubuntu0~gpu14.04.1                          amd64        Tool for configuring the NVIDIA graphics driver
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 0000:01:00.0      On |                  N/A |
| 45%   77C    P2   193W / 270W |   5818MiB /  8110MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1450    G   /usr/lib/xorg/Xorg                             154MiB |
|    0      2051    G   compiz                                          79MiB |
|    0      9245    C   python                                         113MiB |
|    0      9391    C   /home/xhuv/caffe/build/tools/caffe            5467MiB |
+-----------------------------------------------------------------------------+

Thanks for your helpful concern. :)

@xmyqsh

xmyqsh commented Oct 27, 2016

I solved this problem another way.
Simply removing the cuda-7.5 path from Makefile.config is enough.
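
To find which lines of Makefile.config still point at the old toolkit, something like the following minimal sketch could help. The helper name `lines_referencing_cuda` is hypothetical, and the sketch assumes the stale path is spelled `cuda-7.5` (as in `/usr/local/cuda-7.5`); it skips anything after a `#` so commented-out settings are not reported.

```python
import re


def lines_referencing_cuda(config_text, unwanted="7.5"):
    """Return (line_number, line) pairs for active lines mentioning cuda-<unwanted>."""
    # The lookahead keeps "cuda-7.5" from matching inside e.g. "cuda-7.50".
    needle = re.compile(r"cuda-" + re.escape(unwanted) + r"(?![\d.])")
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        active = line.split("#", 1)[0]  # ignore Makefile comments
        if needle.search(active):
            hits.append((number, line.strip()))
    return hits
```

Calling it with the contents of Makefile.config returns the offending lines with their line numbers; an empty list means no active setting references that path.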

#---------------------------------

@lukeyeager @xhuvom I have a similar problem when I run make runtest:
curandCreateGenerator and curandSetPseudoRandomGeneratorSeed do not return CURAND_STATUS_SUCCESS.

Ubuntu 16.04.04, GTX-1080, CUDA 8.0
Did my cuda-8.0 setup fail?
Here is my output:

ii  libcudart7.5:amd64                         7.5.18-0ubuntu1                               amd64        NVIDIA CUDA Runtime Library
ii  nvidia-367                                 367.44-0ubuntu0.16.04.2                       amd64        NVIDIA binary driver - version 367.44
ii  nvidia-cuda-dev                            7.5.18-0ubuntu1                               amd64        NVIDIA CUDA development files
ii  nvidia-cuda-doc                            7.5.18-0ubuntu1                               all          NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                            7.5.18-0ubuntu1                               amd64        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit                        7.5.18-0ubuntu1                               amd64        NVIDIA CUDA development toolkit
ii  nvidia-opencl-dev:amd64                    7.5.18-0ubuntu1                               amd64        NVIDIA OpenCL development files
ii  nvidia-opencl-icd-367                      367.44-0ubuntu0.16.04.2                       amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                               0.8.2                                         amd64        Tools to enable NVIDIA's Prime
ii  nvidia-profiler                            7.5.18-0ubuntu1                               amd64        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                            370.28-0ubuntu0~gpu16.04.1                    amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-visual-profiler                     7.5.18-0ubuntu1                               amd64        NVIDIA Visual Profiler for CUDA and OpenCL
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.44                 Driver Version: 367.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
|  0%   29C    P8     9W / 180W |    209MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       981    G   /usr/lib/xorg/Xorg                             110MiB |
|    0      1426    G   compiz                                          96MiB |
+-----------------------------------------------------------------------------+

@lukeyeager
Member

You clearly don't have CUDA 8.0 installed; you have CUDA 7.5 (see 7.5.18-0ubuntu1). Also, you're installing Canonical's packages, which aren't supported by NVIDIA (as far as I know).

Please follow these instructions to install CUDA (here's the download site). You'll probably also need to purge all the nvidia-cuda-* packages to make way for the cuda-*-8-0 packages.
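
As an illustration of the purge step, the sketch below builds (but does not run) an `apt-get purge` command from `dpkg -l` output. The function name `purge_command` is hypothetical, and the default pattern `nvidia-cuda-*` is simply the one mentioned above; the resulting command should be reviewed by hand before it is executed.

```python
import fnmatch


def purge_command(dpkg_output, patterns=("nvidia-cuda-*",)):
    """Build an 'apt-get purge' command for installed packages matching patterns."""
    names = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Only fully installed packages (status "ii") are considered.
        if len(fields) < 2 or fields[0] != "ii":
            continue
        name = fields[1].split(":")[0]  # drop an ":amd64" style suffix
        if any(fnmatch.fnmatchcase(name, p) for p in patterns):
            names.append(name)
    if not names:
        return None  # nothing to purge
    return "sudo apt-get purge " + " ".join(sorted(names))
```

For the package list posted above, this would select `nvidia-cuda-dev`, `nvidia-cuda-doc`, `nvidia-cuda-gdb` and `nvidia-cuda-toolkit`, and leave the driver packages such as `nvidia-367` alone.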
