Digits 5.1 DetectNet training error for CUDA 8.0 on Ubuntu 14.04 backed by GTX-1080 #1186

xhuvom · 2016-10-19T17:21:38Z

I have installed DIGITS (v5.1-dev) as described and am running a local devserver on Ubuntu 14.04 with a GTX 1080. Caffe (version 0.15.14)
builds fine with CUDA compilation tools release 8.0, V8.0.44, cuDNN 5.1.5, and the NVIDIA driver (version 367.44) installed properly. But training a DetectNet model stops suddenly with the following error:

    Initialized at 10:49:11 PM (1 second)
    Running at 10:49:12 PM (38 seconds)
    Error at 10:49:51 PM
    (Total - 40 seconds)

ERROR: Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE

This network produces output mAP
This network produces output precision
This network produces output recall
Network initialization done.
Solver scaffolding done.
Starting Optimization
Solving
Learning Rate Policy: step
Iteration 0, Testing net (#0)
Ignoring source layer train_data
Ignoring source layer train_label
Ignoring source layer train_transform
Data layer prefetch queue empty
Data layer prefetch queue empty
Test net output #0: loss_bbox = 18.0172 (* 2 = 36.0345 loss)
Test net output #1: loss_coverage = 163.255 (* 1 = 163.255 loss)
Test net output #2: mAP = 0
Test net output #3: precision = 0
Test net output #4: recall = 0
Check failed: status == CURAND_STATUS_

My Python version is 2.7.6 and my Caffe version is 0.15.14. The Caffe Makefile.config is set as follows:

    USE_CUDNN := 1
    PYTHON_LIB := /usr/lib

I went back to Ubuntu 14.04 because there is no solid official documentation for Ubuntu 16.04 and CUDA 8.0 Pascal support was unavailable. How can I run a proper training job in DIGITS on my machine? Should I go back to CUDA 8.0 RC, or try something else? Any suggestions would be appreciated. Thanks in advance.


xhuvom · 2016-10-20T11:24:04Z

@Gnurou @tmjbradley @swarren @jholewinski @cubanismo @3XX0 @flx42 @lukeyeager @all_

flx42 · 2016-10-20T16:34:11Z

Please don't tag random developers from the NVIDIA organization, it won't get you an answer any faster.

lukeyeager · 2016-10-20T16:47:28Z

Can you please give me the following information:

Which packages have you installed?

# for example, this is what comes installed on the nvidia/digits docker image
$ dpkg -l | egrep 'digits|caffe|libcudnn|libnccl|cudart|nvidia'
ii  caffe-nv                           0.15.13-1+cuda7.5                       amd64        Fast open framework for Deep Learning
ii  caffe-nv-tools                     0.15.13-1+cuda7.5                       amd64        Fast open framework for Deep Learning (Tools)
ii  cuda-cudart-7-5                    7.5-18                                  amd64        CUDA Runtime native Libraries
ii  digits                             4.0.0-1                                 amd64        NVIDIA DIGITS webserver
ii  libcaffe-nv0                       0.15.13-1+cuda7.5                       amd64        Fast open framework for Deep Learning (Libs)
ii  libcudnn5                          5.1.3-1+cuda7.5                         amd64        cuDNN runtime libraries
ii  libnccl1                           1.2.3-1+cuda7.5                         amd64        NVIDIA Collectives Communication Library (NCCL) Runtime
ii  python-caffe-nv                    0.15.13-1+cuda7.5                       amd64        Fast open framework for Deep Learning (Python)

What does nvidia-smi show?

$ nvidia-smi
Thu Oct 20 09:46:46 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 0000:01:00.0      On |                  Off |
| 38%   36C    P8    17W / 150W |    547MiB /  8120MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:02:00.0     Off |                  N/A |
| 22%   33C    P8    16W / 250W |    116MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1650    G   /usr/lib/xorg/Xorg                             298MiB |
|    0      2664    G   compiz                                         242MiB |
|    0      4312    G   /usr/lib/firefox/firefox                         2MiB |
|    1     31049    C   /usr/bin/python                                112MiB |
+-----------------------------------------------------------------------------+

xhuvom · 2016-10-20T20:06:35Z

Running that query actually solved the problem: my setup had both CUDA 7.5 and CUDA 8.0 installed, and the two were mixed up. Purging the CUDA 8.0 setup solved the problem, and training now works like a charm.
My current query returns:

ii  cuda-cudart-8-0                                       8.0.44-1                                            amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-8-0                                   8.0.44-1                                            amd64        CUDA Runtime native dev links, headers
ii  nvidia-367                                            367.57-0ubuntu0~gpu14.04.1                          amd64        NVIDIA binary driver - version 367.57
ii  nvidia-367-dev                                        367.57-0ubuntu0~gpu14.04.1                          amd64        NVIDIA binary Xorg driver development files
ii  nvidia-modprobe                                       367.48-0ubuntu1                                     amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-367                                 367.57-0ubuntu0~gpu14.04.1                          amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                          0.6.2                                               amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                       370.28-0ubuntu0~gpu14.04.1                          amd64        Tool for configuring the NVIDIA graphics driver

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 0000:01:00.0      On |                  N/A |
| 45%   77C    P2   193W / 270W |   5818MiB /  8110MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1450    G   /usr/lib/xorg/Xorg                             154MiB |
|    0      2051    G   compiz                                          79MiB |
|    0      9245    C   python                                         113MiB |
|    0      9391    C   /home/xhuv/caffe/build/tools/caffe            5467MiB |
+-----------------------------------------------------------------------------+

Thanks for your helpful concern. :)
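For anyone else checking for the same kind of mixed installation, here is a minimal sketch that pulls the distinct CUDA runtime versions out of `dpkg -l` output. The function name `cuda_runtime_versions` is hypothetical, and it assumes the runtime packages are named like the ones shown in this thread (`cuda-cudart-7-5`, `cuda-cudart-8-0`, `libcudart7.5`). More than one line of output suggests two runtimes are installed side by side.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print each distinct
# CUDA runtime version found among installed ("ii") runtime packages.
# Assumes package names of the form cuda-cudart-X-Y or libcudartX.Y.
cuda_runtime_versions() {
    awk '$1 == "ii" && $2 ~ /^(cuda-cudart-[0-9]+-[0-9]+|libcudart[0-9.]+)(:.*)?$/ {
        name = $2
        sub(/:.*$/, "", name)           # drop an architecture suffix such as :amd64
        sub(/^cuda-cudart-/, "", name)  # cuda-cudart-8-0 -> 8-0
        sub(/^libcudart/, "", name)     # libcudart7.5    -> 7.5
        gsub(/-/, ".", name)            # 8-0 -> 8.0
        print name
    }' | sort -u
}

# Usage:
#   dpkg -l | cuda_runtime_versions
```

The `-dev` packages are deliberately not matched, so only the runtime libraries themselves are counted.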

xmyqsh · 2016-10-27T05:36:02Z

I solved this problem another way:
just removing the cuda-7.5 path from Makefile.config was enough.
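To see which CUDA paths a Makefile.config actually refers to, something like the sketch below may help. The function name `cuda_lines` is hypothetical; it simply prints every non-comment line that mentions "cuda", with its line number. In Caffe's Makefile.config the toolkit location is normally set through `CUDA_DIR`, but which path appears there on a given machine is an assumption to verify.

```shell
# Hypothetical helper: print every non-comment line of the given file that
# mentions "cuda" (case-insensitive), prefixed with its line number.
cuda_lines() {
    grep -n -i 'cuda' "$1" | grep -v -E '^[0-9]+:[[:space:]]*#'
}

# Usage:
#   cuda_lines Makefile.config
```

Any line that still points at a cuda-7.5 directory while the build is meant to use CUDA 8.0 is a candidate for the fix described above.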

#---------------------------------

@lukeyeager @xhuvom I have a similar problem when I run make runtest:
curandCreateGenerator and curandSetPseudoRandomGeneratorSeed do not return CURAND_STATUS_SUCCESS.

Ubuntu 16.04.04, GTX-1080, CUDA 8.0
Is my CUDA 8.0 setup broken?
Here is the output of the same query:

ii  libcudart7.5:amd64                         7.5.18-0ubuntu1                               amd64        NVIDIA CUDA Runtime Library
ii  nvidia-367                                 367.44-0ubuntu0.16.04.2                       amd64        NVIDIA binary driver - version 367.44
ii  nvidia-cuda-dev                            7.5.18-0ubuntu1                               amd64        NVIDIA CUDA development files
ii  nvidia-cuda-doc                            7.5.18-0ubuntu1                               all          NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                            7.5.18-0ubuntu1                               amd64        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit                        7.5.18-0ubuntu1                               amd64        NVIDIA CUDA development toolkit
ii  nvidia-opencl-dev:amd64                    7.5.18-0ubuntu1                               amd64        NVIDIA OpenCL development files
ii  nvidia-opencl-icd-367                      367.44-0ubuntu0.16.04.2                       amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                               0.8.2                                         amd64        Tools to enable NVIDIA's Prime
ii  nvidia-profiler                            7.5.18-0ubuntu1                               amd64        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                            370.28-0ubuntu0~gpu16.04.1                    amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-visual-profiler                     7.5.18-0ubuntu1                               amd64        NVIDIA Visual Profiler for CUDA and OpenCL

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.44                 Driver Version: 367.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
|  0%   29C    P8     9W / 180W |    209MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       981    G   /usr/lib/xorg/Xorg                             110MiB |
|    0      1426    G   compiz                                          96MiB |
+-----------------------------------------------------------------------------+

lukeyeager · 2016-10-27T16:35:31Z

You clearly don't have CUDA 8.0 installed, you have CUDA 7.5 (see 7.5.18-0ubuntu1). Also, you're installing Canonical's packages which aren't supported by NVIDIA (as far as I know).

Please follow these instructions to install CUDA (here's the download site). You'll probably also need to purge all the nvidia-cuda-* packages to make way for the cuda-*-8-0 packages.
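Before purging, it may be worth listing exactly which installed packages match nvidia-cuda-*. This is a minimal sketch; the function name `list_nvidia_cuda_packages` is hypothetical, and it only prints names without removing anything.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print the names of
# installed ("ii") packages whose name starts with nvidia-cuda-.
list_nvidia_cuda_packages() {
    awk '$1 == "ii" && $2 ~ /^nvidia-cuda-/ {
        name = $2
        sub(/:.*$/, "", name)   # drop an architecture suffix such as :amd64
        print name
    }'
}

# Usage (review the list first; -s only simulates the purge):
#   dpkg -l | list_nvidia_cuda_packages
#   sudo apt-get -s purge $(dpkg -l | list_nvidia_cuda_packages)
```

Dropping the `-s` flag performs the real purge once the simulated output looks right.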

xhuvom changed the title ~~Digits 5.1 DetectNet training error for CUDA 8.0 in Ubuntu 14.04 backed by GTX-1080~~ Digits 5.1 DetectNet training error for CUDA 8.0 on Ubuntu 14.04 backed by GTX-1080 Oct 19, 2016

This was referenced Oct 20, 2016

cuda problem when training the model BVLC/caffe#2417

Closed

panic: CURAND_STATUS_LAUNCH_FAILURE on GTX1080 mumax/3#77

Closed

lukeyeager added the question label Oct 20, 2016

lukeyeager mentioned this issue Oct 20, 2016

DIGITS running from nvidia-docker gives "ERROR: Check failed:" for AlexNet model with CUDA 8, cuDNN 5.1 NVIDIA/nvidia-docker#221

Closed

lukeyeager closed this as completed Oct 20, 2016

lukeyeager mentioned this issue Nov 18, 2016

Pascal boards: CURAND_STATUS_LAUNCH_FAILURE NVIDIA/caffe#270

Closed
