
Minimum CUDA arch == compute capability 2.0? #12

Closed
reedscot opened this issue Dec 10, 2013 · 11 comments

Comments

@reedscot

I tried running Caffe with Nvidia GTX 470 and GTX 570 GPUs, which have compute capability 2.0. While the MNIST demo worked, it failed on the ImageNet pipeline with the following CUDA-related error:

...
I1209 00:40:23.426077 21877 net.cpp:142] Network initialization done.
I1209 00:40:23.426111 21877 solver.cpp:36] Solver scaffolding done.
I1209 00:40:23.426146 21877 solver.cpp:44] Solving CaffeNet
F1209 00:40:23.521303 21877 relu_layer.cu:54] Cuda kernel failed. Error: invalid configuration argument
*** Check failure stack trace: ***
@ 0x7f9113749b5d google::LogMessage::Fail()
@ 0x7f911374db77 google::LogMessage::SendToLog()
@ 0x7f911374b9f9 google::LogMessage::Flush()
@ 0x7f911374bcfd google::LogMessageFatal::~LogMessageFatal()
@ 0x444ad5 caffe::ReLULayer<>::Forward_gpu()
@ 0x42a1ba caffe::Net<>::ForwardPrefilled()
@ 0x41d513 caffe::Solver<>::Solve()
@ 0x40b46d main
@ 0x3d8a01ecdd (unknown)
@ 0x40b2c9 (unknown)

When I try on an Nvidia Titan GPU (compute capability 3.5), it works fine. So I suspect Caffe may require compute capability 3.0 or higher.

@mavenlin
Contributor

@reedscot How is the speed for imagenet?

@reedscot
Author

On an Nvidia Titan GPU it finishes 1000 iterations in around 10 minutes. (By 'iterations' I mean the counter displayed as output during training; I am not sure whether each iteration passes through the entire training set or just a subset.) However, on my machine it slows down quite a bit as memory consumption inexorably grows to almost 100%. By ~5000 iterations it is basically stuck, possibly thrashing. So I am wondering if there is a memory leak, or some memory that should be freed each iteration but is not. I observe the same thing whether I set solver_mode to 0 or 1 (CPU or GPU). Other than this everything seems to work (I can complete MNIST training, for example).

@SWu

SWu commented Dec 21, 2013

I ran into the same error as you did, but the issue isn't missing compute 3+ functionality; it's an architectural limitation prior to compute 3.0. In particular, for large networks you run out of blocks per grid dimension (compute 2.0 allows only 65535 blocks per dimension, while 3.0 bumped it to 2^31 - 1). This is easily remedied by making the grid 2D, which gives you 65535^2 total available blocks (or even 3D if so desired), and changing all the thread index computations to:

    int index = threadIdx.x + (blockIdx.x + blockIdx.y * gridDim.x) * blockDim.x;

After making this change, I ran into another error about an insufficient number of registers for some max pooling layers, so I also had to reduce the number of threads per block from 1024 to 512.

For reference, I am running the imagenet architecture (with a few small tweaks) on a Tesla M2090.

@kloudkl
Contributor

kloudkl commented Jan 13, 2014

The problem was also encountered on an NVIDIA GeForce GTX 560 Ti with compute capability 2.1. The error message "Cuda kernel failed. Error: invalid configuration argument" indicates that the original problem was indeed caused by not generating a PTX back-end target for GPUs with compute capability less than 3.0.

It has been solved by commit b5badf7 "Add CUDA gencode for all 2x & 3x arch compute capability combinations".

@SWu

SWu commented Jan 14, 2014

Even with CUDA code generated for the 2.x arch via the compiler switch, though, the problem remains that with large networks, i.e. the included imagenet sample, the issues above still prevent the code from running, since you will run out of block indices and registers.

@everanurag

Hi

I have been trying to run Caffe on ImageNet with a GTX 660 Ti graphics card that has 3 GB of RAM, and I am getting a cudaMalloc error while allocating memory for layer params. Does this mean the imagenet configuration cannot be supported on this hardware and I need to upgrade to 6 GB?

Alternatively, what would be the minimum GPU spec (RAM, etc.) for running the imagenet configurations as provided in the package?

@sguada
Contributor

sguada commented Jan 27, 2014

You can reduce the batch sizes in the training and test prototxt files to reduce the memory requirements.

Sergio
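For example, in the old V0-style prototxt format of that era the batch size is a field on the data layer; the field name below matches that format as best I recall, and the values are purely illustrative, so check your own train/test files:

```
layers {
  layer {
    name: "data"
    type: "data"
    # ... other data-layer fields unchanged ...
    batchsize: 64   # e.g. reduced from 256; halving it roughly halves activation memory
  }
}
```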


@everanurag

Just tried reducing the batch size in the prototxt file; still getting the following error. Any thoughts?

F0126 21:48:38.671995 6452 syncedmem.cpp:48] Check failed: (cudaMalloc(&gpu_ptr_, size_)) == cudaSuccess (38 vs. 0)
*** Check failure stack trace: ***
@ 0x7f0b88046b7d google::LogMessage::Fail()
@ 0x7f0b88048c7f google::LogMessage::SendToLog()
@ 0x7f0b8804676c google::LogMessage::Flush()
@ 0x7f0b8804951d google::LogMessageFatal::~LogMessageFatal()
@ 0x4335fc caffe::SyncedMemory::mutable_gpu_data()
@ 0x423512 caffe::Blob<>::mutable_gpu_data()
@ 0x460b91 caffe::DataLayer<>::Forward_gpu()
@ 0x42a3c2 caffe::Net<>::ForwardPrefilled()
@ 0x422380 caffe::Solver<>::Solve()
@ 0x40d265 main
@ 0x7f0b8670676d (unknown)
@ 0x40e51d (unknown)

6452 Aborted (core dumped) GLOG_logtostderr=1

@Yangqing
Member

cudaError_t value 38 means no CUDA-capable device is available, so maybe double-check your hardware / driver installation.

(For error codes, check driver_types.h)

Yangqing


@everanurag

It runs the MNIST demo in GPU mode fine, so could this be due to the large ImageNet network needing a GPU with more RAM (currently 3 GB for me, GTX 660 Ti)?

@jamt9000
Contributor

This still seems to be a problem. I would like to help make Caffe work well on CUDA compute capability 2.x devices for ImageNet-scale configurations.

@SWu's workaround solves the block indexing problem, but there are some questions about how to implement it in practice, since it would require any kernel to account for the fact that the grid may be 2D.

The most straightforward way would be something like this:

  • Modify CAFFE_GET_BLOCKS to potentially return a 2D dim3
    • The 2D block dimensions could be computed as:

      int n = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
      dim3 blocks(ceil(sqrt(n)), ceil(sqrt(n)));

  • Modify the expressions in the kernels for getting a 1D index
    • perhaps by making a macro like CAFFE_GET_1D_INDEX()

However, there is probably a more principled way to account for the 2D structure in the first place, which would require more drastic rewriting of the kernels.
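A hedged sketch of what those two changes might look like (the names here are illustrative, not from the Caffe tree, and the device-side macro is shown only as a comment since it needs CUDA builtins):

```cpp
#include <cmath>

const int CAFFE_CUDA_NUM_THREADS = 512;  // 512 rather than 1024, per the
                                         // register pressure noted above

struct Dim3 { unsigned x, y, z; };       // host-side stand-in for CUDA's dim3

// 1. CAFFE_GET_BLOCKS returns a (near-)square 2D grid instead of a 1D count,
//    so each dimension stays under the 65535 cap of compute 2.x.
inline Dim3 caffe_get_blocks_2d(int N) {
  int n = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  unsigned side = static_cast<unsigned>(std::ceil(std::sqrt(double(n))));
  return Dim3{side, side, 1};
}

// 2. Inside each kernel, the 1D element index would become:
//      #define CAFFE_GET_1D_INDEX() \
//        (threadIdx.x + (blockIdx.x + blockIdx.y * gridDim.x) * blockDim.x)
//    Threads with index >= N must still early-return, since the square
//    grid over-allocates up to side*side - n blocks.
```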

@OpenHero OpenHero mentioned this issue Jul 16, 2014
OpenHero added a commit to OpenHero/caffe that referenced this issue Jul 16, 2014
… of a grid of thread blocks is 65535, so the kernel will crash whenever CAFFE_GET_BLOCKS returns more than 65535 blocks, e.g. on Fermi-architecture GPUs.

The crash happens in src/caffe/layers/relu_layer.cu, line 29; see BVLC#282 and BVLC#12.
![image](https://cloud.githubusercontent.com/assets/5321224/3595637/4c27157c-0cb8-11e4-8009-c40c88ac1500.png)

Fixed it with:

    // CUDA: number of blocks for threads, clamped to the device's grid limit.
    inline int CAFFE_GET_BLOCKS(const int N) {
      int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
      // The ternary must return the smaller value, i.e. clamp to the limit
      // (kernels must then stride over any elements beyond the clamped grid).
      return num_blocks > Caffe::cuProp().maxGridSize[0] ?
          Caffe::cuProp().maxGridSize[0] : num_blocks;
    }
andpol5 pushed a commit to andpol5/caffe that referenced this issue Aug 24, 2016
mbassov pushed a commit to mbassov/caffe that referenced this issue Nov 10, 2017
DEV-26376: Recode python layer to C++ in detection net

9 participants