
Fix compatibility with CUDA 2.1 #5002

Closed

wants to merge 5 commits

Conversation

uvNikita

@uvNikita uvNikita commented Nov 18, 2016

Limit CAFFE_GET_BLOCKS to return no more than the maximum number of blocks the device supports.

See this and this issue, and #707.

The problem is that CAFFE_GET_BLOCKS returns more blocks than devices with compute capability < 3.0 can handle. On this wiki page, the table row "Maximum x-dimension of a grid of thread blocks" shows that the maximum for compute capability < 3.0 is 65535.

I'm getting this error on a GTX 580 when trying to train on images with resolution > 198x198, independently of batch_size.

Limit CAFFE_GET_BLOCKS to return not more than max device supported blocks number
@williford
Contributor

Just linking to your post on caffe-users:
https://groups.google.com/d/topic/caffe-users/8mRYC_uXVzM/discussion

CUDA_CHECK(cudaGetDeviceProperties(&prop, device));
int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
int max_blocks = prop.maxGridSize[0];
return num_blocks < max_blocks ? num_blocks : max_blocks;
Contributor

Is there a reason why you don't use return std::min<int>(num_blocks, max_blocks); ?

Author

Not really :) I can change it, thank you.

@flx42
Contributor

flx42 commented Nov 19, 2016

This helper function is called for each kernel, for each layer, so it's clearly on the critical path for the forward and backward passes.
As a result, it doesn't seem like a good idea to add O(2 * num_layers) calls to the CUDA API just to support an architecture that is being deprecated in the CUDA toolkit.

@uvNikita
Author

This is a very valid point. What we can do is make the API call only once and cache the result for subsequent calls of the function.

The title of the PR is probably a bit misleading. The point is that this function should be fixed in any case, since it is broken for every architecture; the error just currently occurs more often on old devices because of their smaller limit on the maximum grid size. Eventually the current limits for compute capability 3.0 will not be enough, and this error will start to appear on new devices as well. The main point is that CAFFE_GET_BLOCKS should never return a value larger than the maximum the architecture supports. It is an additional bonus that this fix also allows running Caffe on older cards.

Therefore my main point is that this function should be fixed in any case, and it's very good that this is the only change necessary to support a large number of old devices which are still fully suitable for training on real-world tasks.

Declare the device properties variable as static, so it is initialized only once via a CUDA API call
@uvNikita
Author

The current version makes the API call only on the first execution of the function and reuses the cached value for future calls.

@flx42
Contributor

flx42 commented Nov 19, 2016

Eventually currently supported limits for CUDA 3.0 will not be enough and this error will start to appear on new devices as well.

Are you concerned we might also have a problem in the future if we ever launch more than 2147483647 CUDA blocks? :)

@uvNikita
Author

Well, the current implementation does not affect performance in any way (since the value is cached), so by adding these 10 lines of code we enable support for a huge number of users with old GPUs.

What could be the reasons not to add this support? The code is semantically correct, no performance penalties and happy users :)

@flx42
Contributor

flx42 commented Nov 19, 2016

I'm not a maintainer so it's not up to me.

But:

  • You are caching the properties of the first device encountered, which seems arbitrary; the devices might not all be the same, and then you haven't solved your problem.
  • Initialization of a local static variable is not thread-safe before C++11.
  • Your code assumes it's fine to silently launch fewer blocks than requested. That's only OK if the kernel was written in a way where threads can process multiple elements, for instance by using CUDA_KERNEL_LOOP. If that's not the case, not enough threads will be launched and your input will be only partially processed; I'd rather receive an error than get silent truncation of my output.
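For context, Caffe's CUDA_KERNEL_LOOP is a grid-stride loop, so kernels written with it do tolerate a clamped grid; a sketch of the distinction:

```cuda
// Caffe's grid-stride loop macro: each thread strides through the input,
// so a kernel written with it still processes all n elements even when
// fewer than ceil(n / blockDim.x) blocks are launched.
#define CUDA_KERNEL_LOOP(i, n) \
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
       i < (n); \
       i += blockDim.x * gridDim.x)

// Safe under grid truncation (illustrative kernel, not from this PR).
__global__ void scale_kernel(const int n, const float alpha, float* x) {
  CUDA_KERNEL_LOOP(index, n) {
    x[index] *= alpha;
  }
}

// By contrast, a kernel that assumes one thread per element, e.g.
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
//   if (i < n) x[i] *= alpha;
// silently leaves the tail of x unprocessed if the grid is clamped.
```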

@uvNikita
Author

Thank you for feedback.

Yes, this is a valid point. I was trying to fix this issue, but it seems I lack experience in C++ and I'm struggling with its include system. Here is the code that should work with both multiple and single GPUs; the only problem is that I can't include "caffe/caffe.hpp", where Caffe is declared (I need it to access Caffe::device_id):

// CUDA: number of blocks for threads.
inline int CAFFE_GET_BLOCKS(const int N) {
  // maximum number of blocks is limited by the device with the smallest compute capability
  static int max_blocks = -1;
  if (max_blocks == -1) {
    int count = 0;
    cudaGetDeviceCount(&count);

    cudaDeviceProp device_prop;
    cudaGetDeviceProperties(&device_prop, Caffe::device_id);
    max_blocks = device_prop.maxGridSize[0];

    for (int i = 1; i < count; ++i) {
      cudaGetDeviceProperties(&device_prop, i);
      max_blocks = std::min<int>(max_blocks, device_prop.maxGridSize[0]);
    }
  }

  int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  return std::min<int>(num_blocks, max_blocks);
}

Therefore, if somebody can help me include this header somehow, I can continue working on this issue.
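For what it's worth, one way to sidestep the Caffe::device_id include entirely is to fold the first device into the loop and rely on C++11's thread-safe initialization of function-local statics. This is only a sketch under the assumption that taking the minimum over all visible devices is acceptable, and caffe_max_blocks is a hypothetical helper name, not code from this PR:

```cpp
#include <algorithm>
#include <climits>
#include <cuda_runtime.h>

// Sketch: cache the smallest maxGridSize[0] across all visible devices.
// Looping from device 0 removes the need for Caffe::device_id, and the
// C++11 "magic static" makes the one-time initialization thread-safe.
inline int caffe_max_blocks() {
  static const int max_blocks = [] {
    int count = 0;
    cudaGetDeviceCount(&count);
    int result = INT_MAX;
    cudaDeviceProp prop;
    for (int i = 0; i < count; ++i) {
      cudaGetDeviceProperties(&prop, i);
      result = std::min(result, prop.maxGridSize[0]);
    }
    return result;
  }();
  return max_blocks;
}
```

This still doesn't address the per-device differences or the silent-truncation concern raised above; it only removes the include dependency and the pre-C++11 thread-safety issue.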

As for the two other points, I'm not sure how relevant they are for this project, so maybe one of the maintainers can say something about that.

Found this open issue about the same problem: #4287.

@uvNikita uvNikita closed this Jan 29, 2017
@uvNikita uvNikita deleted the patch-1 branch January 29, 2017 21:06
@uvNikita uvNikita restored the patch-1 branch January 29, 2017 21:08
@uvNikita uvNikita reopened this Jan 29, 2017
@shelhamer
Member

Thank you for your effort on this, but I agree with the issues raised in #5002 (comment) and support for CUDA < 3.0 is not a common request.
