
Fix compatibility with CUDA 2.1 #5002

Closed

wants to merge 5 commits

Conversation

uvNikita

@uvNikita uvNikita commented Nov 18, 2016

Limit CAFFE_GET_BLOCKS to return no more than the maximum number of blocks the device supports.

See this and this issue, and #707.

The problem is that CAFFE_GET_BLOCKS returns more blocks than devices with compute capability < 3.0 can handle. On this wiki page, the table row "Maximum x-dimension of a grid of thread blocks" shows that the maximum for compute capability < 3.0 is 65535.

I'm getting this error on a GTX 580 when trying to train on images with resolution > 198x198, independently of batch_size.

Limit CAFFE_GET_BLOCKS to return not more than max device supported blocks number
@williford
Contributor

Just linking to your post on caffe-users:
https://groups.google.com/d/topic/caffe-users/8mRYC_uXVzM/discussion

CUDA_CHECK(cudaGetDeviceProperties(&prop, device));
int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
int max_blocks = prop.maxGridSize[0];
return num_blocks < max_blocks ? num_blocks : max_blocks;
Contributor

Is there a reason why you don't use return std::min<int>(num_blocks, max_blocks); ?

Author

Not really :) I can change it, thank you.

@flx42
Contributor

flx42 commented Nov 19, 2016

This helper function is called for each kernel, for each layer, so it's clearly on the critical path for the forward and backward passes.
As a result, it doesn't seem like a good idea to add O(2 * num_layers) calls to the CUDA API just to support an architecture that is being deprecated in the CUDA toolkit.

@uvNikita
Author

This is a very valid point. What we can do is make the API call only once and cache the result for subsequent calls of the function.

The title of the PR is probably a bit misleading. The point is that this function should be fixed in any case, since it is broken for every architecture; the error just currently occurs more often on old devices because of their smaller limit on the maximum grid size. Eventually the current limits for compute capability 3.0 will not be enough, and this error will start to appear on new devices as well. The main point is that CAFFE_GET_BLOCKS should never return a value larger than the maximum the architecture supports. It is an additional bonus that this fix also allows running Caffe on older cards.

Therefore my main point is that this function should be fixed in any case, and it's very good that this is the only change necessary to support a large number of old devices which are still fully suitable for training on real-world tasks.

Declare the device properties variable as static, so it is initialized only once via a CUDA API call
@uvNikita
Author

The current version makes the API call only on the first execution of the function and reuses the cached value for future calls.

@flx42
Contributor

flx42 commented Nov 19, 2016

Eventually currently supported limits for CUDA 3.0 will not be enough and this error will start to appear on new devices as well.

Are you concerned we might also have a problem in the future if we ever launch more than 2147483647 CUDA blocks? :)

@uvNikita
Author

Well, the current implementation does not affect performance in any way (since the value is cached), so by adding these 10 lines of code we enable support for a huge number of users with old GPUs.

What could be the reasons not to add this support? The code is semantically correct, no performance penalties and happy users :)

@flx42
Contributor

flx42 commented Nov 19, 2016

I'm not a maintainer so it's not up to me.

But:

  • You are caching the properties of the first device encountered, which seems arbitrary; the devices might not all be the same, and then you haven't solved your problem.
  • Initialization of a local static variable is not thread-safe before C++11.
  • Your code assumes it's fine to silently launch fewer blocks than requested. That's only OK if the kernel was written in a way where threads can process multiple elements, for instance by using CUDA_KERNEL_LOOP. If that's not the case, not enough threads will be launched and your input will be only partially processed; I'd rather receive an error than get silent truncation of my output.
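For context, Caffe's CUDA_KERNEL_LOOP is a grid-stride loop, so kernels written with it do tolerate a clamped grid; a sketch of the distinction:

```cuda
// Caffe's grid-stride loop macro: each thread strides through the input,
// so a kernel written with it still processes all n elements even when
// fewer than ceil(n / blockDim.x) blocks are launched.
#define CUDA_KERNEL_LOOP(i, n) \
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
       i < (n); \
       i += blockDim.x * gridDim.x)

// Safe under grid truncation (illustrative kernel, not from this PR).
__global__ void scale_kernel(const int n, const float alpha, float* x) {
  CUDA_KERNEL_LOOP(index, n) {
    x[index] *= alpha;
  }
}

// By contrast, a kernel that assumes one thread per element, e.g.
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
//   if (i < n) x[i] *= alpha;
// silently leaves the tail of x unprocessed if the grid is clamped.
```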

@uvNikita
Author

Thank you for feedback.

Yes, this is a valid point. I was trying to fix this issue, but it seems I lack experience in C++ and I'm struggling with its include system. Here is the code that should work with both multiple and single GPUs; the only problem is that I can't include "caffe/caffe.hpp", where Caffe is declared (I need it to access Caffe::device_id):

// CUDA: number of blocks for threads.
inline int CAFFE_GET_BLOCKS(const int N) {
  // maximum number of blocks is limited by the device with the smallest compute capability
  static int max_blocks = -1;
  if (max_blocks == -1) {
    int count = 0;
    cudaGetDeviceCount(&count);

    cudaDeviceProp device_prop;
    cudaGetDeviceProperties(&device_prop, Caffe::device_id);
    max_blocks = device_prop.maxGridSize[0];

    for (int i = 1; i < count; ++i) {
      cudaGetDeviceProperties(&device_prop, i);
      max_blocks = std::min<int>(max_blocks, device_prop.maxGridSize[0]);
    }
  }

  int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  return std::min<int>(num_blocks, max_blocks);
}

Therefore, if somebody can help me include this header somehow, I can continue working on this issue.
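For what it's worth, one way to sidestep the Caffe::device_id include entirely is to fold the first device into the loop and rely on C++11's thread-safe initialization of function-local statics. This is only a sketch under the assumption that taking the minimum over all visible devices is acceptable, and caffe_max_blocks is a hypothetical helper name, not code from this PR:

```cpp
#include <algorithm>
#include <climits>
#include <cuda_runtime.h>

// Sketch: cache the smallest maxGridSize[0] across all visible devices.
// Looping from device 0 removes the need for Caffe::device_id, and the
// C++11 "magic static" makes the one-time initialization thread-safe.
inline int caffe_max_blocks() {
  static const int max_blocks = [] {
    int count = 0;
    cudaGetDeviceCount(&count);
    int result = INT_MAX;
    cudaDeviceProp prop;
    for (int i = 0; i < count; ++i) {
      cudaGetDeviceProperties(&prop, i);
      result = std::min(result, prop.maxGridSize[0]);
    }
    return result;
  }();
  return max_blocks;
}
```

This still doesn't address the per-device differences or the silent-truncation concern raised above; it only removes the include dependency and the pre-C++11 thread-safety issue.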

As for the two other points, I'm not sure how relevant they are for this project, so maybe one of the maintainers can say something about that.

Found this open issue about the same problem: #4287.

@uvNikita uvNikita closed this Jan 29, 2017
@uvNikita uvNikita deleted the patch-1 branch January 29, 2017 21:06
@uvNikita uvNikita restored the patch-1 branch January 29, 2017 21:08
@uvNikita uvNikita reopened this Jan 29, 2017
@shelhamer
Member

Thank you for your effort on this, but I agree with the issues raised in #5002 (comment) and support for CUDA < 3.0 is not a common request.
