Fix compatibility with CUDA 2.1 #5002
Conversation
Limit CAFFE_GET_BLOCKS to return no more than the maximum number of blocks supported by the device
Just linking to your post on caffe-users:
CUDA_CHECK(cudaGetDeviceProperties(&prop, device));
int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
int max_blocks = prop.maxGridSize[0];
return num_blocks < max_blocks ? num_blocks : max_blocks;
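For context, prop and device are declared earlier in the diff hunk, outside the quoted lines. Reconstructed roughly (using cudaGetDevice as an assumption for how device is obtained, and Caffe's existing CUDA_CHECK and CAFFE_CUDA_NUM_THREADS macros), the patched helper looks something like this:

// Rough reconstruction of the patched helper, not the verbatim diff.
// Caps the 1-D grid size at whatever the current device reports.
inline int CAFFE_GET_BLOCKS(const int N) {
  int device;
  CUDA_CHECK(cudaGetDevice(&device));                  // current device id
  cudaDeviceProp prop;
  CUDA_CHECK(cudaGetDeviceProperties(&prop, device));  // queried on every call here
  int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  int max_blocks = prop.maxGridSize[0];                // max grid x-dimension
  return num_blocks < max_blocks ? num_blocks : max_blocks;
}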
Is there a reason why you don't use return std::min<int>(num_blocks, max_blocks); instead?
Not really :) I can change it, thank you.
This helper function is called for each kernel, for each layer. Therefore it's clearly in the critical path for the forward and backward passes.
This is a very valid point. What we can do is make the API call only once and cache the result for subsequent calls of the function. The title of the PR is probably a bit misleading: this function should be fixed in any case, since it's broken for any architecture; it just currently fails more often on old devices because of their smaller limit on the maximum grid size. Eventually the limits of devices currently supporting CUDA 3.0 will not be enough either, and this error will start to appear on new devices as well. So my main point is that this function should be fixed in any case, and it's very good that this is the only change needed to support the large number of old devices which are still fully suitable for training on real-world tasks.
Declare the device properties variable as static so that it is initialized only once via a CUDA API call
The current version makes the API call only on the first function execution and reuses the value for future calls.
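A minimal sketch of that cached variant (a paraphrase of the description above, not the exact commit; std::min needs <algorithm>):

inline int CAFFE_GET_BLOCKS(const int N) {
  // Query the device properties on the first call only; reuse them afterwards.
  static bool initialized = false;
  static cudaDeviceProp prop;
  if (!initialized) {
    int device;
    CUDA_CHECK(cudaGetDevice(&device));
    CUDA_CHECK(cudaGetDeviceProperties(&prop, device));
    initialized = true;
  }
  const int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  return std::min(num_blocks, prop.maxGridSize[0]);
}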
Are you concerned we might also have a problem in the future if we ever launch more than 2147483647 CUDA blocks? :)
Well, the current implementation does not affect performance in any way (since the value is cached), so by adding these 10 lines of code we enable support for a huge number of users with old GPUs. What could be the reasons not to add this support? The code is semantically correct, there are no performance penalties, and the users are happy :)
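A note on why capping the grid is semantically safe: Caffe's CUDA kernels iterate with the CUDA_KERNEL_LOOP macro, which is a grid-stride loop, so a smaller grid still visits every element; each thread simply loops more than once. Roughly:

// Caffe's kernel loop macro (grid-stride loop), approximately:
//   #define CUDA_KERNEL_LOOP(i, n) \
//     for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); \
//          i += blockDim.x * gridDim.x)
// A typical element-wise kernel therefore works with any grid size:
__global__ void set_kernel(const int n, const float alpha, float* y) {
  CUDA_KERNEL_LOOP(index, n) {
    y[index] = alpha;  // all n elements are covered even with a clamped grid
  }
}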
I'm not a maintainer so it's not up to me. But:
Thank you for the feedback. Yes, this is a valid point. I was trying to fix this issue, but it seems I lack experience in C++ and am struggling with its include system. Here is the code that should work with both multiple and single GPUs; the only problem is that I can't include "caffe/caffe.hpp" where
Therefore, if somebody can help me include this module somehow, I can continue working on this issue. As for the two other points, I'm not sure how relevant they are for this project, so maybe someone from the maintainers can comment on that. I found this open issue about the same problem: #4287.
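Since the multi-GPU snippet mentioned above did not make it into the thread, here is only a hypothetical sketch of one way to cache the limit per device without including caffe/caffe.hpp at all, using just the CUDA runtime API (the function body and the 64-device cap are assumptions, not the author's code):

inline int CAFFE_GET_BLOCKS(const int N) {
  // One cached grid limit per device id; 0 means "not queried yet".
  // 64 is an arbitrary upper bound on the number of GPUs in one machine.
  static int max_blocks[64] = {0};
  int device;
  CUDA_CHECK(cudaGetDevice(&device));  // device bound to the calling host thread
  if (max_blocks[device] == 0) {
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, device));
    max_blocks[device] = prop.maxGridSize[0];
  }
  const int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  return std::min(num_blocks, max_blocks[device]);
}

cudaGetDevice only reads the calling thread's current device, so the per-call cost should stay negligible compared to the cudaGetDeviceProperties query it avoids.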
Thank you for your effort on this, but I agree with the issues raised in #5002 (comment) and support for CUDA < 3.0 is not a common request.
Limit CAFFE_GET_BLOCKS to return no more than the maximum number of blocks supported by the device
See this and this issue, and #707.
The problem is that CAFFE_GET_BLOCKS returns more blocks than devices with compute capability < 3.0 can handle. On this wiki page, in the table row "Maximum x-dimension of a grid of thread blocks", you can see that the maximum dimension for compute capability < 3.0 is 65535. I'm getting this error with a GTX 580 when trying to train on images with resolution > 198x198, independently of batch_size.
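For a rough sense of when the limit is hit, assuming 512 threads per block (CAFFE_CUDA_NUM_THREADS):

// Back-of-the-envelope threshold (assumes 512 threads per block):
// an unclamped 1-D launch overflows a pre-3.0 grid as soon as a single
// blob or intermediate buffer has more than 65535 * 512 elements.
#include <cstdio>
int main() {
  const long long max_grid_x = 65535;  // max grid x-dimension, compute capability < 3.0
  const long long threads = 512;       // CAFFE_CUDA_NUM_THREADS
  std::printf("largest safe N: %lld\n", max_grid_x * threads);  // 33553920
  return 0;
}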