… of a grid of thread blocks is 65535, so the kernel launch fails whenever CAFFE_GET_BLOCKS returns more than 65535. This is the case on Fermi-architecture GPUs, for example.
The crash happens in src/caffe/layers/relu_layer.cu, line 29, as in BVLC#282 and BVLC#12.
![image](https://cloud.githubusercontent.com/assets/5321224/3595637/4c27157c-0cb8-11e4-8009-c40c88ac1500.png)
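To make the limit concrete, here is a small host-side sketch of the arithmetic. The thread count (1024 as a stand-in for CAFFE_CUDA_NUM_THREADS) and the blob shape (256 x 96 x 55 x 55) are assumptions chosen for illustration and are not taken from the log; only the 65535 limit comes from the text above.

```cpp
// Illustrative constants: kThreadsPerBlock is an assumed stand-in for
// CAFFE_CUDA_NUM_THREADS; kMaxGridDimX is the grid limit quoted above.
const int kThreadsPerBlock = 1024;
const int kMaxGridDimX = 65535;

// Same ceiling division as the original CAFFE_GET_BLOCKS.
inline int BlocksNeeded(const int n) {
  return (n + kThreadsPerBlock - 1) / kThreadsPerBlock;
}

// True when a 1-D launch over n elements would ask for more blocks
// than the grid's x-dimension allows.
inline bool ExceedsGridLimit(const int n) {
  return BlocksNeeded(n) > kMaxGridDimX;
}
```

Under these assumptions, a blob of 256 * 96 * 55 * 55 = 74,342,400 elements needs 72,600 blocks, which is over the limit, while anything up to 65535 * 1024 = 67,107,840 elements still fits. That would be consistent with the small mnist and cifar10 demos running fine while the larger ilsvrc12 model fails.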
Fixed it with:
// CUDA: number of blocks for threads.
inline int CAFFE_GET_BLOCKS(const int N) {
  // Original: return (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  const int num_blocks = (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  // Clamp to the device's grid limit: return the smaller of the two values.
  const int max_blocks = Caffe::cuProp().maxGridSize[0];
  return num_blocks < max_blocks ? num_blocks : max_blocks;
}
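One caveat with capping the block count: if each thread handles exactly one element, the elements beyond `blocks * threads` are silently skipped, so the kernels also need to loop with a stride of the total thread count (a grid-stride loop). The sketch below emulates both patterns on the CPU. The function names and the small sizes in the usage note are hypothetical and only illustrate the idea; this is not Caffe's code.

```cpp
#include <algorithm>
#include <vector>

// Ceiling division clamped to the grid limit (all parameters illustrative).
inline int ClampedBlocks(int n, int threads_per_block, int max_grid_x) {
  const int needed = (n + threads_per_block - 1) / threads_per_block;
  return std::min(needed, max_grid_x);
}

// CPU emulation of a grid-stride kernel: each (block, thread) pair starts
// at its global index and advances by the total number of threads.
// Returns how many elements were visited exactly once.
inline int ElementsVisitedOnce(int n, int blocks, int threads_per_block) {
  std::vector<int> visits(n, 0);
  const int stride = blocks * threads_per_block;
  for (int b = 0; b < blocks; ++b) {
    for (int t = 0; t < threads_per_block; ++t) {
      for (int i = b * threads_per_block + t; i < n; i += stride) {
        ++visits[i];
      }
    }
  }
  return static_cast<int>(std::count(visits.begin(), visits.end(), 1));
}

// One-thread-per-element pattern: only the first blocks * threads
// elements are ever touched.
inline int ElementsVisitedWithoutLoop(int n, int blocks, int threads_per_block) {
  return std::min(n, blocks * threads_per_block);
}
```

For example, with 1000 elements, 8 threads per block, and a pretend limit of 16 blocks, `ClampedBlocks` returns 16. The grid-stride emulation then visits all 1000 elements exactly once, whereas the one-thread-per-element pattern reaches only the first 128.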
Hi,
We checked out the master branch on March 28. While the mnist and cifar10 demos worked fine on a Tesla M2090, the ilsvrc12 training script gave the following error:
...
I0401 15:13:18.902433 3143 net.cpp:173] Collecting Learning Rate and Weight Decay.
I0401 15:13:18.902487 3143 net.cpp:166] Network initialization done.
I0401 15:13:18.902521 3143 net.cpp:167] Memory required for Data 210114408
I0401 15:13:18.902660 3143 solver.cpp:36] Solver scaffolding done.
I0401 15:13:18.902760 3143 solver.cpp:47] Solving CaffeNet
F0401 15:13:19.040458 3143 relu_layer.cu:29] Cuda kernel failed. Error: invalid configuration argument
*** Check failure stack trace: ***
@ 0x2b06c5dcbb7d google::LogMessage::Fail()
@ 0x2b06c5dcdc7f google::LogMessage::SendToLog()
@ 0x2b06c5dcb76c google::LogMessage::Flush()
@ 0x2b06c5dce51d google::LogMessageFatal::~LogMessageFatal()
@ 0x48188c caffe::ReLULayer<>::Forward_gpu()
@ 0x431ada caffe::Net<>::ForwardPrefilled()
@ 0x423cb8 caffe::Solver<>::Solve()
@ 0x40e645 main
@ 0x2b06c806176d (unknown)
@ 0x40fced (unknown)
Aborted (core dumped)
Done.
Is this related to the issue described in #12?