MemoryDataLayer Problem #2334

Closed
TJKlein opened this issue Apr 18, 2015 · 35 comments

TJKlein commented Apr 18, 2015

Hi,

when choosing MemoryDataLayer as the input in Python, I get segmentation faults and Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0) CUBLAS_STATUS_INTERNAL_ERROR.
For small data sets it runs for a while until the Python kernel crashes with the above error; for large data sets the error occurs immediately. Tested on both OS X and Linux with the latest master branch; same issue on both.

However, when using an HDF5 layer as input for exactly the same data / network, it works perfectly. I used the network as defined in the tutorial, adapted to use a MemoryDataLayer:

http://nbviewer.ipython.org/github/BVLC/caffe/blob/tutorial/examples/01-learning-lenet.ipynb

I am also training / testing on the MNIST data set.

layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  memory_data_param {
    batch_size: 50
    channels: 1
    height: 28
    width: 28
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    num_output: 50
    kernel_size: 5
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1"
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}

Setting up the data and the solver

num_elements_train = 50000
num_elements_test = 2000

train_set_x = train_set_x[0:num_elements_train,:].eval()
train_set_y = train_set_y[0:num_elements_train].eval()
test_set_x = test_set_x[0:num_elements_test,:].eval()
test_set_y = test_set_y[0:num_elements_test].eval()

train_set_x = train_set_x.reshape((num_elements_train,1,28,28))
train_set_y = train_set_y.reshape((num_elements_train))

test_set_x = test_set_x.reshape((num_elements_test,1,28,28))
test_set_y = test_set_y.reshape((num_elements_test))


solver.net.set_input_arrays(train_set_x.astype(np.float32),train_set_y.astype(np.float32))
solver.test_nets[0].set_input_arrays(test_set_x.astype(np.float32),test_set_y.astype(np.float32))

Same as in the example code:

niter = 1000
test_interval = 25
# losses will also be stored in the log
train_loss = zeros(niter)
test_acc = zeros(int(np.ceil(niter / test_interval)))
output = zeros((niter, 8, 10))

# the main solver loop
for it in range(niter):
    solver.step(1)  # SGD by Caffe

    # store the train loss
    train_loss[it] = solver.net.blobs['loss'].data

    # store the output on the first test batch
    # (start the forward pass at conv1 to avoid loading new data)
    solver.test_nets[0].forward(start='conv1')
    output[it] = solver.test_nets[0].blobs['ip2'].data[:8]

    # run a full test every so often
    # (Caffe can also do this for us and write to a log, but we show here
    #  how to do it directly in Python, where more complicated things are easier.)
    if it % test_interval == 0:
        print 'Iteration', it, 'testing...'
        correct = 0
        for test_it in range(100):
            solver.test_nets[0].forward()
            correct += sum(solver.test_nets[0].blobs['ip2'].data.argmax(1)
                           == solver.test_nets[0].blobs['label'].data)
        test_acc[it // test_interval] = correct / 1e4
evolu8 commented Apr 20, 2015

How are you initially setting train_set_x and y?

TJKlein (Author) commented Apr 20, 2015

I tried with MNIST data


from logistic_sgd import load_data


datasets = load_data('mnist.pkl.gz')

train_set_x, train_set_y = datasets[0]
valid_set_x, valid_set_y = datasets[1]
test_set_x, test_set_y = datasets[2]


num_elements_train = 10000
num_elements_test = 100

train_set_x = train_set_x[0:num_elements_train,:].eval()
train_set_y = train_set_y[0:num_elements_train].eval()

test_set_x = test_set_x[0:num_elements_test,:].eval()
test_set_y = test_set_y[0:num_elements_test].eval()


train_set_x = train_set_x.reshape((num_elements_train,1,28,28))
train_set_y = train_set_y.reshape((num_elements_train))

test_set_x = test_set_x.reshape((num_elements_test,1,28,28))
test_set_y = test_set_y.reshape((num_elements_test))

and also tried zero data:

train_set_x = np.zeros(train_set_x.shape,np.float32)
train_set_y = np.zeros(train_set_y.shape, np.float32)

test_set_x = np.zeros(test_set_x.shape,np.float32)
test_set_y = np.zeros(test_set_y.shape, np.float32)

evolu8 commented Apr 25, 2015

Will hopefully play with this this weekend, as it's something I need to achieve soon. Hope I'm able to shed some light.

longjon added the bug label May 8, 2015

longjon (Contributor) commented May 8, 2015

Does this happen in DEBUG mode? Can you provide a backtrace? What's your distro/hardware?

TJKlein (Author) commented May 8, 2015

I tested it on several machines, such as Ubuntu 14.04 with a GeForce GTX 760, and on a MacBook Pro (Late 2013) running OS X 10.10.3.

It happens in release mode.

The stack trace:
F0508 01:16:59.312582 1902506752 math_functions.cu:123] Check failed: status == CUBLAS_STATUS_SUCCESS (14 vs. 0) CUBLAS_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***

longjon (Contributor) commented May 8, 2015

Right, can you try in DEBUG mode and see if you still get the same error? DEBUG has additional checks that might reveal the source of the issue.

That's just the error above, not the stack trace; was anything printed below? (I believe there might be an issue where stack traces don't get printed in pycaffe, so there might not be; in that case you would need to use gdb to get the trace.)

TJKlein (Author) commented May 8, 2015

Yes, still the same error in DEBUG mode. In pycaffe there is no more info than what I previously posted.
With gdb I get the following:

CUBLAS_STATUS_SUCCESS (14 vs. 0)  CUBLAS_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***

Program received signal SIGABRT, Aborted.
0x00007ffff6d3ecc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.

longjon (Contributor) commented May 8, 2015

Okay, thanks. In gdb, type bt to get a backtrace...

TJKlein (Author) commented May 8, 2015


#0  0x00007ffff6d3ecc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff6d420d8 in __GI_abort () at abort.c:89
#2  0x00007fffe9d4bd81 in ?? () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#3  0x00007fffe9d4bdaa in google::LogMessage::Fail() () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#4  0x00007fffe9d4bce4 in google::LogMessage::SendToLog() () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#5  0x00007fffe9d4b6e6 in google::LogMessage::Flush() () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#6  0x00007fffe9d4e687 in google::LogMessageFatal::~LogMessageFatal() ()
   from /usr/lib/x86_64-linux-gnu/libglog.so.0
#7  0x00007fffea2c4ae0 in caffe::caffe_gpu_asum<float> (n=50, x=0x2301fc5c00, y=0x7fffffffd420)
    at src/caffe/util/math_functions.cu:123
#8  0x00007fffea2eaf4c in caffe::SoftmaxWithLossLayer<float>::Forward_gpu (this=0x65543a0, bottom=..., 
    top=...) at src/caffe/layers/softmax_loss_layer.cu:52
#9  0x00007fffea1d4ba9 in caffe::Layer<float>::Forward (this=0x65543a0, bottom=..., top=...)
    at ./include/caffe/layer.hpp:421
#10 0x00007fffea1c67ab in caffe::Net<float>::ForwardFromTo (this=0x6550790, start=0, end=8)
    at src/caffe/net.cpp:474
#11 0x00007fffea1c64ff in caffe::Net<float>::ForwardPrefilled (this=0x6550790, loss=0x7fffffffd62c)
    at src/caffe/net.cpp:494
#12 0x00007fffea1c6940 in caffe::Net<float>::Forward (this=0x6550790, bottom=..., loss=0x7fffffffd62c)
    at src/caffe/net.cpp:508
#13 0x00007fffea1c7255 in caffe::Net<float>::ForwardBackward (this=0x6550790, bottom=...)
    at ./include/caffe/net.hpp:80
#14 0x00007fffea1ee907 in caffe::Solver<float>::Step (this=0x5e0b910, iters=1) at src/caffe/solver.cpp:178

#15 0x00007fffeaf94b12 in boost::python::objects::caller_py_function_impl<boost::python::detail::caller<void (caffe::Solver<float>::*)(int), boost::python::default_call_policies, boost::mpl::vector3<void, caffe::Solver<float>&, int> > >::operator()(_object*, _object*) () from /home/tjk/caffe/python/caffe/_caffe.so
#16 0x00007fffe950f64a in boost::python::objects::function::call(_object*, _object*) const ()
   from /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.54.0
#17 0x00007fffe950f9b8 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.54.0
#18 0x00007fffe9519c93 in boost::python::handle_exception_impl(boost::function0<void>) ()
   from /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.54.0
#19 0x00007fffe950e2c3 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.54.0
#20 0x00007ffff7a40323 in PyObject_Call (func=0xb1adf0, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#21 0x00007ffff7aef362 in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fffffffdb48, 
    func=0xb1adf0) at Python/ceval.c:4251
#22 call_function (oparg=<optimized out>, pp_stack=0x7fffffffdb48) at Python/ceval.c:4056
#23 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2679
#24 0x00007ffff7af1c6e in PyEval_EvalCodeEx (co=0x7ffff7ec7c30, globals=<optimized out>, 
    locals=<optimized out>, args=<optimized out>, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:3265
#25 0x00007ffff7af1d82 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, 
    locals=<optimized out>) at Python/ceval.c:667
#26 0x00007ffff7b11ba0 in run_mod (arena=0x65a480, flags=0x7fffffffde40, locals=0x7ffff7f67168, 
    globals=0x7ffff7f67168, filename=<optimized out>, mod=0x6ad9f8) at Python/pythonrun.c:1371
#27 PyRun_FileExFlags (fp=0x6ba110, filename=<optimized out>, start=<optimized out>, 
    globals=0x7ffff7f67168, locals=0x7ffff7f67168, closeit=1, flags=0x7fffffffde40)
    at Python/pythonrun.c:1357
#28 0x00007ffff7b11d7f in PyRun_SimpleFileExFlags (fp=0x6ba110, 
    filename=0x7fffffffe2f9 "memoryLayerBug.py", closeit=1, flags=0x7fffffffde40) at Python/pythonrun.c:949
#29 0x00007ffff7b27664 in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:645
#30 0x00007ffff6d29ec5 in __libc_start_main (main=0x400710 <main>, argc=2, argv=0x7fffffffdf68, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdf58)
    at libc-start.c:287
#31 0x0000000000400649 in _start ()


TJKlein (Author) commented May 8, 2015

Interestingly enough, when I run it in CPU instead of GPU mode, it stops because the labels have negative values (although they do not; in Python they are in the range [0, K], where K is some small positive integer).

F0508 14:25:44.653395 20384 softmax_loss_layer.cpp:68] Check failed: label_value >= 0 (-2147483648 vs. 0) 
*** Check failure stack trace: ***

Program received signal SIGABRT, Aborted.
0x00007ffff6d3ecc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6d3ecc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff6d420d8 in __GI_abort () at abort.c:89
#2  0x00007fffe9d4bd81 in ?? () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#3  0x00007fffe9d4bdaa in google::LogMessage::Fail() () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#4  0x00007fffe9d4bce4 in google::LogMessage::SendToLog() () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#5  0x00007fffe9d4b6e6 in google::LogMessage::Flush() () from /usr/lib/x86_64-linux-gnu/libglog.so.0
#6  0x00007fffe9d4e687 in google::LogMessageFatal::~LogMessageFatal() ()
   from /usr/lib/x86_64-linux-gnu/libglog.so.0
#7  0x00007fffea29ea41 in caffe::SoftmaxWithLossLayer<float>::Forward_cpu (this=0x6568800, bottom=..., 
    top=...) at src/caffe/layers/softmax_loss_layer.cpp:68
#8  0x00007fffea1d4a96 in caffe::Layer<float>::Forward (this=0x6568800, bottom=..., top=...)
    at ./include/caffe/layer.hpp:411
#9  0x00007fffea1c67ab in caffe::Net<float>::ForwardFromTo (this=0x5f37890, start=0, end=8)
    at src/caffe/net.cpp:474
#10 0x00007fffea1c64ff in caffe::Net<float>::ForwardPrefilled (this=0x5f37890, loss=0x7fffffffd62c)
    at src/caffe/net.cpp:494
#11 0x00007fffea1c6940 in caffe::Net<float>::Forward (this=0x5f37890, bottom=..., loss=0x7fffffffd62c)
    at src/caffe/net.cpp:508
#12 0x00007fffea1c7255 in caffe::Net<float>::ForwardBackward (this=0x5f37890, bottom=...)
    at ./include/caffe/net.hpp:80
#13 0x00007fffea1ee907 in caffe::Solver<float>::Step (this=0x5f49dc0, iters=1) at src/caffe/solver.cpp:178
#14 0x00007fffeaf94b12 in boost::python::objects::caller_py_function_impl<boost::python::detail::caller<void (caffe::Solver<float>::*)(int), boost::python::default_call_policies, boost::mpl::vector3<void, caffe::Solver<float>&, int> > >::operator()(_object*, _object*) () from /home/tjk/caffe/python/caffe/_caffe.so
#15 0x00007fffe950f64a in boost::python::objects::function::call(_object*, _object*) const ()
   from /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.54.0
#16 0x00007fffe950f9b8 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.54.0
#17 0x00007fffe9519c93 in boost::python::handle_exception_impl(boost::function0<void>) ()
   from /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.54.0
#18 0x00007fffe950e2c3 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.54.0
#19 0x00007ffff7a40323 in PyObject_Call (func=0xb1b1a0, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#20 0x00007ffff7aef362 in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fffffffdb48, 
    func=0xb1b1a0) at Python/ceval.c:4251
#21 call_function (oparg=<optimized out>, pp_stack=0x7fffffffdb48) at Python/ceval.c:4056
#22 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2679
#23 0x00007ffff7af1c6e in PyEval_EvalCodeEx (co=0x7ffff7ec7c30, globals=<optimized out>, 
    locals=<optimized out>, args=<optimized out>, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:3265
#24 0x00007ffff7af1d82 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, 
    locals=<optimized out>) at Python/ceval.c:667
#25 0x00007ffff7b11ba0 in run_mod (arena=0x65a480, flags=0x7fffffffde40, locals=0x7ffff7f67168, 
    globals=0x7ffff7f67168, filename=<optimized out>, mod=0x6adb38) at Python/pythonrun.c:1371
#26 PyRun_FileExFlags (fp=0x6ba110, filename=<optimized out>, start=<optimized out>, 
    globals=0x7ffff7f67168, locals=0x7ffff7f67168, closeit=1, flags=0x7fffffffde40)
    at Python/pythonrun.c:1357
#27 0x00007ffff7b11d7f in PyRun_SimpleFileExFlags (fp=0x6ba110, 
    filename=0x7fffffffe2f9 "memoryLayerBug.py", closeit=1, flags=0x7fffffffde40) at Python/pythonrun.c:949
#28 0x00007ffff7b27664 in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:645
#29 0x00007ffff6d29ec5 in __libc_start_main (main=0x400710 <main>, argc=2, argv=0x7fffffffdf68, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdf58)
    at libc-start.c:287
#30 0x0000000000400649 in _start ()

TJKlein (Author) commented May 8, 2015

It seems like it doesn't get the right pointer for the training data set.
For the example I used a training set of N=500 and a test set of N=200.
When it switches from one to the other, the error occurs. The labels were filled sequentially from 0 to 9 for this test (image size 28x28).

Here is some output from the memory layer that I print to the console during training/testing:

[...]
--------------------------------------------
Labels: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
pos_: 150
size_: 784
n_: 200
batch_size: 50
--------------------------------------------
Labels: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
pos_: 0
size_: 784
n_: 200
batch_size: 50
--------------------------------------------
Labels: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
I0508 16:39:41.980954  8736 solver.cpp:315]     Test net output #0: loss = 2.30258 (* 1 = 2.30258 loss)
pos_: 50
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
I0508 16:39:42.396299  8736 solver.cpp:189] Iteration 0, loss = 2.30258
I0508 16:39:42.396358  8736 solver.cpp:204]     Train net output #0: loss = 2.30258 (* 1 = 2.30258 loss)
I0508 16:39:42.396375  8736 solver.cpp:464] Iteration 0, lr = 0.01
pos_: 100
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: -0, 0, 0, 0, -0, -0, -0, 0, -0, 0, 0, 0, -0, -0, -0, 0, -0, 0, 0, 0, -0, -0, -0, 0, -0, 0, 0, 0, -0, -0, -0, 0, -0, 0, 0, 0, -0, -0, -0, 0, -0, 0, 0, 0, -0, -0, -0, 0, -0, 0, 
pos_: 150
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: -0.000709686, 0.000210766, 0.00124046, 0.000644926, -0, -0, -0, 0.000817565, -0.000709686, 0.000210766, 0.00124046, 0.000644926, -0, -0, -0, 0.000817565, -0.000709686, 0.000210766, 0.00124046, 0.000644926, -0, -0, -0, 0.000817565, -0.000709686, 0.000210766, 0.00124046, 0.000644926, -0, -0, -0, 0.000817565, -0.000709686, 0.000210766, 0.00124046, 0.000644926, -0, -0, -0, 0.000817565, -0.000709686, 0.000210766, 0.00124046, 0.000644926, -0, -0, -0, 0.000817565, -0.000709686, 0.000210766, 
pos_: 200
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: -0.00070754, 0.00021007, 0, 0, -0.000395355, -0.000157164, -0, 0.000815312, -0.00070754, 0.00021007, 0, 0, -0.000395355, -0.000157164, -0, 0.000815312, -0.00070754, 0.00021007, 0, 0, -0.000395355, -0.000157164, -0, 0.000815312, -0.00070754, 0.00021007, 0, 0, -0.000395355, -0.000157164, -0, 0.000815312, -0.00070754, 0.00021007, 0, 0, -0.000395355, -0.000157164, -0, 0.000815312, -0.00070754, 0.00021007, 0, 0, -0.000395355, -0.000157164, -0, 0.000815312, -0.00070754, 0.00021007, 
pos_: 250
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: -0.000704306, 0.00020873, 0, 0, -0.000393405, -0.000156468, -0, 0.000810378, -0.000704306, 0.00020873, 0, 0, -0.000393405, -0.000156468, -0, 0.000810378, -0.000704306, 0.00020873, 0, 0, -0.000393405, -0.000156468, -0, 0.000810378, -0.000704306, 0.00020873, 0, 0, -0.000393405, -0.000156468, -0, 0.000810378, -0.000704306, 0.00020873, 0, 0, -0.000393405, -0.000156468, -0, 0.000810378, -0.000704306, 0.00020873, 0, 0, -0.000393405, -0.000156468, -0, 0.000810378, -0.000704306, 0.00020873, 
pos_: 300
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: -0.000700939, 0.000205859, 0, 0, -0.000390786, -0.000155487, -0, 0.000802832, -0.000700939, 0.000205859, 0, 0, -0.000390786, -0.000155487, -0, 0.000802832, -0.000700939, 0.000205859, 0, 0, -0.000390786, -0.000155487, -0, 0.000802832, -0.000700939, 0.000205859, 0, 0, -0.000390786, -0.000155487, -0, 0.000802832, -0.000700939, 0.000205859, 0, 0, -0.000390786, -0.000155487, -0, 0.000802832, -0.000700939, 0.000205859, 0, 0, -0.000390786, -0.000155487, -0, 0.000802832, -0.000700939, 0.000205859, 
pos_: 350
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: -0.000698498, 0.000200735, 0, 0, -0.00038731, -0.000153948, -0, 0.000791532, -0.000698498, 0.000200735, 0, 0, -0.00038731, -0.000153948, -0, 0.000791532, -0.000698498, 0.000200735, 0, 0, -0.00038731, -0.000153948, -0, 0.000791532, -0.000698498, 0.000200735, 0, 0, -0.00038731, -0.000153948, -0, 0.000791532, -0.000698498, 0.000200735, 0, 0, -0.00038731, -0.000153948, -0, 0.000791532, -0.000698498, 0.000200735, 0, 0, -0.00038731, -0.000153948, -0, 0.000791532, -0.000698498, 0.000200735, 
pos_: 400
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: -0.000697815, 0.00019223, 0, 0, -0.00038332, -0.000151815, -0, 0.000775141, -0.000697815, 0.00019223, 0, 0, -0.00038332, -0.000151815, -0, 0.000775141, -0.000697815, 0.00019223, 0, 0, -0.00038332, -0.000151815, -0, 0.000775141, -0.000697815, 0.00019223, 0, 0, -0.00038332, -0.000151815, -0, 0.000775141, -0.000697815, 0.00019223, 0, 0, -0.00038332, -0.000151815, -0, 0.000775141, -0.000697815, 0.00019223, 0, 0, -0.00038332, -0.000151815, -0, 0.000775141, -0.000697815, 0.00019223, 
pos_: 450
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: -0.000699448, 0.000179388, 0, 0, -0.000378951, -0.000148882, -0, 0.000751864, -0.000699448, 0.000179388, 0, 0, -0.000378951, -0.000148882, -0, 0.000751864, -0.000699448, 0.000179388, 0, 0, -0.000378951, -0.000148882, -0, 0.000751864, -0.000699448, 0.000179388, 0, 0, -0.000378951, -0.000148882, -0, 0.000751864, -0.000699448, 0.000179388, 0, 0, -0.000378951, -0.000148882, -0, 0.000751864, -0.000699448, 0.000179388, 0, 0, -0.000378951, -0.000148882, -0, 0.000751864, -0.000699448, 0.000179388, 
pos_: 0
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: -0.00070315, 0.000161404, 0, 0, -0.000373599, -0.00014487, -0, 0.00071896, -0.00070315, 0.000161404, 0, 0, -0.000373599, -0.00014487, -0, 0.00071896, -0.00070315, 0.000161404, 0, 0, -0.000373599, -0.00014487, -0, 0.00071896, -0.00070315, 0.000161404, 0, 0, -0.000373599, -0.00014487, -0, 0.00071896, -0.00070315, 0.000161404, 0, 0, -0.000373599, -0.00014487, -0, 0.00071896, -0.00070315, 0.000161404, 0, 0, -0.000373599, -0.00014487, -0, 0.00071896, -0.00070315, 0.000161404, 
pos_: 50
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: -0.000704959, 0.000135543, 0, 0, -0.000365846, -0.000135219, -0, 0.000672773, -0.000707322, 0.000138147, 0, 0.000560951, -0.000366117, -0.00014138, -0, 0.000672789, -0.000716788, 0, 0, 0.000561913, -0.000370307, -0.000151356, -0.00112056, 0, -0.000709279, 0.000138814, 0.00107351, 0, -0.000367276, -0.000141807, -0.00112032, 0.000674741, -0.000724287, 0, 0, 0.000562625, -0.00036673, -0, -0.00111521, 0, -0.000708261, 0.00013459, 0.00108984, 0, -0.000362189, -0.000128634, -0, 0.00067906, -0.000713718, 0.000139715, 
F0508 16:39:46.888921  8736 softmax_loss_layer.cpp:68] Check failed: label_value >= 0 (-2147483648 vs. 0) 
*** Check failure stack trace: ***
Aborted (core dumped)


[...]

longjon (Contributor) commented May 8, 2015

Ah, that's what I was looking for; I forgot that those checks are only there in CPU mode. Have you checked, e.g., train_set_y.min() and train_set_y.max() right before calling set_input_arrays? (Does the problem also occur with all-zero labels?)

Testing first is normal (unless you set the option test_initialization: false); it's done so that you have an initial loss to compare to.
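(For reference, a minimal sketch of that sanity check, reusing the arrays from the snippets above and assuming NumPy is imported as np:)

y = train_set_y.astype(np.float32)
print 'label range:', y.min(), 'to', y.max()
print 'dtype:', y.dtype, 'C-contiguous:', y.flags['C_CONTIGUOUS']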

TJKlein (Author) commented May 8, 2015

Yes, I checked the input in Python with min/max; it's fine. When using all-zero input, same problem:

Labels: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 0
size_: 784
n_: 200
batch_size: 50
--------------------------------------------
Labels: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 100
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 2.03782, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 150
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 2.03782, 0, 2.0367, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 200
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 2.03782, 0, 2.0367, 1.0842e-19, 2.03457, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 250
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 2.03782, 0, 2.0367, 1.0842e-19, 2.03457, -3.68935e+19, 2.03154, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 300
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 2.03782, 0, 2.0367, 1.0842e-19, 2.03457, -3.68935e+19, 2.03154, 0, 2.02771, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 350
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 2.03782, 0, 2.0367, 1.0842e-19, 2.03457, -3.68935e+19, 2.03154, 0, 2.02771, 3.68935e+19, 2.02235, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 400
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 2.03782, 0, 2.0367, 1.0842e-19, 2.03457, -3.68935e+19, 2.03154, 0, 2.02771, 3.68935e+19, 2.02235, -2, 2.01512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 450
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 2.03782, 0, 2.0367, 1.0842e-19, 2.03457, -3.68935e+19, 2.03154, 0, 2.02771, 3.68935e+19, 2.02235, -2, 2.01512, 3.68935e+19, 2.0061, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 0
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 2.03782, 0, 2.0367, 1.0842e-19, 2.03457, -3.68935e+19, 2.03154, 0, 2.02771, 3.68935e+19, 2.02235, -2, 2.01512, 3.68935e+19, 2.0061, 0, 1.99555, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
pos_: 50
size_: 784
n_: 500
batch_size: 50
--------------------------------------------
Labels: 0, 2.03782, 0, 2.0367, 1.0842e-19, 2.03457, -3.68935e+19, 2.03154, 0, 2.02771, 3.68935e+19, 2.02235, -2, 2.01512, 3.68935e+19, 2.0061, 0, 1.99555, 0, 1.98344, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
F0508 17:07:42.873112  9129 softmax_loss_layer.cpp:68] Check failed: label_value >= 0 (-2147483648 vs. 0) 
*** Check failure stack trace: ***
Aborted (core dumped)

TJKlein (Author) commented May 8, 2015

Interestingly enough, the data seems to be initialized correctly. Here is what is printed after reset() is called:

500 labels: 
0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
200 labels: 
0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 

TJKlein (Author) commented May 9, 2015

I am getting closer to the source of the error. When I make the training set smaller, say from 500 to 250 elements, it works without problems. Just a reminder: I am working with 28x28 images, so the amount of memory required is not very large, and the machine(s) I use have plenty (128 GB).

evolu8 commented May 9, 2015

How about 257 items? You see where I'm going...

TJKlein (Author) commented May 9, 2015

Well, I was using a batch size of 50 so far, so the next bigger multiple of 50 is 300, which also gives me an error. However, when setting the batch size to 1, I also get an error with 250, although a different one (see below). I just noticed that I also get this error with a batch size of 50 and 250 training samples.

*** Error in `/home/tjk/anaconda/bin/python': double free or corruption (out): 0x0000000006856900 ***

Program received signal SIGABRT, Aborted.
0x00007ffff6d3ecc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6d3ecc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff6d420d8 in __GI_abort () at abort.c:89
#2  0x00007ffff6d7b394 in __libc_message (do_abort=do_abort@entry=1, 
    fmt=fmt@entry=0x7ffff6e89b28 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff6d8766e in malloc_printerr (ptr=<optimized out>, 
    str=0x7ffff6e89c58 "double free or corruption (out)", action=1) at malloc.c:4996
#4  _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3840
#5  0x00007fffea240a80 in caffe::CaffeFreeHost (ptr=0x6856900) at ./include/caffe/syncedmem.hpp:31
#6  0x00007fffea240552 in caffe::SyncedMemory::~SyncedMemory (this=0x62030a0, __in_chrg=<optimized out>)
    at src/caffe/syncedmem.cpp:11
#7  0x00007fffea263ccc in boost::checked_delete<caffe::SyncedMemory> (x=0x62030a0)
    at /usr/include/boost/checked_delete.hpp:34
#8  0x00007fffea263d60 in boost::detail::sp_counted_impl_p<caffe::SyncedMemory>::dispose (this=0x62030d0)
    at /usr/include/boost/smart_ptr/detail/sp_counted_impl.hpp:78
#9  0x00007fffeaf96902 in boost::detail::sp_counted_impl_p<caffe::Blob<float> >::dispose() ()
   from /home/tjk/caffe/python/caffe/_caffe.so
#10 0x00007fffeaf9e496 in std::vector<boost::shared_ptr<caffe::Blob<float> >, std::allocator<boost::shared_ptr<caffe::Blob<float> > > >::~vector() () from /home/tjk/caffe/python/caffe/_caffe.so
#11 0x00007fffea1c40b1 in caffe::Net<float>::~Net (this=0x64c63b0, __in_chrg=<optimized out>)
    at ./include/caffe/net.hpp:28
#12 0x00007fffea1c4394 in caffe::Net<float>::~Net (this=0x64c63b0, __in_chrg=<optimized out>)
    at ./include/caffe/net.hpp:28
#13 0x00007fffeaf9626e in boost::detail::sp_counted_base::release() ()
   from /home/tjk/caffe/python/caffe/_caffe.so
#14 0x00007fffeafba12c in caffe::SGDSolver<float>::~SGDSolver() ()
   from /home/tjk/caffe/python/caffe/_caffe.so
#15 0x00007fffeaf95759 in boost::python::objects::pointer_holder<boost::shared_ptr<caffe::SGDSolver<float> >, caffe::SGDSolver<float> >::~pointer_holder() () from /home/tjk/caffe/python/caffe/_caffe.so
#16 0x00007fffe950c907 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.54.0
#17 0x00007ffff7aa4845 in subtype_dealloc (self=0x7fffbd69f418) at Objects/typeobject.c:1030
#18 0x00007ffff7a83047 in insertdict_by_entry (mp=0x7ffff7f67168, key=0x7ffff7e94cc0, 
    hash=<optimized out>, ep=<optimized out>, value=<optimized out>) at Objects/dictobject.c:519
#19 0x00007ffff7a8649c in insertdict (value=0x7ffff7da3a50 <_Py_NoneStruct>, hash=-3348361745023858585, 
    key=0x7ffff7e94cc0, mp=0x7ffff7f67168) at Objects/dictobject.c:556
#20 dict_set_item_by_hash_or_entry (value=0x7ffff7da3a50 <_Py_NoneStruct>, ep=0x0, 
    hash=-3348361745023858585, key=0x7ffff7e94cc0, op=0x7ffff7f67168) at Objects/dictobject.c:765
#21 PyDict_SetItem (op=0x7ffff7f67168, key=0x7ffff7e94cc0, value=0x7ffff7da3a50 <_Py_NoneStruct>)
    at Objects/dictobject.c:818
#22 0x00007ffff7a8976d in _PyModule_Clear (m=<optimized out>) at Objects/moduleobject.c:139
#23 0x00007ffff7b03def in PyImport_Cleanup () at Python/import.c:473
#24 0x00007ffff7b10ccb in Py_Finalize () at Python/pythonrun.c:459
#25 0x00007ffff7b26f75 in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:670
#26 0x00007ffff6d29ec5 in __libc_start_main (main=0x400710 <main>, argc=2, argv=0x7fffffffdf68, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdf58)
    at libc-start.c:287
#27 0x0000000000400649 in _start ()

TJKlein (Author) commented May 9, 2015

Interesting: when setting the number of training elements to 257 with a batch size of 1, I also get the good old label error:

F0509 07:50:16.891880 13828 softmax_loss_layer.cpp:68] Check failed: label_value >= 0 (-2147483648 vs. 0) 
*** Check failure stack trace: ***
Aborted (core dumped)

evolu8 commented May 9, 2015

and 255?



TJKlein (Author) commented May 9, 2015

That's more or less fine; I just get the following:

*** Error in `/home/tjk/anaconda/bin/python': double free or corruption (out): 0x00000000087f2000 ***
Aborted (core dumped)

TJKlein (Author) commented May 9, 2015

I think it is a problem with Boost.Python.

When calling set_input_arrays(...), which itself calls reset(...), and printing the labels from memory to the console within that call, everything is fine.
However, when the next thing I do is call forward(...) from Python and print the labels from memory to the console, I already get invalid results (but only when the number of training/test samples exceeds 255 elements).

TJKlein (Author) commented May 10, 2015

For the time being, I make a deep copy of the array, and then it works fine.

evolu8 commented May 10, 2015

Good to hear.

If you've got working code I'd really appreciate seeing it. Any chance we could get a complete script of your working 'deep copy' solution?



TJKlein (Author) commented May 11, 2015

Sure, you can have a look. It's just minor modifications.

https://github.com/TJKlein/caffe/commit/5f1bb97a587043dbe0892466b866abfe4c76804c
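(For readers who want a Python-side workaround instead of patching the layer: the idea discussed above amounts to keeping your own persistent, C-contiguous float32 copies alive for as long as the solver uses them, since the stock MemoryDataLayer of this era stores raw pointers into the arrays rather than copying them. A minimal sketch, reusing the variable names from the earlier snippets:)

import numpy as np

# Keep these references alive for the whole lifetime of the solver;
# otherwise the layer's raw pointers end up pointing at freed memory.
train_x = np.ascontiguousarray(train_set_x.reshape(-1, 1, 28, 28), dtype=np.float32)
train_y = np.ascontiguousarray(train_set_y, dtype=np.float32)
test_x = np.ascontiguousarray(test_set_x.reshape(-1, 1, 28, 28), dtype=np.float32)
test_y = np.ascontiguousarray(test_set_y, dtype=np.float32)

solver.net.set_input_arrays(train_x, train_y)
solver.test_nets[0].set_input_arrays(test_x, test_y)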

@ashudeep

I have been facing the same problem for a few days too. I also believe that the problem is with the Python part and/or the memory layer part, since it works when I convert my data to an LMDB and run it from the terminal.

So @TJKlein, I made the line-by-line changes from your commit and did a 'make all'. All went fine until I ran 'make runtest' and got the following failed test:

[----------] 1 test from LayerFactoryTest/1, where TypeParam = caffe::DoubleCPU
[ RUN      ] LayerFactoryTest/1.TestCreateLayer
*** glibc detected *** .build_release/test/test_all.testbin: double free or corruption (!prev): 0x00000000074a1780 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x75be6)[0x2ac0075b2be6]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x2ac0075b798c]
/data/caffe/.build_release/test/../lib/libcaffe.so(_ZN5caffe15MemoryDataLayerIdED1Ev+0x42)[0x2ac00696e272]
/data/caffe/.build_release/test/../lib/libcaffe.so(_ZN5caffe15MemoryDataLayerIdED0Ev+0x9)[0x2ac00696e5b9]
.build_release/test/test_all.testbin[0x53e8ca]
.build_release/test/test_all.testbin[0x7d3faa]
.build_release/test/test_all.testbin[0x7c9839]
.build_release/test/test_all.testbin[0x7c9917]
.build_release/test/test_all.testbin[0x7c9a55]
.build_release/test/test_all.testbin(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x2f5)[0x7c9dc5]
.build_release/test/test_all.testbin[0x7d3b2a]
.build_release/test/test_all.testbin[0x7c90a1]
.build_release/test/test_all.testbin[0x4bdf22]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x2ac00755bead]
.build_release/test/test_all.testbin[0x4c3249]
======= Memory map: ========
00400000-008ff000 r-xp 00000000 08:08 17826924                           /data/caffe/.build_release/test/test_all.testbin
00afe000-00b4c000 rw-p 004fe000 08:08 17826924                           /data/caffe/.build_release/test/test_all.testbin
00b4c000-00b4d000 rw-p 00000000 00:00 0 
01e8d000-07726000 rw-p 00000000 00:00 0                                  [heap]
200000000-200100000 rw-s 55acc000 00:05 11997                            /dev/nvidiactl
....
<a long list here>
.....
make: *** [runtest] Aborted

So, are you sure the problem is really solved for you?
I haven't run it on a Python example of a MemoryDataLayer network yet; will do so very soon.

TJKlein (Author) commented May 13, 2015

That's weird. For me it works well (on three different machines), and it passes all the tests. Technically, I am working on a different branch, but I don't see why that should have any effect on it.

@ashudeep

Oh, so now I did a

make clean
make all -j8
make pycaffe
make runtest

and got no errors. Seems like the Makefile isn't that complete after all.
And now even my Python scripts that set a large dataset as input arrays work!
Thanks for the neat commit with the solution.
@shelhamer Please consider merging it.

@grandbander

Thanks for the fix. One minor addition I made to the fix: initializing data_ and labels_ to NULL in the constructor; then my tests passed. Your tests passed because, luckily, your memory allocator initializes the memory to 0.

mingtop commented Nov 22, 2015

I don't know why the label is constrained to 1D.
Is there an alternative to solver.net.set_input_arrays(Xi, Yi)? Yi is 10x2x1x1, but I can only get a label of 10x1.
------------ some code of MemoryDataLayer -------------------

template <typename Dtype>
void MemoryDataLayer<Dtype>::DataLayerSetUp(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  batch_size_ = this->layer_param_.memory_data_param().batch_size();
  channels_ = this->layer_param_.memory_data_param().channels();
  height_ = this->layer_param_.memory_data_param().height();
  width_ = this->layer_param_.memory_data_param().width();
  size_ = channels_ * height_ * width_;
  CHECK_GT(batch_size_ * size_, 0) <<
      "batch_size, channels, height, and width must be specified and"
      " positive in memory_data_param";
  vector<int> label_shape(1, batch_size_);
  top[0]->Reshape(batch_size_, channels_, height_, width_);
  top[1]->Reshape(label_shape);
  added_data_.Reshape(batch_size_, channels_, height_, width_);
  added_label_.Reshape(label_shape);
  data_ = NULL;
  labels_ = NULL;
  added_data_.cpu_data();
  added_label_.cpu_data();
}

Why is top[1] reshaped with label_shape?
My alternative idea: can I call solver.net.set_input_arrays(Xi, Yi) two times by just copying the layers in the train.prototxt file? I tried, but it does not work.
My question is: can I call solver.net.set_input_arrays(Xi, Yi) two times during training?
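(Editorial note: as the setup code above shows, the MemoryData label top is reshaped to a plain batch_size-long vector, so set_input_arrays cannot carry a 10x2 label through it. One possible workaround, in line with the HDF5 route mentioned at the top of this thread, is to feed multi-dimensional labels through an HDF5Data layer instead; a minimal sketch with illustrative file names, assuming h5py is available:)

import h5py
import numpy as np

# Xi: images of shape (10, channels, height, width); Yi: labels of shape (10, 2).
with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data', data=Xi.astype(np.float32))
    f.create_dataset('label', data=Yi.astype(np.float32))

# The HDF5Data layer's source parameter points to a text file listing the HDF5 files.
with open('train_h5_list.txt', 'w') as f:
    f.write('train.h5\n')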

baidut commented Jan 6, 2016

I have faced the same problem when using SegNet to do semantic segmentation:
Check failed: status == CUBLAS_STATUS_SUCCESS (11 vs. 0) CUBLAS_STATUS_INTERNAL_ERROR.

The reason turned out to be mislabeled ground truth: the labels should be 0 and 1, but what I supplied was 0 and 255. After I replaced the ground truth images, it works perfectly.

@zeevikal

Hi @baidut, I have the same problem. I tried to change the class_weighting of the loss layer, but it still doesn't work. My dataset labels are grayscale (0 and 1). What am I doing wrong?

baidut commented Sep 16, 2016

Hi @zeevikal,
If your label image format is *.jpg (lossy compression), try using *.png (lossless compression) to store the 0s and 1s.
In MATLAB:

I = imread('old_label.jpg');
imwrite(im2bw(I), 'new_label.png');

If not, your problem should not be related to the dataset labels; please try other solutions.
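(A rough Python equivalent of the MATLAB snippet above, for reference; it assumes Pillow and NumPy and uses illustrative file names:)

import numpy as np
from PIL import Image

label = np.array(Image.open('old_label.jpg').convert('L'))
binary = (label > 127).astype(np.uint8)        # threshold to 0/1, roughly what im2bw does
Image.fromarray(binary).save('new_label.png')  # PNG is lossless, so the 0/1 values survive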

@zeevikal

Hi @baidut, my label images are *.png (grayscale, without an alpha channel). I didn't understand "to store 0s and 1s"; what do you mean? By the way, I'm coding in Python. Thanks a lot!

@nidetaoge

@baidut Hi, I'm facing the same problem. I have checked that my ground truth image is a 0/1 image saved as .png, but when I use SoftmaxLoss I still hit this problem; when I use SigmoidCrossEntropy the problem is gone. Can you help me with this problem?

@shelhamer (Member)

Closing according to #5528.
