Can't serialize when protobuf size > 2gb #2006

Closed · ghost opened this issue Feb 28, 2015 · 6 comments

@ghost

ghost commented Feb 28, 2015

I'm getting segfaults from WriteProtoToBinaryFile with large models. It seems that protobuf simply doesn't support writing messages greater than 2 GB (marbl/harvest-tools#3). Does anyone have a workaround they would suggest? It would be nice if we at least added a check on the message size to give a nicer error message.
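
For example, a minimal sketch of such a check (assuming `ByteSizeLong()`, which only newer protobuf releases provide; the `CheckedWriteProtoToBinaryFile` wrapper is hypothetical, not existing caffe code):

```cpp
#include <climits>

#include "caffe/util/io.hpp"          // caffe::WriteProtoToBinaryFile
#include "glog/logging.h"             // CHECK_LE
#include "google/protobuf/message.h"  // google::protobuf::Message

// Guard serialization up front: protobuf's coded streams use 32-bit
// signed sizes internally, so anything over INT_MAX bytes fails.
// NOTE: ByteSizeLong() only exists in newer protobuf releases; the
// ByteSize() in protobuf 2.x returns an int that itself overflows past
// 2 GB, which is why the check must run before serialization rather
// than relying on a failure afterwards.
void CheckedWriteProtoToBinaryFile(const google::protobuf::Message& proto,
                                   const char* filename) {
  const size_t byte_size = proto.ByteSizeLong();
  CHECK_LE(byte_size, static_cast<size_t>(INT_MAX))
      << "Message is " << byte_size << " bytes, but protobuf cannot "
      << "serialize messages larger than 2GB; split the snapshot instead.";
  caffe::WriteProtoToBinaryFile(proto, filename);
}
```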

Here are the details:

```
I0228 14:22:15.306717 23672 solver.cpp:355] Snapshotting to /snapshots/caffe_iter_10.caffemodel

Program received signal SIGSEGV, Segmentation fault.
0x000000000047fa74 in caffe::LayerParameter::GetCachedSize (
    this=0x903d621f263c2284) at .build_debug/src/caffe/proto/caffe.pb.h:2063
```

The backtrace from gdb:

```
#0  0x000000000047fa74 in caffe::LayerParameter::GetCachedSize (
    this=0x903d621f263c2284) at .build_debug/src/caffe/proto/caffe.pb.h:2063
#1  0x000000000048cedd in google::protobuf::internal::WireFormatLite::WriteMessageNoVirtualToArray<caffe::LayerParameter> (field_number=2, value=...,
    target=0x2305c18b "<\v\a\237\274>\025\254\274r\376\245<g̪<\003`\332<n\345\061=\241\374\202\273>ֶ\274J2|=\201\300\326\067\203\211.=\212\035\371\274\224\337B\275\025&ɼ") at /usr/include/google/protobuf/wire_format_lite_inl.h:708
#2  0x0000000000433cbe in caffe::NetParameter::SerializeWithCachedSizesToArray (this=0x7fffffffd160,
    target=0x2305c18a "\022<\v\a\237\274>\025\254\274r\376\245<g̪<\003`\332<n\345\061=\241\374\202\273>ֶ\274J2|=\201\300\326\067\203\211.=\212\035\371\274\224\337B\275\025&ɼ") at .build_debug/src/caffe/proto/caffe.pb.cc:3530
#3  0x00007ffff2094d0c in google::protobuf::MessageLite::SerializePartialToCodedStream(google::protobuf::io::CodedOutputStream*) const () from /usr/lib/x86_64-linux-gnu/libprotobuf.so.8
#4  0x00007ffff2094dc5 in google::protobuf::MessageLite::SerializeToCodedStream(google::protobuf::io::CodedOutputStream*) const () from /usr/lib/x86_64-linux-gnu/libprotobuf.so.8
#5  0x00007ffff2094f01 in google::protobuf::MessageLite::SerializeToZeroCopyStream(google::protobuf::io::ZeroCopyOutputStream*) const () from /usr/lib/x86_64-linux-gnu/libprotobuf.so.8
#6  0x00007ffff20ea20b in google::protobuf::Message::SerializeToOstream(std::ostream*) const () from /usr/lib/x86_64-linux-gnu/libprotobuf.so.8
#7  0x00000000004a98fe in caffe::WriteProtoToBinaryFile (proto=...,
    filename=0x193f2c68 "/snapshots/caffe_iter_10.caffemodel") at src/caffe/util/io.cpp:66
#8  0x0000000000497249 in caffe::Solver::Snapshot (this=0x50f9e90) at src/caffe/solver.cpp:356
```
@shelhamer
Member

Yeah -- this is a limitation commented on in #1756. Three approaches suggested there are

  1. serialize parameters into separate binary protos
  2. switch to HDF5 for parameter serialization
  3. take a closer look at Cap'n Proto (#1762)

(1) could be a quick workaround if you need it, while (2) seems more sustainable.
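
As a rough illustration of (1), assuming the `NetParameter.layer` field and caffe's `WriteProtoToBinaryFile` helper (the per-layer file naming here is made up for illustration, not an existing caffe convention):

```cpp
#include <string>

#include "caffe/proto/caffe.pb.h"  // NetParameter, LayerParameter
#include "caffe/util/io.hpp"       // caffe::WriteProtoToBinaryFile

// Write each layer's parameters as its own binary proto so that no
// single message approaches protobuf's 2GB ceiling.
void SnapshotPerLayer(const caffe::NetParameter& net_param,
                      const std::string& prefix) {
  for (int i = 0; i < net_param.layer_size(); ++i) {
    const caffe::LayerParameter& layer = net_param.layer(i);
    if (layer.blobs_size() == 0) continue;  // no parameters to save
    const std::string path = prefix + "." + layer.name() + ".caffemodel";
    caffe::WriteProtoToBinaryFile(layer, path.c_str());
  }
}
```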

@shelhamer changed the title from "Caffe segfaults when protobuf size > 2gb" to "Can't serialize when protobuf size > 2gb" on Feb 28, 2015
@ghost
Author

ghost commented Mar 1, 2015

Okay, #1756 leaves something to be desired in terms of checks at the time of writing a blob, which seems to be the more likely time to encounter this issue. Message types have a ByteSize() method that could be checked, although its computation is not free and its result is not safe from overflow above 2 GB. (2) and (3) would work, but migrating away from protobufs seems a little extreme given how central they are to the caffe ecosystem. For now, I stopped serializing layers containing duplicate shared parameter blobs, which decreased my writes by a factor of 4 and brought me under the limit. Thanks for the quick reply; it helped a lot in my decision making. Somehow I missed those other issues despite googling several times.
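
That workaround can be sketched roughly like this (the `non_owner_layers` set is supplied by the caller; caffe tracks blob ownership in `Net::param_owners_`, but that bookkeeping is omitted here):

```cpp
#include <set>
#include <string>

#include "caffe/proto/caffe.pb.h"  // NetParameter, LayerParameter

// Drop parameter blobs from layers that merely share weights owned by
// another layer, shrinking the snapshot below protobuf's 2GB ceiling.
// The caller must know which layers are non-owners; on load, weight
// sharing re-creates the dropped copies from the owner's blobs.
void StripSharedBlobs(caffe::NetParameter* net_param,
                      const std::set<std::string>& non_owner_layers) {
  for (int i = 0; i < net_param->layer_size(); ++i) {
    caffe::LayerParameter* layer = net_param->mutable_layer(i);
    if (non_owner_layers.count(layer->name()) > 0) {
      layer->clear_blobs();
    }
  }
}
```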

ghost closed this as completed Mar 1, 2015
@shelhamer
Member

Right, there is an open issue for weight sharing polish that includes saving / loading / filling only the owner of shared weights, as you highlighted. Send a PR if you can and then we'll check that off.

@ghost
Author

ghost commented Apr 3, 2015

@shelhamer Okay, I've opened up some of my code here: https://github.com/Russell91/nlp_caffe

There's really too much to submit as a single pull request, but many of the weight sharing issues have been solved by accumulating diffs in a master buffer as you go through the backward() calls. If you take a look and let me know what you would want to see in a pull request to dev, I'll try to put something together.
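
A minimal sketch of that accumulation idea, assuming plain CPU blobs (`AccumulateSharedDiff` is a hypothetical name, not nlp_caffe's actual code):

```cpp
#include "caffe/blob.hpp"

// Accumulate a shared copy's gradient into the owner blob's diff (the
// "master buffer"), so every layer sharing the weights contributes to a
// single gradient and only the owner's parameters ever need saving.
template <typename Dtype>
void AccumulateSharedDiff(caffe::Blob<Dtype>* owner,
                          const caffe::Blob<Dtype>& shared_copy) {
  Dtype* owner_diff = owner->mutable_cpu_diff();
  const Dtype* copy_diff = shared_copy.cpu_diff();
  for (int i = 0; i < owner->count(); ++i) {
    owner_diff[i] += copy_diff[i];  // sum gradients into the owner
  }
}
```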

@bhack
Contributor

bhack commented Apr 3, 2015

@shelhamer Have you already evaluated Google FlatBuffers?

@shelhamer
Member

Closing as solved by #2836, although it is not yet the default due to the issue raised in #2885.
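
For reference, the HDF5 route sidesteps the limit because each parameter blob becomes its own dataset rather than one field in a single giant message. A minimal sketch using the HDF5 "lite" API (the dataset naming is an assumption, not necessarily the layout #2836 adopted):

```cpp
#include <string>
#include <vector>

#include "hdf5.h"
#include "hdf5_hl.h"

// Store one parameter blob as its own float dataset. HDF5 has no 2GB
// ceiling on files or datasets, and each dataset is written
// independently, so a large net never has to be serialized in one piece.
void SaveBlobToHDF5(hid_t file_id, const std::string& dataset_name,
                    const std::vector<float>& data) {
  const hsize_t dims[1] = {data.size()};
  herr_t status = H5LTmake_dataset_float(file_id, dataset_name.c_str(),
                                         1, dims, data.data());
  if (status < 0) {
    // Handle the write failure as appropriate for the caller.
  }
}
```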
