storage: Raft gRPC unmarshal errors #7258

cuongdo · 2016-06-16T01:05:29Z

When running the 1->3 allocator test (see cockroachdb/cockroach-prod#89), I see one of the following log entries every couple of seconds:

I160616 00:51:50.965263 storage/raft_transport.go:261 remote node 2 at cdocr-small-1to3-cockroach-2:26257 closed Raft transport with error: rpc error: code = 13 desc = grpc: failed to unmarshal the received message proto: illegal wireType 6

I160616 00:50:45.650723 storage/raft_transport.go:261 remote node 3 at cdocr-small-1to3-cockroach-3:26257 closed Raft transport with error: rpc error: code = 13 desc = grpc: failed to unmarshal the received message proto: illegal wireType 7

I160616 00:52:00.180063 storage/raft_transport.go:261 remote node 2 at cdocr-small-1to3-cockroach-2:26257 closed Raft transport with error: rpc error: code = 13 desc = grpc: failed to unmarshal the received message unexpected EOF

I160616 00:51:37.233888 storage/raft_transport.go:261 remote node 2 at cdocr-small-1to3-cockroach-2:26257 closed Raft transport with error: rpc error: code = 13 desc = grpc: failed to unmarshal the received message proto: Snapshot: wiretype end group for non-group

I160616 00:50:32.026801 storage/raft_transport.go:261 remote node 3 at cdocr-small-1to3-cockroach-3:26257 closed Raft transport with error: rpc error: code = 13 desc = grpc: failed to unmarshal the received message proto: Snapshot: illegal tag -1620430216 (wire type 21396296641)

cc @tamird & @tschottdorf who might have extra context

This is happening as recently as 3acc834 and is unaffected by #7252, another fix to (*RaftTransport).processQueue().

The text was updated successfully, but these errors were encountered:

bdarnell · 2016-06-16T19:58:46Z

Could it be a thread-safety issue? We may start up multiple goroutines writing to the same snapshot stream. This was fine in the original version where each async task started a new stream, but now that we reuse a stream we may need a mutex or other synchronization.

tamird · 2016-06-16T20:18:38Z

Yep, I bet this is it. I'll refactor the transport to use a separate queue for snapshots, which should make everything simpler here.

cuongdo · 2016-06-16T20:28:24Z

I later re-ran my test with a binary built using the -race flag. There
were no panics despite the logs containing many unmarshal errors. Wouldn't
there have been a panic if this was a data race?

On Thu, Jun 16, 2016 at 4:18 PM Tamir Duberstein [email protected]
wrote:

Yep, I bet this is it. I'll refactor the transport to use a separate queue
for snapshots, which should make everything simpler here.

—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
#7258 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ABffptn1PLzNSDT3KKFh1tt62kAykd27ks5qMa-mgaJpZM4I2767
.

tamird · 2016-06-16T20:35:51Z

The race detector is also not infallible. This might not be a "real" data race; perhaps everything is data-safe, but results in interleaving of data on the wire.

cuongdo self-assigned this Jun 16, 2016

cuongdo changed the title ~~storage: gRPC unmarshal errors~~ storage: Raft gRPC unmarshal errors Jun 16, 2016

tamird self-assigned this Jun 16, 2016

tamird mentioned this issue Jun 16, 2016

storage/raft_transport: use a separate queue for each message type #7299

Merged

tamird closed this as completed in e4be3d8 Jun 17, 2016

tamird mentioned this issue Jun 20, 2016

storage: 1->3 node replica repair progresses very slowly #7161

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: Raft gRPC unmarshal errors #7258

storage: Raft gRPC unmarshal errors #7258

cuongdo commented Jun 16, 2016

bdarnell commented Jun 16, 2016

tamird commented Jun 16, 2016

cuongdo commented Jun 16, 2016

tamird commented Jun 16, 2016

storage: Raft gRPC unmarshal errors #7258

storage: Raft gRPC unmarshal errors #7258

Comments

cuongdo commented Jun 16, 2016

bdarnell commented Jun 16, 2016

tamird commented Jun 16, 2016

cuongdo commented Jun 16, 2016

tamird commented Jun 16, 2016