Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: Raft gRPC unmarshal errors #7258

Closed
cuongdo opened this issue Jun 16, 2016 · 4 comments
Closed

storage: Raft gRPC unmarshal errors #7258

cuongdo opened this issue Jun 16, 2016 · 4 comments
Assignees

Comments

@cuongdo
Copy link
Contributor

cuongdo commented Jun 16, 2016

When running the 1->3 allocator test (see cockroachdb/cockroach-prod#89), I see one of the following log entries every couple of seconds:

I160616 00:51:50.965263 storage/raft_transport.go:261 remote node 2 at cdocr-small-1to3-cockroach-2:26257 closed Raft transport with error: rpc error: code = 13 desc = grpc: failed to unmarshal the received message proto: illegal wireType 6

I160616 00:50:45.650723 storage/raft_transport.go:261 remote node 3 at cdocr-small-1to3-cockroach-3:26257 closed Raft transport with error: rpc error: code = 13 desc = grpc: failed to unmarshal the received message proto: illegal wireType 7

I160616 00:52:00.180063 storage/raft_transport.go:261 remote node 2 at cdocr-small-1to3-cockroach-2:26257 closed Raft transport with error: rpc error: code = 13 desc = grpc: failed to unmarshal the received message unexpected EOF

I160616 00:51:37.233888 storage/raft_transport.go:261 remote node 2 at cdocr-small-1to3-cockroach-2:26257 closed Raft transport with error: rpc error: code = 13 desc = grpc: failed to unmarshal the received message proto: Snapshot: wiretype end group for non-group

I160616 00:50:32.026801 storage/raft_transport.go:261 remote node 3 at cdocr-small-1to3-cockroach-3:26257 closed Raft transport with error: rpc error: code = 13 desc = grpc: failed to unmarshal the received message proto: Snapshot: illegal tag -1620430216 (wire type 21396296641)

cc @tamird & @tschottdorf who might have extra context

This is happening as recently as 3acc834 and is unaffected by #7252, another fix to (*RaftTransport).processQueue().

@cuongdo cuongdo self-assigned this Jun 16, 2016
@cuongdo cuongdo changed the title storage: gRPC unmarshal errors storage: Raft gRPC unmarshal errors Jun 16, 2016
@bdarnell
Copy link
Contributor

Could it be a thread-safety issue? We may start up multiple goroutines writing to the same snapshot stream. This was fine in the original version where each async task started a new stream, but now that we reuse a stream we may need a mutex or other synchronization.

@tamird
Copy link
Contributor

tamird commented Jun 16, 2016

Yep, I bet this is it. I'll refactor the transport to use a separate queue for snapshots, which should make everything simpler here.

@tamird tamird self-assigned this Jun 16, 2016
@cuongdo
Copy link
Contributor Author

cuongdo commented Jun 16, 2016

I later re-ran my test with a binary built using the -race flag. There
were no panics despite the logs containing many unmarshal errors. Wouldn't
there have been a panic if this was a data race?

On Thu, Jun 16, 2016 at 4:18 PM Tamir Duberstein [email protected]
wrote:

Yep, I bet this is it. I'll refactor the transport to use a separate queue
for snapshots, which should make everything simpler here.


You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
#7258 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ABffptn1PLzNSDT3KKFh1tt62kAykd27ks5qMa-mgaJpZM4I2767
.

@tamird
Copy link
Contributor

tamird commented Jun 16, 2016

The race detector is also not infallible. This might not be a "real" data race; perhaps everything is data-safe, but results in interleaving of data on the wire.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants