
Receiver: Fix data race in writer (label set copying) #5492

Closed

Conversation

@matej-g (Collaborator) commented Jul 12, 2022

Signed-off-by: Matej Gera [email protected]

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

This is a continuation of fixes detected during work on #4664.

We appear to be hitting a data race due to a concurrent read / write of the write request's labels. See the race detector output:

 WARNING: DATA RACE
 Read at 0x00c000cc4f00 by goroutine 40:
 runtime.racereadrange()
 <autogenerated>:1 +0x1b
 github.com/thanos-io/thanos/pkg/store/storepb.(*WriteRequest).Size()
 /go/src/github.com/thanos-io/thanos/pkg/store/storepb/rpc.pb.go:2011 +0x224
 github.com/thanos-io/thanos/pkg/store/storepb.(*WriteRequest).Marshal()
 /go/src/github.com/thanos-io/thanos/pkg/store/storepb/rpc.pb.go:1204 +0x30
 google.golang.org/protobuf/internal/impl.legacyMarshal()
 /go/pkg/mod/google.golang.org/[email protected]/internal/impl/legacy_message.go:404 +0xbe
 google.golang.org/protobuf/proto.MarshalOptions.marshal()
 /go/pkg/mod/google.golang.org/[email protected]/proto/encode.go:163 +0x3b9
 google.golang.org/protobuf/proto.MarshalOptions.MarshalAppend()
 /go/pkg/mod/google.golang.org/[email protected]/proto/encode.go:122 +0xa5
 github.com/golang/protobuf/proto.marshalAppend()
 /go/pkg/mod/github.com/golang/[email protected]/proto/wire.go:40 +0xe4
 github.com/golang/protobuf/proto.Marshal()
 /go/pkg/mod/github.com/golang/[email protected]/proto/wire.go:23 +0x67
 google.golang.org/grpc/encoding/proto.codec.Marshal()
 /go/pkg/mod/google.golang.org/[email protected]/encoding/proto/proto.go:45 +0x68
 google.golang.org/grpc/encoding/proto.(*codec).Marshal()
 <autogenerated>:1 +0x5b
 google.golang.org/grpc.encode()
 /go/pkg/mod/google.golang.org/[email protected]/rpc_util.go:594 +0x6a
 google.golang.org/grpc.prepareMsg()
 /go/pkg/mod/google.golang.org/[email protected]/stream.go:1603 +0x1ab
 google.golang.org/grpc.(*clientStream).SendMsg()
 /go/pkg/mod/google.golang.org/[email protected]/stream.go:784 +0x29b
 google.golang.org/grpc.invoke()
 /go/pkg/mod/google.golang.org/[email protected]/call.go:70 +0xfe
 github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.UnaryClientInterceptor.func1()
 /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/[email protected]/interceptors/client.go:22 +0x29a
 github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1.1.1()
 /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/[email protected]/chain.go:74 +0x112
 github.com/grpc-ecosystem/go-grpc-prometheus.(*ClientMetrics).UnaryClientInterceptor.func1()
 /go/pkg/mod/github.com/grpc-ecosystem/[email protected]/client_metrics.go:112 +0x133
 github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1.1.1()
 /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/[email protected]/chain.go:74 +0x112
 github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1()
 /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/[email protected]/chain.go:83 +0x188
 google.golang.org/grpc.(*ClientConn).Invoke()
 /go/pkg/mod/google.golang.org/[email protected]/call.go:35 +0x265
 github.com/thanos-io/thanos/pkg/store/storepb.(*writeableStoreClient).RemoteWrite()
 /go/src/github.com/thanos-io/thanos/pkg/store/storepb/rpc.pb.go:1124 +0xd4
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func4.2()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:643 +0x246
 github.com/thanos-io/thanos/pkg/tracing.DoInSpan()
 /go/src/github.com/thanos-io/thanos/pkg/tracing/tracing.go:93 +0x135
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func4()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:641 +0x41d
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward·dwrap·9()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:674 +0x58
 Previous write at 0x00c000cc4f00 by goroutine 117:
 github.com/thanos-io/thanos/pkg/store/labelpb.ReAllocZLabelsStrings()
 /go/src/github.com/thanos-io/thanos/pkg/store/labelpb/label.go:50 +0x147
 github.com/thanos-io/thanos/pkg/receive.(*Writer).Write()
 /go/src/github.com/thanos-io/thanos/pkg/receive/writer.go:80 +0x2865
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func3.1()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:591 +0xfe
 github.com/thanos-io/thanos/pkg/tracing.DoInSpan()
 /go/src/github.com/thanos-io/thanos/pkg/tracing/tracing.go:93 +0x135
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func3()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:590 +0x1ed
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward·dwrap·8()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:601 +0x58
 Goroutine 40 (running) created at:
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:607 +0xd67
 github.com/thanos-io/thanos/pkg/receive.(*Handler).replicate()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:748 +0x3d3
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func2.1()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:564 +0xda
 github.com/thanos-io/thanos/pkg/tracing.DoInSpan()
 /go/src/github.com/thanos-io/thanos/pkg/tracing/tracing.go:93 +0x135
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func2()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:563 +0x1cf
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward·dwrap·7()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:574 +0x58
 Goroutine 117 (running) created at:
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:586 +0x1078
 github.com/thanos-io/thanos/pkg/receive.(*Handler).replicate()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:748 +0x3d3
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func2.1()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:564 +0xda
 github.com/thanos-io/thanos/pkg/tracing.DoInSpan()
 /go/src/github.com/thanos-io/thanos/pkg/tracing/tracing.go:93 +0x135
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward.func2()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:563 +0x1cf
 github.com/thanos-io/thanos/pkg/receive.(*Handler).fanoutForward·dwrap·7()
 /go/src/github.com/thanos-io/thanos/pkg/receive/handler.go:574 +0x58

Based on the logs, my assumption is that in the handler we are re-using the *prompb.WriteRequest in the fanoutForward method - once for the local node Write and then for the remote forward in RemoteWrite. It seems that doing the reallocation shuffling with the labels can interfere with gRPC message marshalling, as both seem to be re-using the same underlying memory.

The suggestion here is to make a short-lived separate copy of the label set (if a ref is not available) before proceeding with the write on the local node. This should ensure we are no longer writing to the same object that we're reading from during marshalling.

I'm not 100% sure of the performance implications though, seeing that this is the main write path for receive.
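
For illustration, a minimal sketch of what such a short-lived copy could look like (the helper name and placement are illustrative only, not the PR's actual change; labelpb.ZLabel is the existing zero-copy label type from pkg/store/labelpb):

package example

import (
    "github.com/thanos-io/thanos/pkg/store/labelpb"
)

// copyZLabels is a hypothetical helper: it re-allocates every label string so
// the returned slice shares no memory with the incoming write request. The
// request can then be marshalled for the remote forward while the local write
// only touches this private copy.
func copyZLabels(in []labelpb.ZLabel) []labelpb.ZLabel {
    out := make([]labelpb.ZLabel, len(in))
    for i, l := range in {
        out[i] = labelpb.ZLabel{
            // string(append([]byte(nil), ...)) forces a fresh backing array
            // for each string instead of aliasing the request buffer.
            Name:  string(append([]byte(nil), l.Name...)),
            Value: string(append([]byte(nil), l.Value...)),
        }
    }
    return out
}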

Verification

Ran the E2E receiver tests on top of #5066; no data race is detected anymore.

@matej-g matej-g force-pushed the fix-data-race-receive-writer-labels branch from 8cc31b1 to 0743b8d on July 29, 2022 at 12:52
@matej-g (Collaborator, Author) commented Jul 29, 2022

I also added a benchmark comparing the realloc version with DeepCopy; it seems to be slightly slower than the current version:

go test -benchmem -benchtime=10s -run=^$ -tags integration -bench ^BenchmarkTransformWithAndWithoutCopy           fix-data-race-receive-writer-labels
goos: linux
goarch: amd64
pkg: github.com/thanos-io/thanos/pkg/store/labelpb
cpu: Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
BenchmarkTransformWithAndWithoutCopy/ZLabelsToPromLabels-12             1000000000               1.535 ns/op           0 B/op          0 allocs/op
BenchmarkTransformWithAndWithoutCopy/ZLabelsToPromLabelsWithRealloc-12               211          57434846 ns/op        96000000 B/op    2000000 allocs/op
BenchmarkTransformWithAndWithoutCopy/ZLabelsToPromLabelsWithDeepCopy-12              190          61866060 ns/op        128006144 B/op   2000001 allocs/op
PASS
ok      github.com/thanos-io/thanos/pkg/store/labelpb   41.895s
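
For context, a comparison benchmark along these lines could look roughly like the sketch below. This is not the PR's actual BenchmarkTransformWithAndWithoutCopy; genZLabels and the sub-benchmark names are made up for illustration, while labelpb.ZLabel and labelpb.ZLabelsToPromLabels are existing labelpb identifiers.

package labelpb_test

import (
    "fmt"
    "testing"

    "github.com/thanos-io/thanos/pkg/store/labelpb"
)

// genZLabels builds n synthetic labels (hypothetical helper).
func genZLabels(n int) []labelpb.ZLabel {
    out := make([]labelpb.ZLabel, 0, n)
    for i := 0; i < n; i++ {
        out = append(out, labelpb.ZLabel{
            Name:  fmt.Sprintf("name_%d", i),
            Value: fmt.Sprintf("value_%d", i),
        })
    }
    return out
}

func BenchmarkLabelTransform(b *testing.B) {
    in := genZLabels(1_000_000)

    b.Run("NoCopy", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            // Zero-copy view: the result shares memory with `in`.
            _ = labelpb.ZLabelsToPromLabels(in)
        }
    })

    b.Run("WithCopy", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            // Copy every string first, then convert; this extra allocation
            // is what the realloc / DeepCopy numbers above are measuring.
            cp := make([]labelpb.ZLabel, len(in))
            for j, l := range in {
                cp[j] = labelpb.ZLabel{
                    Name:  string(append([]byte(nil), l.Name...)),
                    Value: string(append([]byte(nil), l.Value...)),
                }
            }
            _ = labelpb.ZLabelsToPromLabels(cp)
        }
    })
}

The takeaway from the real numbers above is the same: the zero-copy conversion is effectively free, while any variant that re-allocates the strings pays roughly one allocation per label name and value.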

@matej-g matej-g force-pushed the fix-data-race-receive-writer-labels branch from 0743b8d to a78cf81 on July 29, 2022 at 13:02
@matej-g matej-g marked this pull request as ready for review July 29, 2022 13:02
@bwplotka (Member) left a comment


Hm, where do we modify them? Maybe we could instead stop modifying in place?

My worry is that the same hidden modification might be done if ref != 0 - we have no means to verify it. And IF that happens, I'd really prefer that we don't copy all labels - I don't think all of them are changed.

lset = labelpb.ZLabelsToPromLabels(t.Labels)
//
// Nevertheless, we have to make a deep copy, to prevent manipulating the request
// labels (these are re-used among different requests and could trigger a data race).
Member

Where do we change those?

Collaborator Author

I put it in the description above:

Based on the logs, my assumption is that in the handler we are re-using the *prompb.WriteRequest in the fanoutForward method - once for the local node Write and then for the remote forward in RemoteWrite. It seems that doing the reallocation shuffling with the labels can interfere with gRPC message marshalling, as both seem to be re-using the same underlying memory.

I also pasted the output from when the data race is detected.

@douglascamata (Contributor) commented Aug 4, 2022

The results seem to show quite a big difference (from 1.535 ns/op to 61866060 ns/op, which means about 61 ms), but only by reading the benchmark code can we see that this is the case of 1,000,000 labels, which is quite extreme. Also, the realloc itself is already around 57 ms...

I suspect the performance difference for scenarios with 50 labels or so would be unnoticeable. There might still be an interesting change in a high-throughput scenario.

@douglascamata (Contributor)

What if we protect the realloc for local write with a mutex and check it before the fanout gRPC writes? 🤔
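
For illustration, a hypothetical sketch of that mutex alternative (none of this is code from the PR, and the handler's real types and control flow differ): serialize the in-place realloc and the marshalling on a shared lock instead of copying the labels.

package example

import (
    "sync"

    "github.com/thanos-io/thanos/pkg/store/storepb"
)

// guardedRequest is a hypothetical wrapper: every goroutine that either
// mutates the request's labels (local write) or marshals it (remote forward)
// must run its work through withRequest, so the two can never overlap.
type guardedRequest struct {
    mu  sync.Mutex
    req *storepb.WriteRequest
}

// withRequest runs fn while holding the lock.
func (g *guardedRequest) withRequest(fn func(*storepb.WriteRequest) error) error {
    g.mu.Lock()
    defer g.mu.Unlock()
    return fn(g.req)
}

The obvious trade-off is that the fan-out sends would then serialize behind the local write's realloc, which is likely part of why the copy ended up looking more appealing in the review below.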

@douglascamata (Contributor) left a comment


This looks good to me. From what I understand, the deep copy added only about 4 ms (61 ms total) on top of the realloc (57 ms total) in the benchmark. Seems like a much more appealing solution than a mutex.

@bwplotka (Member) commented Aug 9, 2022

I am not fully bought into this honestly, I need to look deeper 🙈

stale bot commented Oct 30, 2022

Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days.
Do you mind updating us on the status? Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next week, this issue will be closed (we can always reopen a PR if you get back to this!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Oct 30, 2022
stale bot commented Nov 12, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this Nov 12, 2022