rangefeed: improve memory accounting for event queue #120347
Conversation
(force-pushed from d8eabab to ecc404a)
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
```go
	randomlyChosenEvent string,
	randomlyChosenLogicalOp string,
) {
	// Opt out sst event since it requires testing.T to call
```
The benchmark randomly generates lots of events, with every type of event equally probable, which is not true in production (for example, initRTS should happen much less frequently than logicalOps). I will improve the benchmark with a weighted random distribution soon.
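For illustration, a minimal sketch of what a weighted choice in the benchmark could look like; the event names and weights below are placeholders rather than the distribution actually used in the PR, and `rng` is a `*rand.Rand` from `math/rand`:

```go
// pickEventType returns an event kind for the benchmark, biased so that rare
// events (e.g. initRTS) are generated far less often than logical ops.
// Weights are illustrative only.
func pickEventType(rng *rand.Rand) string {
	weights := []struct {
		kind   string
		weight int
	}{
		{"logicalOps", 90},
		{"ct", 8},
		{"initRTS", 1},
		{"sst", 1},
	}
	total := 0
	for _, w := range weights {
		total += w.weight
	}
	n := rng.Intn(total)
	for _, w := range weights {
		if n < w.weight {
			return w.kind
		}
		n -= w.weight
	}
	return weights[len(weights)-1].kind // unreachable
}
```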
(force-pushed from 416a9f8 to a11824c)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/rangefeed/event_size.go
line 240 at r2 (raw file):
```go
// data and was already accounted in MVCCCommitIntentOp.
currMemUsage := mvccCommitIntentOp
currMemUsage += int64(cap(txnID))
```
I'm assuming any underlying data here is non-overlapping. We might be over-accounting if some keys or values point at the same underlying data. But I don't think there is an easy way to track and avoid this.
pkg/kv/kvserver/rangefeed/event_size.go
line 352 at r2 (raw file):
```go
// event will be created.
futureMemUsage += rangefeedCheckpointOpMemUsage() * int64(len(ops))
return currMemUsage, futureMemUsage
```
In theory, max(currMemUsage, futureMemUsage) for each op could be pre-determined, given that they share the same underlying data and the only difference lies in the constant base struct sizes. The rationale behind the current approach is: instead of picking max(currMemUsage, futureMemUsage) for every op, we want to compare the current state before any events are published with the future state after all events are published. This can differ from picking max(currMemUsage, futureMemUsage) for each op. There could be some intermediate state where the first logicalOp has been published but the second one hasn't, but I'm ignoring that for now since the intermediate state ends soon after this consumeEvent finishes. I'm also assuming publishToOverlapping cannot be called concurrently (which seems true as of now), since otherwise there could be multiple sharedEvents in flight simultaneously.
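To make the distinction concrete, a small hypothetical sketch of the two ways to aggregate across the ops in one event; `opMemUsage` is an assumed per-op accounting helper, and the built-in `max` requires Go 1.21+:

```go
// compareStrategies contrasts bounding each op independently (perOpMax) with
// comparing the whole-event state before vs. after publishing (wholeEvent).
func compareStrategies(ops []enginepb.MVCCLogicalOp) (perOpMax, wholeEvent int64) {
	var totalCurr, totalFuture int64
	for _, op := range ops {
		curr, future := opMemUsage(op) // hypothetical per-op accounting
		perOpMax += max(curr, future)  // bound each op independently
		totalCurr += curr
		totalFuture += future
	}
	// The approach in this PR: compare the state before any event is published
	// with the state after all events are published.
	wholeEvent = max(totalCurr, totalFuture) // can differ from perOpMax
	return perOpMax, wholeEvent
}
```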
(force-pushed from ca3f5e4 to 79323d7)
Upon more reflection, my initial observation about only needing to account memory for one sharedEvent base struct per registration feels off. We end up making copies of sharedEvent, and they might all remain buffered in the registration buf channel. The memory for the base struct is released only after the registration sends it. I'm making another PR with a cleaner design that handles this. It pre-allocates for all current + future event base structs plus their underlying data. The underlying data does not get released until the very end, but we gradually release memory based on whether checkpoint events are published or the current event struct is gone. Let me know which one you would prefer.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/rangefeed/event_size.go
line 49 at r1 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Upon more reflection, my initial observation about only needing to account memory for one sharedEvent base struct per registration feels off. We end up making copies of sharedEvent, and they might all remain buffered in the registration buf channel. The memory for the base struct is released only after the registration sends it: https://github.com/cockroachdb/cockroach/blob/a7586d1e6d26eda5f9dad05f76396263632ef265/pkg/kv/kvserver/rangefeed/registry.go#L341. I will add a second commit for this and squash if you agree.
You're right that we create a new `sharedEvent` copy for every registration, but I think we can ignore it. A `sharedEvent` is only 16 bytes (two pointers), so the overhead is fairly small.

It's also worth considering that a buffered channel will use memory as if it was full regardless of the number of events that it contains -- in other words, a `chan(*sharedEvent, 4096)` will use 8*4096 bytes of memory (32 KB) even when it's empty. When we fill it up, we have (8+16)*4096 bytes (96 KB) to account for the actual `sharedEvent` structs. If we wanted to, we could cut this to a fixed 16*4096 bytes (64 KB) by dropping the pointer and using `chan(sharedEvent, 4096)`, which would eliminate half of the overhead of the `sharedEvent` struct, but I don't think it matters all that much. Each registration also has a lot of other overhead like the goroutine stack and gRPC buffers and such which is likely larger than this and which isn't accounted for at all.

This is all to say that I don't think we should add a lot of complexity here just to account for these tiny `sharedEvents`. If we later find that we need it we can deal with it then.
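As an aside, a small standalone snippet that reproduces the arithmetic above, assuming `sharedEvent` is a two-pointer struct on a 64-bit platform; the type names are stand-ins for the real rangefeed types:

```go
package main

import (
	"fmt"
	"unsafe"
)

// Stand-ins for the rangefeed types; only the pointer layout matters here.
type event struct{}
type SharedBudgetAllocation struct{}

type sharedEvent struct {
	event *event
	alloc *SharedBudgetAllocation
}

func main() {
	const bufSize = 4096
	ptr := unsafe.Sizeof((*sharedEvent)(nil)) // 8 bytes on 64-bit
	val := unsafe.Sizeof(sharedEvent{})       // 16 bytes (two pointers)

	fmt.Println("chan *sharedEvent backing array:", ptr*bufSize)        // 32 KB, even when empty
	fmt.Println("full chan *sharedEvent + structs:", (ptr+val)*bufSize) // 96 KB when full
	fmt.Println("chan sharedEvent by value:", val*bufSize)              // fixed 64 KB
}
```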
pkg/kv/kvserver/rangefeed/event_size.go
line 240 at r2 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
I'm assuming any underlying data here is non-overlapping. We might be over-accounting if some keys or values point at the same underlying data. But I don't think there is an easy way to track and avoid this.
This is typically decoded from the Raft log before it's handed to us, so it will be individual copies.
pkg/kv/kvserver/rangefeed/event_size.go
line 352 at r2 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
In theory, max(currMemUsage, futureMemUsage) for each op could be pre-determined, given that they share the same underlying data and the only difference lies in the constant base struct sizes. The rationale behind the current approach is: instead of picking max(currMemUsage, futureMemUsage) for every op, we want to compare the current state before any events are published with the future state after all events are published. This can differ from picking max(currMemUsage, futureMemUsage) for each op. There could be some intermediate state where the first logicalOp has been published but the second one hasn't, but I'm ignoring that for now since the intermediate state ends soon after this consumeEvent finishes. I'm also assuming publishToOverlapping cannot be called concurrently (which seems true as of now), since otherwise there could be multiple sharedEvents in flight simultaneously.
I'm not sure we necessarily need to split this into `currMemUsage` and `futureMemUsage`. From a quick glance at the events, it seems like an `event` will typically be larger than the resulting set of `RangeFeedEvent`, so computing that alone will be sufficient? And even if it isn't, the difference shouldn't be large enough to really matter? The former will be garbage collected once the latter has been produced.
Let's consider the two representative examples: a batch of 10 writes, and a checkpoint.
A batch of 10 writes is:
```go
_ = event{
	ops: opsEvent{
		enginepb.MVCCLogicalOp{
			WriteValue: &enginepb.MVCCWriteValueOp{
				Key:              roachpb.Key("foo"),
				Timestamp:        hlc.Timestamp{},
				Value:            []byte("value"),
				PrevValue:        []byte("prev"),
				OmitInRangefeeds: false,
			},
			WriteIntent:  nil,
			UpdateIntent: nil,
			CommitIntent: nil,
			AbortIntent:  nil,
			AbortTxn:     nil,
			DeleteRange:  nil,
		},
		// ...repeated 9 times
	},
	ct:      ctEvent{},
	initRTS: initRTSEvent(false),
	sst:     nil,
	sync:    nil,
	alloc:   &SharedBudgetAllocation{},
}
_ = &kvpb.RangeFeedEvent{
	Val: &kvpb.RangeFeedValue{
		Key:       roachpb.Key("foo"),
		Value:     roachpb.Value{Timestamp: hlc.Timestamp{}, RawBytes: []byte("value")},
		PrevValue: roachpb.Value{Timestamp: hlc.Timestamp{}, RawBytes: []byte("prev")},
	},
	Checkpoint:  nil,
	Error:       nil,
	SST:         nil,
	DeleteRange: nil,
}
// ...repeated 9 times
```
Just by eyeballing it, we can tell that the `event` uses more memory than `RangeFeedEvent`, because the structs are larger and contain more stuff. If you print the output from your computations, I'm sure you'll find the same result. This also applies for checkpoints, compare:
```go
_ = event{
	ops:     nil,
	ct:      ctEvent{Timestamp: hlc.Timestamp{WallTime: 1, Logical: 0}},
	initRTS: initRTSEvent(false),
	sst:     nil,
	sync:    nil,
	alloc:   &SharedBudgetAllocation{},
}
_ = &kvpb.RangeFeedEvent{
	Val: nil,
	Checkpoint: &kvpb.RangeFeedCheckpoint{
		// Span just contains references to the rangefeed's span, so two pointers = 16 bytes.
		Span:       roachpb.Span{Key: roachpb.Key("a"), EndKey: roachpb.Key("b")},
		ResolvedTS: hlc.Timestamp{WallTime: 1, Logical: 0},
	},
	Error:       nil,
	SST:         nil,
	DeleteRange: nil,
}
```
Where I think this ends up using about the same memory -- or it's otherwise only off by a few bytes.
So I think we should keep this simple and just compute the size of the `event`, with the assumption that the resulting Protobuf events will generally be smaller or roughly the same size.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/rangefeed/event_size.go
line 49 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
You're right that we create a new `sharedEvent` copy for every registration, but I think we can ignore it. A `sharedEvent` is only 16 bytes (two pointers), so the overhead is fairly small.

It's also worth considering that a buffered channel will use memory as if it was full regardless of the number of events that it contains -- in other words, a `chan(*sharedEvent, 4096)` will use 8*4096 bytes of memory (32 KB) even when it's empty. When we fill it up, we have (8+16)*4096 bytes (96 KB) to account for the actual `sharedEvent` structs. If we wanted to, we could cut this to a fixed 16*4096 bytes (64 KB) by dropping the pointer and using `chan(sharedEvent, 4096)`, which would eliminate half of the overhead of the `sharedEvent` struct, but I don't think it matters all that much. Each registration also has a lot of other overhead like the goroutine stack and gRPC buffers and such which is likely larger than this and which isn't accounted for at all.

This is all to say that I don't think we should add a lot of complexity here just to account for these tiny `sharedEvents`. If we later find that we need it we can deal with it then.
Ack. Happy to just keep it simple then. I will keep the other PR as a draft PR in case we want to revisit this in the future.
pkg/kv/kvserver/rangefeed/event_size.go
line 352 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'm not sure we necessarily need to split this into `currMemUsage` and `futureMemUsage`. From a quick glance at the events, it seems like an `event` will typically be larger than the resulting set of `RangeFeedEvent`, so computing that alone will be sufficient? And even if it isn't, the difference shouldn't be large enough to really matter? The former will be garbage collected once the latter has been produced.

Let's consider the two representative examples: a batch of 10 writes, and a checkpoint.

A batch of 10 writes is:

```go
_ = event{
	ops: opsEvent{
		enginepb.MVCCLogicalOp{
			WriteValue: &enginepb.MVCCWriteValueOp{
				Key:              roachpb.Key("foo"),
				Timestamp:        hlc.Timestamp{},
				Value:            []byte("value"),
				PrevValue:        []byte("prev"),
				OmitInRangefeeds: false,
			},
			WriteIntent:  nil,
			UpdateIntent: nil,
			CommitIntent: nil,
			AbortIntent:  nil,
			AbortTxn:     nil,
			DeleteRange:  nil,
		},
		// ...repeated 9 times
	},
	ct:      ctEvent{},
	initRTS: initRTSEvent(false),
	sst:     nil,
	sync:    nil,
	alloc:   &SharedBudgetAllocation{},
}
_ = &kvpb.RangeFeedEvent{
	Val: &kvpb.RangeFeedValue{
		Key:       roachpb.Key("foo"),
		Value:     roachpb.Value{Timestamp: hlc.Timestamp{}, RawBytes: []byte("value")},
		PrevValue: roachpb.Value{Timestamp: hlc.Timestamp{}, RawBytes: []byte("prev")},
	},
	Checkpoint:  nil,
	Error:       nil,
	SST:         nil,
	DeleteRange: nil,
}
// ...repeated 9 times
```

Just by eyeballing it, we can tell that the `event` uses more memory than `RangeFeedEvent`, because the structs are larger and contain more stuff. If you print the output from your computations, I'm sure you'll find the same result. This also applies for checkpoints, compare:

```go
_ = event{
	ops:     nil,
	ct:      ctEvent{Timestamp: hlc.Timestamp{WallTime: 1, Logical: 0}},
	initRTS: initRTSEvent(false),
	sst:     nil,
	sync:    nil,
	alloc:   &SharedBudgetAllocation{},
}
_ = &kvpb.RangeFeedEvent{
	Val: nil,
	Checkpoint: &kvpb.RangeFeedCheckpoint{
		// Span just contains references to the rangefeed's span, so two pointers = 16 bytes.
		Span:       roachpb.Span{Key: roachpb.Key("a"), EndKey: roachpb.Key("b")},
		ResolvedTS: hlc.Timestamp{WallTime: 1, Logical: 0},
	},
	Error:       nil,
	SST:         nil,
	DeleteRange: nil,
}
```

Where I think this ends up using about the same memory -- or it's otherwise only off by a few bytes.

So I think we should keep this simple and just compute the size of the `event`, with the assumption that the resulting Protobuf events will generally be smaller or roughly the same size.
I ran some more benchmarks.
If we just want to pick the maximum between current and future events, we can say:
- For logical ops, current event memory is always larger if we assume no checkpoint events will be published for any logical ops. (I think you mentioned that these checkpoint events are very minimal.) (same underlying data)
- For checkpoint events (ct and initRTS), the future event is larger. (curr: 80, future: 152) (span points to the same underlying data)
- For sst events, the future base struct is larger than the current base struct by 2 (same underlying data).
- For sync events, the current event is always larger, as no future events will be created.
I'm happy to just pick the max between current and future and take that as an estimate to keep things simple (see the sketch below). We can revisit this in the future if anything comes up.
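For concreteness, a minimal sketch of that "pick the max" strategy; the function name and signature are placeholders rather than the actual event_size.go API:

```go
// maxMemUsage charges the budget with whichever is larger: the event as
// currently queued, or the future RangeFeedEvents it is expected to produce.
func maxMemUsage(currMemUsage, futureMemUsage int64) int64 {
	if currMemUsage > futureMemUsage {
		return currMemUsage
	}
	return futureMemUsage
}
```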
(force-pushed from 3cc018e to d6352da)
Reviewed 2 of 4 files at r1, 1 of 3 files at r2, 3 of 3 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/rangefeed/event_size.go
line 33 at r6 (raw file):
```go
)

const (
```
nit: use a single `const` block for these.
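Roughly what that might look like; the constant names mirror ones mentioned elsewhere in this review, but the exact set and definitions in event_size.go are assumed:

```go
// Single const block for all base-struct sizes used by the accounting.
const (
	mvccLogicalOp      = int64(unsafe.Sizeof(enginepb.MVCCLogicalOp{}))
	mvccWriteValueOp   = int64(unsafe.Sizeof(enginepb.MVCCWriteValueOp{}))
	mvccCommitIntentOp = int64(unsafe.Sizeof(enginepb.MVCCCommitIntentOp{}))
)
```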
pkg/kv/kvserver/rangefeed/event_size.go
line 208 at r6 (raw file):
```go
currMemUsage := mvccLogicalOp * int64(cap(ops))
for _, op := range ops {
	switch t := op.GetValue().(type) {
```
Let's add a default clause that fatals, so we catch any cases that might be added later.
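A sketch of what that could look like, reusing names from the quoted snippet; the function wrapper and the exact cases are illustrative, not the actual event_size.go code:

```go
// logicalOpMemUsage is a hypothetical helper showing the suggested default clause.
func logicalOpMemUsage(ctx context.Context, op enginepb.MVCCLogicalOp) int64 {
	switch t := op.GetValue().(type) {
	case *enginepb.MVCCWriteValueOp:
		return int64(cap(t.Key)) + int64(cap(t.Value)) + int64(cap(t.PrevValue))
	// ... remaining op types elided ...
	default:
		// Fail loudly so that a newly added op type cannot silently escape accounting.
		log.Fatalf(ctx, "unknown logical op %T", t)
		return 0
	}
}
```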
pkg/kv/kvserver/rangefeed/event_size.go
line 243 at r6 (raw file):
```go
// currMemUsage: pointer to e is passed to p.eventC channel. Each e pointer is
// &event{}, and each pointer points at an underlying event{}.
switch {
```
Let's add a default clause here as well that fatals.
pkg/kv/kvserver/rangefeed/scheduled_processor.go
line 702 at r6 (raw file):
```go
// move forward. If so, publish a RangeFeedCheckpoint notification.
if p.rts.ConsumeLogicalOp(ctx, op) {
	log.Infof(ctx, "memory accounting inaccurate: %d bytes slipped through", rangefeedCheckpointOpMemUsage())
```
Let's remove this; I don't think operators care about this at all and it will just confuse them.
pkg/kv/kvserver/rangefeed/event_size_test.go
line 194 at r6 (raw file):
```go
	Timestamp: testTs,
})
expectedMemUsage += mvccDeleteRangeOp + int64(cap(startKey)) + int64(cap(endKey))
```
We should make sure some tests here will catch changes to the structs. Right now, if someone adds e.g. a new `[]byte` field, the test will still pass, but we're not accounting for it. We should have a test that fails to remind people to update the memory accounting here. You could consider using something like `echotest` for this, or some other mechanism that you prefer -- your call.
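One possible shape for such a guard, shown here with a plain reflection check rather than `echotest`; the test name and the field count are assumptions based on the struct shown earlier in this thread:

```go
// TestLogicalOpStructsAccounted fails when MVCCWriteValueOp gains or loses a
// field, reminding the author to revisit the memory accounting in event_size.go.
func TestLogicalOpStructsAccounted(t *testing.T) {
	// Fields today: Key, Timestamp, Value, PrevValue, OmitInRangefeeds.
	require.Equal(t, 5, reflect.TypeOf(enginepb.MVCCWriteValueOp{}).NumField())
}
```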
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/rangefeed/event_size_test.go
line 194 at r6 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
We should make sure some tests here will catch changes to the structs. Right now, if someone adds e.g. a new `[]byte` field, the test will still pass, but we're not accounting for it. We should have a test that fails to remind people to update the memory accounting here. You could consider using something like `echotest` for this, or some other mechanism that you prefer -- your call.
I want to get this in before the branch cut, but I created a GitHub issue tracking this. I'm planning to work on it next week.
pkg/kv/kvserver/rangefeed/processor_test.go
line 821 at r7 (raw file):
```go
func TestProcessorMemoryBudgetReleased(t *testing.T) {
	defer leaktest.AfterTest(t)()
	fb := newTestBudget(250)
```
@erikgrinaker Do you mind taking a quick look at this change I just made? I missed a failing test and would like to confirm my approach. Initially, the test's budget was 40 bytes, which was enough because each in-flight message was only 16 bytes. With my changes, each in-flight message is now 238 bytes, which caused this test to time out without enough budget to send the events. I adjusted the budget to 250 bytes here, which I believe matches the original intent of this test: ensuring in-flight messages stay within the budget, and that memory allocated for in-flight messages is properly released.
Reviewed 2 of 3 files at r7, 3 of 3 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/rangefeed/event_size_test.go
line 194 at r6 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
I want to get this in before the branch cut, but I created a GitHub issue tracking this. I'm planning to work on it next week.
👍
pkg/kv/kvserver/rangefeed/processor_test.go
line 821 at r7 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
@erikgrinaker Do you mind taking a quick look at this change I just made? I missed a failing test and would like to confirm my approach. Initially, the test's budget was 40 which was enough because in-flight message was only 16 bytes. With my changes, each inflight message is now 238 bytes which caused this test to time out and not have enough budget to send the events. I adjusted the budget to 250 bytes here which I believe matches the original intent of this test - ensuring inflight messages are within the budget, and memory allocated for in-flight messages are properly released.
LGTM!
TFTR!! bors r=erikgrinaker
121138: storage,ingest: enable ingestion optimization using flushableIngest r=aadityasondhi a=aadityasondhi

In cockroachdb/pebble#3398, we introduced an optimization in pebble where we could use flushableIngests with excises for ingesting SSTs with RangeDels and RangeKeyDels. This patch turns this optimization on using `sstContainsExciseTombstone`.

Informs cockroachdb/pebble#3335.

Release note: None

121164: Revert "rangefeed: unconditionally use rangefeed scheduler" r=erikgrinaker a=erikgrinaker

This reverts 3/4 commits from #114410, only leaving 1dbdbe9 which enabled the scheduler by default. There was significant code drift in tests, but I think I've been able to address the conflicts.

This revert does not integrate with the changes from #120347, so the legacy rangefeed processor will not use the improved memory accounting. We still shouldn't leak budget allocations, since the processor implementations are entirely independent, but it's worth having a second look to make sure the registry (which is used by both processors) doesn't make assumptions about this.

Touches #121054.

Epic: none.

Release note: the following release note from #114410 should be reverted/ignored: "the setting kv.rangefeed.scheduler.enabled has been removed, as the rangefeed scheduler is now unconditionally enabled".

121212: security: wrap error when parsing DN r=rafiss a=rafiss

Epic: CRDB-34126

Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
To prevent OOMs, we previously introduced a memory budget to limit event buffering
in the processor.eventC channel. These events can point to large underlying data
structures, and keeping them in the channel prevents them from being garbage
collected. However, the existing memory accounting did not have a good way to
predict the memory footprint of events. It uses calculateDateSize, which 1. does
not account for memory of the new RangeFeedEvents that could be spawned (such as
checkpoint events), 2. inaccurately uses the protobuf-encoded wire size for data,
and 3. does not account for memory of the base structs.

This patch improves the memory estimation by resolving the three points above.
It makes a best effort to account for base struct memory, predict the generation
of new events, and use actual data size rather than compressed protobuf size.

Since it is challenging to predict whether a new checkpoint event will be
created, our strategy is: instead of accounting memory for both the current event
and the future rangefeed events, always pick the maximum memory between the two.
It also allows additional shared event overhead (should be very small) and
checkpoint events caused by logical ops (should be rare) to slip away.

- For logical ops, we always take current event memory. Current event memory is
  always larger if we assume no checkpoint events will be published for any
  logical ops. (same underlying data)
- For checkpoint events (ct and initRTS), we always take the checkpoint rangefeed
  event. Future events are larger. (curr: 80, future: 152) (span points to the
  same underlying data)
- For sst events, we always take the future rangefeed event (the future base struct is
  larger than the current base struct by 2). (same underlying data)
- For sync events, we always take current events. (current memory usage is
  larger as no future events will be created)

We got this data from the test benchmarks documented in
#120582.

Resolves: #114190

Release note: None