
rangefeed: improve memory accounting for event queue #120347

Merged
merged 1 commit into from
Mar 22, 2024

Conversation

wenyihu6
Contributor

@wenyihu6 wenyihu6 commented Mar 12, 2024

To prevent OOMs, we previously introduced a memory budget to limit event buffering
in the processor.eventC channel. These events can point to large underlying data
structures, and keeping them in the channel prevents them from being garbage
collected. However, the existing memory accounting did not have a good way to
predict the memory footprint of events. It used calculateDateSize, which 1. does
not account for the memory of new RangeFeedEvents that could be spawned (such as
checkpoint events), 2. inaccurately uses the protobuf-encoded wire size for the
data, and 3. does not account for the memory of the base structs.

This patch improves the memory estimation by resolving the three points above.
It makes a best-effort attempt to account for the base structs' memory, predict
the generation of new events, and use the actual data size rather than the
compressed protobuf size. Since it is challenging to predict whether a new
checkpoint event will be published, our strategy is:

Instead of accounting memory for both the current event and the future rangefeed
events, we always pick the larger of the two. We also allow the additional
shared-event overhead (which should be very small) and checkpoint events caused
by logical ops (which should be rare) to slip through unaccounted.

  • For logical ops, we always take the current event's memory. The current
    event is always larger if we assume no checkpoint events will be published
    for any logical ops. (Same underlying data.)
  • For checkpoint events (ct and initRTS), we always take the checkpoint
    rangefeed event. The future event is larger (curr: 80, future: 152). (The
    span points to the same underlying data.)
  • For sst events, we always take the future rangefeed event (the future base
    structs are larger than the current base structs by 2). (Same underlying
    data.)
  • For sync events, we always take the current event. (Current memory usage is
    larger since no future events will be created.)

These numbers come from the test benchmarks documented in #120582.

Resolves: #114190
Release note: None

@cockroach-teamcity
Member

This change is Reviewable


blathers-crl bot commented Mar 13, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@wenyihu6 wenyihu6 marked this pull request as ready for review March 13, 2024 12:58
@wenyihu6 wenyihu6 requested review from a team and erikgrinaker March 13, 2024 12:58
randomlyChosenEvent string,
randomlyChosenLogicalOp string,
) {
// Opt out sst event since it requires testing.T to call
Contributor Author

The benchmark randomly generates lots of events, and every type of event with equal probability, which is not true in production (for example, initRTS should happen much less frequently than logicalOps). I will improve the benchmark with a weighted random distribution soon.
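
For reference, a minimal sketch of what that weighted choice could look like, using math/rand (the weights would need to be tuned to realistic production ratios; this is illustrative only, not the benchmark's actual code):

	// weightedChoice returns one of the keys in weights, picked with
	// probability proportional to its weight.
	func weightedChoice(rng *rand.Rand, weights map[string]int) string {
		total := 0
		for _, w := range weights {
			total += w
		}
		pick := rng.Intn(total)
		for name, w := range weights {
			if pick < w {
				return name
			}
			pick -= w
		}
		return "" // unreachable for non-empty weights
	}

For example, weights like {"logicalOps": 90, "ct": 8, "initRTS": 1, "sst": 1} would make initRTS and sst events rare relative to logical ops.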

@wenyihu6 wenyihu6 changed the title [wip] rangefeed: improve memory accounting for event queue rangefeed: improve memory accounting for event queue Mar 13, 2024
@wenyihu6 wenyihu6 force-pushed the firstpr branch 2 times, most recently from 416a9f8 to a11824c Compare March 13, 2024 14:42
Contributor Author

@wenyihu6 wenyihu6 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/rangefeed/event_size.go line 240 at r2 (raw file):

	// data and was already accounted in MVCCCommitIntentOp.
	currMemUsage := mvccCommitIntentOp
	currMemUsage += int64(cap(txnID))

I'm assuming any underlying data here is non-overlapping. We might be over-accounting if some keys or values point at the same underlying data. But I don't think there is an easy way to track and avoid this.


pkg/kv/kvserver/rangefeed/event_size.go line 352 at r2 (raw file):

	// event will be created.
	futureMemUsage += rangefeedCheckpointOpMemUsage() * int64(len(ops))
	return currMemUsage, futureMemUsage

In theory, max(currMemUsage, futureMemUsage) for each op could be pre-determined, given that they share the same underlying data and the only difference lies in the constant base struct sizes. The rationale behind the current approach is: instead of picking max(currMemUsage, futureMemUsage) for every op, we want to compare the current state before any events are published against the future state after all events are published. This might differ from picking max(currMemUsage, futureMemUsage) for each op: there could be a middle state where the first logicalOp has been published but the second one hasn't, but I'm ignoring that for now since the middle state ends soon after consumeEvent finishes. I'm also assuming publishToOverlapping cannot be called concurrently (which seems true as of now), since otherwise there could be multiple sharedEvents simultaneously.
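
To illustrate the difference between the two approaches (opMemUsage is a hypothetical helper, not the actual function in this PR, and max is Go 1.21's built-in):

	// Per-op accounting: take the max for each op independently.
	perOp := int64(0)
	for _, op := range ops {
		curr, future := opMemUsage(op)
		perOp += max(curr, future)
	}

	// Aggregate accounting (the approach here): compare the total state
	// before any events are published against the total state after all
	// events are published.
	var currTotal, futureTotal int64
	for _, op := range ops {
		curr, future := opMemUsage(op)
		currTotal += curr
		futureTotal += future
	}
	aggregate := max(currTotal, futureTotal)
	// aggregate <= perOp; the two only differ because of the transient
	// middle states mentioned above.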

@wenyihu6 wenyihu6 force-pushed the firstpr branch 4 times, most recently from ca3f5e4 to 79323d7 Compare March 15, 2024 05:31
@wenyihu6
Contributor Author

wenyihu6 commented Mar 15, 2024

Upon more reflection, my initial observation that we only need to account memory for one sharedEvent base struct across registrations feels off. We end up making copies of sharedEvent, and they might all remain buffered in the registration's buf channel. The memory for the base struct is released only after the registration sends it.

I'm making another PR with a cleaner design that handles this. It pre-allocates for all current and future event base structs plus their underlying data. The underlying data is not released until the very end, but we gradually release memory as checkpoint events are published or as the current event structs go away. Let me know which one you would prefer.

Contributor

@erikgrinaker erikgrinaker left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/rangefeed/event_size.go line 49 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Upon more reflection, my initial observation that we only need to account memory for one sharedEvent base struct across registrations feels off. We end up making copies of sharedEvent, and they might all remain buffered in the registration's buf channel. The memory for the base struct is released only after the registration sends it https://github.com/cockroachdb/cockroach/blob/a7586d1e6d26eda5f9dad05f76396263632ef265/pkg/kv/kvserver/rangefeed/registry.go#L341. I will add a second commit for this and squash if you agree.

You're right that we create a new sharedEvent copy for every registration, but I think we can ignore it. A sharedEvent is only 16 bytes (two pointers), so the overhead is fairly small.

It's also worth considering that a buffered channel will use memory as if it was full regardless of the number of events that it contains -- in other words, a chan(*sharedEvent, 4096) will use 8*4096 bytes of memory (32 KB) even when it's empty. When we fill it up, we have (8+16)*4096 bytes (96 KB) to account for the actual sharedEvent structs. If we wanted to, we could cut this to a fixed 16*4096 bytes (64 KB) by dropping the pointer and using chan(sharedEvent, 4096), which would eliminate half of the overhead of the sharedEvent struct, but I don't think it matters all that much. Each registration also has a lot of other overhead like the goroutine stack and gRPC buffers and such which is likely larger than this and which isn't accounted for at all.

This is all to say that I don't think we should add a lot of complexity here just to account for these tiny sharedEvents. If we later find that we need it we can deal with it then.
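
For concreteness, the arithmetic above on a 64-bit platform (same `_ =` snippet style as the examples further down):

	const (
		ptrSize         = 8  // one pointer
		sharedEventSize = 16 // sharedEvent: two pointers
		bufferLen       = 4096
	)
	// chan *sharedEvent with capacity 4096: the backing array stores pointers.
	_ = ptrSize * bufferLen                     // 32 KB allocated even when empty
	_ = (ptrSize + sharedEventSize) * bufferLen // 96 KB when full (pointers + structs)
	// chan sharedEvent with capacity 4096: structs are stored inline.
	_ = sharedEventSize * bufferLen             // fixed 64 KB, whether empty or full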


pkg/kv/kvserver/rangefeed/event_size.go line 240 at r2 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I'm assuming any underlying data here is non-overlapping. We might be over-accounting if some keys or values point at the same underlying data. But I don't think there is an easy way to track and avoid this.

This is typically decoded from the Raft log before it's handed to us, so it will be individual copies.


pkg/kv/kvserver/rangefeed/event_size.go line 352 at r2 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

In theory, max(currMemUsage, futureMemUsage) for each op could be pre-determined, given that they share the same underlying data and the only difference lies in the constant base struct sizes. The rationale behind the current approach is: instead of picking max(currMemUsage, futureMemUsage) for every op, we want to compare the current state before any events are published against the future state after all events are published. This might differ from picking max(currMemUsage, futureMemUsage) for each op: there could be a middle state where the first logicalOp has been published but the second one hasn't, but I'm ignoring that for now since the middle state ends soon after consumeEvent finishes. I'm also assuming publishToOverlapping cannot be called concurrently (which seems true as of now), since otherwise there could be multiple sharedEvents simultaneously.

I'm not sure we necessarily need to split this into currMemUsage and futureMemUsage. From a quick glance at the events, it seems like an event will typically be larger than the resulting set of RangeFeedEvent, so computing that alone will be sufficient? And even if it isn't, the difference shouldn't be large enough to really matter? The former will be garbage collected once the latter has been produced.

Let's consider the two representative examples: a batch of 10 writes, and a checkpoint.

A batch of 10 writes is:

	_ = event{
		ops: opsEvent{
			enginepb.MVCCLogicalOp{
				WriteValue: &enginepb.MVCCWriteValueOp{
					Key:              roachpb.Key("foo"),
					Timestamp:        hlc.Timestamp{},
					Value:            []byte("value"),
					PrevValue:        []byte("prev"),
					OmitInRangefeeds: false,
				},
				WriteIntent:  nil,
				UpdateIntent: nil,
				CommitIntent: nil,
				AbortIntent:  nil,
				AbortTxn:     nil,
				DeleteRange:  nil,
			},
			// ...repeated 9 times
		},
		ct:      ctEvent{},
		initRTS: initRTSEvent(false),
		sst:     nil,
		sync:    nil,
		alloc:   &SharedBudgetAllocation{},
	}

	_ = &kvpb.RangeFeedEvent{
		Val: &kvpb.RangeFeedValue{
			Key:       roachpb.Key("foo"),
			Value:     roachpb.Value{Timestamp: hlc.Timestamp{}, RawBytes: []byte("value")},
			PrevValue: roachpb.Value{Timestamp: hlc.Timestamp{}, RawBytes: []byte("prev")},
		},
		Checkpoint:  nil,
		Error:       nil,
		SST:         nil,
		DeleteRange: nil,
	}
	// ...repeated 9 times

Just by eyeballing it, we can tell that the event uses more memory than RangeFeedEvent, because the structs are larger and contain more stuff. If you print the output from your computations, I'm sure you'll find the same result. This also applies for checkpoints, compare:

	_ = event{
		ops:     nil,
		ct:      ctEvent{Timestamp: hlc.Timestamp{WallTime: 1, Logical: 0}},
		initRTS: initRTSEvent(false),
		sst:     nil,
		sync:    nil,
		alloc:   &SharedBudgetAllocation{},
	}

	_ = &kvpb.RangeFeedEvent{
		Val: nil,
		Checkpoint: &kvpb.RangeFeedCheckpoint{
			// Span just contains references to the rangefeed's span, so two pointers = 16 bytes.
			Span:       roachpb.Span{Key: roachpb.Key("a"), EndKey: roachpb.Key("b")},
			ResolvedTS: hlc.Timestamp{WallTime: 1, Logical: 0},
		},
		Error:       nil,
		SST:         nil,
		DeleteRange: nil,
	}

Where I think this ends up using about the same memory -- or it's otherwise only off by a few bytes.

So I think we should keep this simple and just compute the size of the event, with the assumption that the resulting Protobuf events will generally be smaller or roughly the same size.

Contributor Author

@wenyihu6 wenyihu6 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/rangefeed/event_size.go line 49 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

You're right that we create a new sharedEvent copy for every registration, but I think we can ignore it. A sharedEvent is only 16 bytes (two pointers), so the overhead is fairly small.

It's also worth considering that a buffered channel will use memory as if it was full regardless of the number of events that it contains -- in other words, a chan(*sharedEvent, 4096) will use 8*4096 bytes of memory (32 KB) even when it's empty. When we fill it up, we have (8+16)*4096 bytes (96 KB) to account for the actual sharedEvent structs. If we wanted to, we could cut this to a fixed 16*4096 bytes (64 KB) by dropping the pointer and using chan(sharedEvent, 4096), which would eliminate half of the overhead of the sharedEvent struct, but I don't think it matters all that much. Each registration also has a lot of other overhead like the goroutine stack and gRPC buffers and such which is likely larger than this and which isn't accounted for at all.

This is all to say that I don't think we should add a lot of complexity here just to account for these tiny sharedEvents. If we later find that we need it we can deal with it then.

Ack. Happy to just keep it simple then. I will keep the other PR as a draft PR in case we want to revisit this in the future.


pkg/kv/kvserver/rangefeed/event_size.go line 352 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I'm not sure we necessarily need to split this into currMemUsage and futureMemUsage. From a quick glance at the events, it seems like an event will typically be larger than the resulting set of RangeFeedEvent, so computing that alone will be sufficient? And even if it isn't, the difference shouldn't be large enough to really matter? The former will be garbage collected once the latter has been produced.

Let's consider the two representative examples: a batch of 10 writes, and a checkpoint.

A batch of 10 writes is:

	_ = event{
		ops: opsEvent{
			enginepb.MVCCLogicalOp{
				WriteValue: &enginepb.MVCCWriteValueOp{
					Key:              roachpb.Key("foo"),
					Timestamp:        hlc.Timestamp{},
					Value:            []byte("value"),
					PrevValue:        []byte("prev"),
					OmitInRangefeeds: false,
				},
				WriteIntent:  nil,
				UpdateIntent: nil,
				CommitIntent: nil,
				AbortIntent:  nil,
				AbortTxn:     nil,
				DeleteRange:  nil,
			},
			// ...repeated 9 times
		},
		ct:      ctEvent{},
		initRTS: initRTSEvent(false),
		sst:     nil,
		sync:    nil,
		alloc:   &SharedBudgetAllocation{},
	}

	_ = &kvpb.RangeFeedEvent{
		Val: &kvpb.RangeFeedValue{
			Key:       roachpb.Key("foo"),
			Value:     roachpb.Value{Timestamp: hlc.Timestamp{}, RawBytes: []byte("value")},
			PrevValue: roachpb.Value{Timestamp: hlc.Timestamp{}, RawBytes: []byte("prev")},
		},
		Checkpoint:  nil,
		Error:       nil,
		SST:         nil,
		DeleteRange: nil,
	}
	// ...repeated 9 times

Just by eyeballing it, we can tell that the event uses more memory than RangeFeedEvent, because the structs are larger and contain more stuff. If you print the output from your computations, I'm sure you'll find the same result. This also applies for checkpoints, compare:

	_ = event{
		ops:     nil,
		ct:      ctEvent{Timestamp: hlc.Timestamp{WallTime: 1, Logical: 0}},
		initRTS: initRTSEvent(false),
		sst:     nil,
		sync:    nil,
		alloc:   &SharedBudgetAllocation{},
	}

	_ = &kvpb.RangeFeedEvent{
		Val: nil,
		Checkpoint: &kvpb.RangeFeedCheckpoint{
			// Span just contains references to the rangefeed's span, so two pointers = 16 bytes.
			Span:       roachpb.Span{Key: roachpb.Key("a"), EndKey: roachpb.Key("b")},
			ResolvedTS: hlc.Timestamp{WallTime: 1, Logical: 0},
		},
		Error:       nil,
		SST:         nil,
		DeleteRange: nil,
	}

Where I think this ends up using about the same memory -- or it's otherwise only off by a few bytes.

So I think we should keep this simple and just compute the size of the event, with the assumption that the resulting Protobuf events will generally be smaller or roughly the same size.

I ran some more benchmarks.

If we just want to pick the maximum between current and future events, we can say:

  • For logical ops, current event memory is always larger if we assume no checkpoint events will be published for any logical ops. (I think you mentioned that these checkpoint events are very minimal.) (same underlying data)
  • For checkpoint events (ct and initRTS), future event is larger. (curr: 80, future: 152) (span points to the same underlying data)
  • For sst events, the future base structs are larger than the current base structs by 2 (same underlying data).
  • For sync event, current event is always larger as no future events will be created.

I'm happy to just pick the max between current and future and take that as an estimation to keep things simple. We can revisit this in the future if anything comes up.
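
As a minimal sketch of that per-type rule (stand-in types and placeholder size helpers for illustration, not the actual event_size.go code):

	// Stand-ins for illustration; the real event struct lives in
	// pkg/kv/kvserver/rangefeed and has ops, ct, initRTS, sst, sync, and
	// alloc fields.
	type event struct {
		hasOps, hasCT, initRTS, hasSST, isSync bool
	}

	// currMemUsage and futureMemUsage stand in for the actual size computations.
	func currMemUsage(e event) int64   { return 0 /* placeholder */ }
	func futureMemUsage(e event) int64 { return 0 /* placeholder */ }

	// eventMemUsage picks the larger footprint per event type, per the
	// bullets above.
	func eventMemUsage(e event) int64 {
		switch {
		case e.hasOps:
			return currMemUsage(e) // current is larger, assuming no checkpoint is published
		case e.hasCT || e.initRTS:
			return futureMemUsage(e) // future checkpoint event is larger (152 vs 80 bytes)
		case e.hasSST:
			return futureMemUsage(e) // future base structs are slightly larger
		case e.isSync:
			return currMemUsage(e) // no future event is created
		default:
			return 0
		}
	}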

Contributor

@erikgrinaker erikgrinaker left a comment

Reviewed 2 of 4 files at r1, 1 of 3 files at r2, 3 of 3 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/rangefeed/event_size.go line 33 at r6 (raw file):

)

const (

nit: use a single const block for these.


pkg/kv/kvserver/rangefeed/event_size.go line 208 at r6 (raw file):

	currMemUsage := mvccLogicalOp * int64(cap(ops))
	for _, op := range ops {
		switch t := op.GetValue().(type) {

Let's add a default clause that fatals, so we catch any cases that might be added later.
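
For example, something along these lines (a sketch based on the snippet above, using log.Fatalf from util/log):

	switch t := op.GetValue().(type) {
	case *enginepb.MVCCWriteValueOp:
		currMemUsage += int64(cap(t.Key)) + int64(cap(t.Value)) + int64(cap(t.PrevValue))
	// ... existing cases for the other logical op types ...
	default:
		log.Fatalf(ctx, "unknown logical op %T", t)
	}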


pkg/kv/kvserver/rangefeed/event_size.go line 243 at r6 (raw file):

	// currMemUsage: pointer to e is passed to p.eventC channel. Each e pointer is
	// &event{}, and each pointer points at an underlying event{}.
	switch {

Let's add a default clause here as well that fatals.


pkg/kv/kvserver/rangefeed/scheduled_processor.go line 702 at r6 (raw file):

		// move forward. If so, publish a RangeFeedCheckpoint notification.
		if p.rts.ConsumeLogicalOp(ctx, op) {
			log.Infof(ctx, "memory accounting inaccurate: %d bytes slipped through", rangefeedCheckpointOpMemUsage())

Let's remove this, I don't think operators care about this at all and it will just confuse them.


pkg/kv/kvserver/rangefeed/event_size_test.go line 194 at r6 (raw file):

		Timestamp: testTs,
	})
	expectedMemUsage += mvccDeleteRangeOp + int64(cap(startKey)) + int64(cap(endKey))

We should make sure some tests here will catch changes to the structs. Right now, if someone adds e.g. a new []byte field, the test will still pass, but we're not accounting for it. We should have a test that fails to remind people to update the memory accounting here. You could consider using something like echotest for this, or some other mechanism that you prefer -- your call.
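
One lightweight option, as a sketch (not the echotest approach, and assuming the constants in event_size.go are defined as the in-memory sizes of the corresponding structs):

	func TestEventSizeConstants(t *testing.T) {
		defer leaktest.AfterTest(t)()
		// If a field is added to or removed from these structs, their size
		// changes and this test fails, reminding the author to update the
		// accounting constants in event_size.go.
		require.EqualValues(t, unsafe.Sizeof(enginepb.MVCCLogicalOp{}), mvccLogicalOp)
		require.EqualValues(t, unsafe.Sizeof(enginepb.MVCCDeleteRangeOp{}), mvccDeleteRangeOp)
	}

A size assertion like this only flags changes that alter the struct layout, so a new field whose referenced data also needs accounting would still require a manual update, but at least the test fails and points people at the right place.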

Contributor Author

@wenyihu6 wenyihu6 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/rangefeed/event_size_test.go line 194 at r6 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

We should make sure some tests here will catch changes to the structs. Right now, if someone adds e.g. a new []byte field, the test will still pass, but we're not accounting for it. We should have a test that fails to remind people to update the memory accounting here. You could consider using something like echotest for this, or some other mechanism that you prefer -- your call.

I want to get this in before the branch cut, but I created a GitHub issue tracking this. I'm planning to work on it next week.


pkg/kv/kvserver/rangefeed/processor_test.go line 821 at r7 (raw file):

func TestProcessorMemoryBudgetReleased(t *testing.T) {
	defer leaktest.AfterTest(t)()
	fb := newTestBudget(250)

@erikgrinaker Do you mind taking a quick look at this change I just made? I missed a failing test and would like to confirm my approach. Initially, the test's budget was 40 bytes, which was enough because each in-flight message was only 16 bytes. With my changes, each in-flight message is now 238 bytes, which caused this test to time out since there was no longer enough budget to send the events. I adjusted the budget to 250 bytes here, which I believe matches the original intent of this test - ensuring in-flight messages stay within the budget and that memory allocated for in-flight messages is properly released.

Contributor

@erikgrinaker erikgrinaker left a comment

Reviewed 2 of 3 files at r7, 3 of 3 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/rangefeed/event_size_test.go line 194 at r6 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I want to get this in before the branch cut, but I created a GitHub issue tracking this. I'm planning to work on it next week.

👍


pkg/kv/kvserver/rangefeed/processor_test.go line 821 at r7 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

@erikgrinaker Do you mind taking a quick look at this change I just made? I missed a failing test and would like to confirm my approach. Initially, the test's budget was 40 bytes, which was enough because each in-flight message was only 16 bytes. With my changes, each in-flight message is now 238 bytes, which caused this test to time out since there was no longer enough budget to send the events. I adjusted the budget to 250 bytes here, which I believe matches the original intent of this test - ensuring in-flight messages stay within the budget and that memory allocated for in-flight messages is properly released.

LGTM!

@wenyihu6
Contributor Author

TFTR!!

bors r=erikgrinaker

@craig
Contributor

craig bot commented Mar 22, 2024

@craig craig bot merged commit 01c8fa0 into cockroachdb:master Mar 22, 2024
22 checks passed
craig bot pushed a commit that referenced this pull request Mar 27, 2024
121138: storage,ingest: enable ingestion optimization using flushableIngest r=aadityasondhi a=aadityasondhi

In cockroachdb/pebble#3398, we introduced an optimization in pebble where we could use flushableIngests with excises for ingesting SSTs with RangeDels and RangeKeyDels.

This patch turns this optimization on using
`sstContainsExciseTombstone`.

Informs cockroachdb/pebble#3335.

Release note: None

121164: Revert "rangefeed: unconditionally use rangefeed scheduler" r=erikgrinaker a=erikgrinaker

This reverts 3/4 commits from #114410, only leaving 1dbdbe9 which enabled the scheduler by default.

There was significant code drift in tests, but I think I've been able to address the conflicts.

This revert does not integrate with the changes from #120347, so the legacy rangefeed processor will not use the improved memory accounting. We still shouldn't leak budget allocations, since the processor implementations are entirely independent, but it's worth having a second look to make sure the registry (which is used by both processors) doesn't make assumptions about this.

Touches #121054.
Epic: none.
Release note: the following release note from #114410 should be reverted/ignored: "the setting kv.rangefeed.scheduler.enabled has been removed, as the rangefeed scheduler is now unconditionally enabled".

121212: security: wrap error when parsing DN r=rafiss a=rafiss

Epic: CRDB-34126
Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@wenyihu6 wenyihu6 deleted the firstpr branch April 27, 2024 14:49