cache,db: de-dup concurrent attempts to read the same block #4157
Conversation
This lacks tests -- I would like an opinion on the approach and interfaces before adding those.
Reviewable status: 0 of 8 files reviewed, all discussions resolved (waiting on @RaduBerinde)
Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)
sstable/reader.go
line 525 at r1 (raw file):
```go
var err error
var errorDuration time.Duration
ch, errorDuration, err = crh.WaitForReadPermissionOrHandle(ctx)
```
Have you considered a `GetOrPopulate` API where we pass a func that does the actual read (or we can define a simple interface that `*Reader` implements)? Then the new logic would be internal to `GetOrPopulate`. All the caller cares about is that at the end we either get a `BufferHandle` or an error. We can keep the simple `Get` for the buffer pool case.
Force-pushed 40fbe2c to 23f4b7f.
Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @RaduBerinde)
sstable/reader.go
line 525 at r1 (raw file):
Previously, RaduBerinde wrote…
Have you considered a `GetOrPopulate` API where we pass a func that does the actual read (or we can define a simple interface that `*Reader` implements)? Then the new logic would be internal to `GetOrPopulate`. All the caller cares about is that at the end we either get a `BufferHandle` or an error. We can keep the simple `Get` for the buffer pool case.
I didn't seriously consider it since there is a lot of shared code in `readBlockInternal`, but there is also enough branching for the `cache.ReadHandle` case. We would want to write one func that does the actual read in both cases, so it would need to have the branching too. It would avoid the need for the following blocks, which precede some return statements:
```go
if crh.Valid() {
	crh.SetReadError(err)
}
```
I've moved those into a defer call here, so that by itself is not a big motivator for `GetOrPopulate`.
I've now done an implementation of `readBlockInternal` (`readBlockInternal2`), using (an unimplemented) `GetOrPopulate`, in sumeerbhola@bac895b. PTAL. By itself, it isn't simpler IMO. But it does hide `ReadHandle`, so it is perhaps compelling enough. Thoughts?
Reviewable status: 0 of 8 files reviewed, 3 unresolved discussions (waiting on @jbowens and @sumeerbhola)
sstable/reader.go
line 525 at r1 (raw file):
Previously, sumeerbhola wrote…
I didn't seriously consider it since there is a lot of shared code in `readBlockInternal`, but there is also enough branching for the `cache.ReadHandle` case. We would want to write one func that does the actual read in both cases, so it would need to have the branching too. It would avoid the need for the following blocks, which precede some return statements:

```go
if crh.Valid() {
	crh.SetReadError(err)
}
```

I've moved those into a defer call here, so that by itself is not a big motivator for `GetOrPopulate`.
I've now done an implementation of `readBlockInternal` (`readBlockInternal2`), using (an unimplemented) `GetOrPopulate`, in sumeerbhola@bac895b. PTAL. By itself, it isn't simpler IMO. But it does hide `ReadHandle`, so it is perhaps compelling enough. Thoughts?
Yeah, that looks worse. I was envisioning that the function it calls would be a separate method, but I haven't considered all the details.
Why do we need a separate `WaitForReadPermissionOrHandle`? Can't this happen inside `GetWithReadHandle`? It could return either the data from the cache (regardless of whether we had to wait for it), or the handle that we will use to populate the cache after we do the read. I think only the first to create the entry should do the read, and anyone who comes in after should wait.
CC @jbowens as well
sstable/reader.go
line 505 at r2 (raw file):
```go
if env.BufferPool == nil {
	ch, crh = r.cacheOpts.Cache.GetWithReadHandle(
		r.cacheOpts.CacheID, r.cacheOpts.FileNum, bh.Offset, r.loadBlockSema)
```
Why do we need to move the semaphore acquire into the cache code? That should just happen right before we issue the read, with the expectation that outside of strange conditions it won't do anything (and if it does, it should just behave like a slower IO). We can treat a ctx cancelation error the same way we would treat an IO error.
sstable/reader.go
line 552 at r2 (raw file):
```go
// INVARIANT: !ch.Valid().
compressed := block.Alloc(int(bh.Length+block.TrailerLen), env.BufferPool)
```
It would be cleaner to separate out the code starting here into a method that does the actual read and returns the decompressed block handle. We can `SetReadError` more cleanly after that call (this is error prone if someone adds an `if err := ...` case).
Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)
internal/cache/read_shard.go
line 92 at r2 (raw file):
NB: we cannot place the loadBlockSema in the readShard since the readShard can be shared across the DBs, just like the block cache.
Should we make the load block semaphore shared across DBs, like the block cache? The primary motivation was preventing excessive memory utilization from many concurrent reads, right? And memory is a shared resource across DBs.
That would have been easier to implement, but there was concern that one store that is broken and has very slow IOs would block out all other stores.
Force-pushed 23f4b7f to 5316806.
I've reworked the logic based on the comments. I'll start adding tests if this seems reasonable.
Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @jbowens and @RaduBerinde)
internal/cache/read_shard.go
line 92 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
NB: we cannot place the loadBlockSema in the readShard since the readShard can be shared across the DBs, just like the block cache.
Should we make the load block semaphore shared across DBs, like the block cache? The primary motivation was preventing excessive memory utilization from many concurrent reads, right? And memory is a shared resource across DBs.
As @RaduBerinde mentioned, this choice was made for performance isolation of stores.
sstable/reader.go
line 525 at r1 (raw file):
Why do we need a separate `WaitForReadPermissionOrHandle`? Can't this happen inside `GetWithReadHandle`?
Yes, it can. It involves passing the context and needs additional return values. I've made the change.
sstable/reader.go
line 505 at r2 (raw file):
Previously, RaduBerinde wrote…
Why do we need to move the semaphore acquire into the cache code? That should just happen right before we issue the read, with the expectation that outside of strange conditions it won't do anything (and if it does, it should just behave like a slower IO). We can treat a ctx cancelation error the same way we would treat an IO error.
Initially it felt cleaner to move all the waiting into one place, and it required less explanation about the contract. But given that we still need to do semaphore acquisition for the `BufferPool` case, I've moved it back.
The contract comment now says:
```go
// The caller must immediately start doing a read, or can first wait on a
// shared resource that would also block a different reader if it was assigned
// the turn instead (specifically, this refers to Options.LoadBlockSema).
```
sstable/reader.go
line 552 at r2 (raw file):
Previously, RaduBerinde wrote…
It would be cleaner to separate out the code starting here into a method that does the actual read and returns the decompressed block handle. We can `SetReadError` more cleanly after that call (this is error prone if someone adds an `if err := ...` case).
Good point. Done.
Force-pushed 5316806 to bf70930.
The approach looks good to me. I will make another pass in more detail but you can start adding tests.
Reviewable status: 0 of 8 files reviewed, 5 unresolved discussions (waiting on @jbowens and @sumeerbhola)
internal/cache/clockpro.go
line 206 at r3 (raw file):
```go
// etHot. But etTest is "colder" than etCold, since the only transition
// into etTest is etCold => etTest, so since etTest transitions to
// etHot, then etCold should also transition.
```
I think you're right. The paper says that if the page isn't in the list it is added as cold (the `e == nil` case), but if it is, "the faulted page turns into a hot page and is placed at the head of the list".
Maybe file an issue? We'd want it in a separate PR.
Reviewable status: 0 of 8 files reviewed, 5 unresolved discussions (waiting on @jbowens and @RaduBerinde)
internal/cache/clockpro.go
line 206 at r3 (raw file):
Previously, RaduBerinde wrote…
I think you're right. The paper says that if the page isn't in the list it is added as cold (the `e == nil` case), but if it is, "the faulted page turns into a hot page and is placed at the head of the list".
Maybe file an issue? We'd want it in a separate PR.
Force-pushed bf70930 to 0c589ad.
Force-pushed 1b49df6 to f299ee0.
Tests are ready.
Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @jbowens and @RaduBerinde)
Reviewable status: 0 of 10 files reviewed, 4 unresolved discussions (waiting on @jbowens and @sumeerbhola)
internal/cache/clockpro.go
line 141 at r4 (raw file):
```go
// Cache.{Get,GetWithReadHandle}. When desireReadEntry is true, and the block
// is not in the cache (!Handle.Valid()), a non-nil readEntry is returned.
func (c *shard) GetWithMaybeReadEntry(
```
[nit] does it need to be exported?
internal/cache/read_shard.go
line 69 at r4 (raw file):
```go
// separate map. This separation also results in more modular code, instead
// of piling more stuff into shard.
type readShard struct {
```
[nit] Would it be cleaner to just add `readMap` to `shard` and just implement the methods on `*shard`? We can still keep them separate.
internal/cache/read_shard.go
line 226 at r4 (raw file):
```go
	return errorDuration
}
start:
```
We should do `for {` and use `continue`. Well, we don't even need continue, we can do:

```go
case _, ok := <-ch:
	if !ok {
		...
		return ..
	}
	// Probably granted permission to do the read; check again. NB: since isReading is
	// false, someone else can slip through before this thread acquires
	// e.mu, and take the turn.
```
Force-pushed f299ee0 to 9ec9476.
Concurrent reads of the same block have been observed to cause very high memory usage, and cause significant CPU usage for allocations/deallocations. We now coordinate across multiple concurrent attempts to read the same block via a readEntry, which makes the readers take turns until one succeeds.

The readEntries are embedded in a map that is part of a readShard, where there is a readShard for each cache.Shard. See the long comment in the readShard declaration for motivation.

Callers interact with this new behavior via Cache.GetWithReadHandle, which is only for callers that intend to do a read and then populate the cache. If this method returns a ReadHandle, the caller has permission to do a read. See the ReadHandle comment for details of the contract.

Fixes cockroachdb#4138
Force-pushed 9ec9476 to 4f7bb5f.
TFTR!
Reviewable status: 0 of 10 files reviewed, 4 unresolved discussions (waiting on @jbowens and @RaduBerinde)
internal/cache/clockpro.go
line 141 at r4 (raw file):
Previously, RaduBerinde wrote…
[nit] does it need to be exported?
Fixed. I fell into the trap of following the pattern of other methods in `shard` being exported (`Set`/`Delete`/...), none of which need to be.
internal/cache/read_shard.go
line 69 at r4 (raw file):
Previously, RaduBerinde wrote…
[nit] Would it be cleaner to just add `readMap` to `shard` and just implement the methods on `*shard`? We can still keep them separate.
This is subjective, but I like the bigger separation. Putting this data structure in `shard` would move a bunch of commentary above into the `shard` declaration. Some of the methods could still be in this file, of course. Keeping this more separate makes our code easier to maintain, I think.
internal/cache/read_shard.go
line 226 at r4 (raw file):
Previously, RaduBerinde wrote…
We should do `for {` and use `continue`. Well, we don't even need continue, we can do:

```go
case _, ok := <-ch:
	if !ok {
		...
		return ..
	}
	// Probably granted permission to do the read; check again. NB: since isReading is
	// false, someone else can slip through before this thread acquires
	// e.mu, and take the turn.
```
Good point. Done.
Reviewed 5 of 10 files at r4, 5 of 5 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @jbowens)
internal/cache/read_shard.go
line 69 at r4 (raw file):
Previously, sumeerbhola wrote…
This is subjective, but I like the bigger separation. Putting this data structure in `shard` would move a bunch of commentary above into the `shard` declaration. Some of the methods could still be in this file, of course. Keeping this more separate makes our code easier to maintain, I think.
👍
CI occasionally encounters https://github.com/cockroachdb/pebble/actions/runs/12304080308/job/34340708418?pr=4199