
db: refactor obsolete file deletion code #2641

Merged

RaduBerinde merged 1 commit into cockroachdb:master from cleanup-manager on Jun 26, 2023

Conversation

RaduBerinde
Member

@RaduBerinde RaduBerinde commented Jun 16, 2023

The code around obsolete file deletion is very convoluted and there is
one known race (described in #2630).

This change reworks the deletion code as follows:

  • we add a cleanup manager which has a background goroutine.
  • all file and object deletions happen in the manager's background
    goroutine.
  • callers can enqueue cleanup jobs with the manager.
  • tests can wait for outstanding cleanup jobs to finish (see the sketch below).
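
For concreteness, here is a minimal sketch of that manager shape in Go. All names here (`cleanupManager`, `cleanupJob`, `EnqueueJob`, `Wait`) are illustrative assumptions, not the PR's actual code:

```go
package cleanup

import "sync"

// cleanupJob is a unit of deferred deletion work (file or object removal).
type cleanupJob func()

// cleanupManager owns a single background goroutine that performs all
// deletions; callers enqueue jobs, and tests can Wait() for the queue to
// drain.
type cleanupManager struct {
	jobsCh chan cleanupJob

	mu struct {
		sync.Mutex
		cond      sync.Cond
		queued    int // jobs enqueued so far
		completed int // jobs finished so far
	}
}

func newCleanupManager() *cleanupManager {
	cm := &cleanupManager{jobsCh: make(chan cleanupJob, 1000)}
	cm.mu.cond.L = &cm.mu.Mutex
	go cm.mainLoop()
	return cm
}

func (cm *cleanupManager) mainLoop() {
	for job := range cm.jobsCh {
		job() // perform the deletion outside any DB mutex
		cm.mu.Lock()
		cm.mu.completed++
		cm.mu.cond.Broadcast()
		cm.mu.Unlock()
	}
}

// EnqueueJob queues a deletion; the send blocks if the buffer is full,
// which backpressures callers instead of dropping jobs.
func (cm *cleanupManager) EnqueueJob(job cleanupJob) {
	cm.mu.Lock()
	cm.mu.queued++
	cm.mu.Unlock()
	cm.jobsCh <- job
}

// Wait blocks until every job enqueued before the call has completed.
// It must not be called while holding locks that jobs may need.
func (cm *cleanupManager) Wait() {
	cm.mu.Lock()
	defer cm.mu.Unlock()
	n := cm.mu.queued
	for cm.mu.completed < n {
		cm.mu.cond.Wait()
	}
}

// Close drains any queued jobs and stops the background goroutine.
func (cm *cleanupManager) Close() {
	close(cm.jobsCh)
	cm.Wait()
}
```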

We also add a missing call to deleteObsoleteObjects from the ingest
path (which now can cause files to be removed through excising).

Finally, we fix an incorrect Remove call in the error path of
Checkpoint() (which was missing a PathJoin).
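
To illustrate the bug class (a hedged sketch, not the exact diff; destDir and fileName are hypothetical names, and fs stands in for a vfs.FS-style filesystem with Remove and PathJoin):

```go
// Buggy error path: removes fileName relative to the working directory.
_ = fs.Remove(fileName)

// Fixed: the file lives inside the checkpoint directory, so join first.
_ = fs.Remove(fs.PathJoin(destDir, fileName))
```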

Fixes #2630.

@RaduBerinde RaduBerinde requested review from jbowens and a team June 16, 2023 02:30
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@jbowens jbowens left a comment


Looking good!

Reviewed 11 of 17 files at r1, all commit messages.
Reviewable status: 11 of 17 files reviewed, 8 unresolved discussions (waiting on @RaduBerinde)


cleaner.go line 50 at r1 (raw file):

// In practice, we should rarely have more than a couple of jobs (in most cases
// we Wait() after queueing a job).
const jobsChLen = 10000

is this maybe too big? With 8-byte words, 80KB?
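(For scale: assuming each buffered slot is a single pointer-sized job reference, 10,000 slots reserve 10,000 × 8 B ≈ 80 KB for the channel buffer alone, before any jobs are queued.)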


cleaner.go line 103 at r1 (raw file):

// Note that logs in obsoleteFiles might be recycled instead of deleted.
//
// Returns a channel that is closed when the job completes.

Looks like this comment is out-of-date.


cleaner.go line 120 at r1 (raw file):

		}
		// Something is terribly wrong... Just drop the job.
		cm.opts.Logger.Infof("cleanup jobs queue full")

this makes me nervous even though it shouldn't happen. since enqueues are already mutex-protected, should we use a preallocated slice instead of a channel? if the slice becomes full, we can always suffer the allocation to grow it.


compaction.go line 2636 at r1 (raw file):

		d.updateTableStatsLocked(ve.NewFiles)
	}
	d.deleteObsoleteFiles(jobID, true /* waitForCompletion */)

I think this is a behavior change. Previously, with deletion pacing enabled, waitForOngoing would determine whether or not we should wait for our turn to call doDeleteObsoleteFiles, but that function wouldn't wait for the files to physically be deleted.


compaction.go line 3488 at r1 (raw file):

	d.mu.Unlock()
	defer d.mu.Lock()
	d.cleanupManager.Wait()

I didn't follow this. why do we wait for the cleaner but only if d.mu.disableFileDeletions <= 1? couldn't another goroutine have just incremented disableFileDeletions and dropped the mu to wait for the cleaner? Then when the next goroutine increments, it avoids waiting. Shouldn't they receive the same treatment?


compaction.go line 3569 at r1 (raw file):

	// Release d.mu while doing I/O
	// Note the unusual order: Unlock and then Lock.

this function no longer does I/O directly, right? Should we update this comment?


compaction.go line 3583 at r1 (raw file):

	}
	_, noRecycle := d.opts.Cleaner.(base.NeedsFileContents)
	filesToDelete := make([]obsoleteFile, 0, len(files))

looking at this existing code, is there any reason for this files array? couldn't we populate a single filesToDelete slice up above, and then down here perform all the same logic we do within the below for loop?

relatedly, I think this capacity for the filesToDelete slice wasn't the intended capacity. len(files) is always 4.


event_listener_test.go line 115 at r1 (raw file):

				defer d.mu.Unlock()
				d.enableFileDeletions()
				d.TestOnlyWaitForCleaning()

It looks like d.mu is held here, which is prohibited when calling cleanupManager.Wait(), right?

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: 10 of 17 files reviewed, 9 unresolved discussions (waiting on @jbowens)


cleaner.go line 50 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

is this maybe too big? With 8-byte words, 80KB?

Reduced to 1000


cleaner.go line 103 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Looks like this comment is out-of-date.

Done.


cleaner.go line 120 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this makes me nervous even though it shouldn't happen. since enqueues are already mutex-protected, should we use a preallocated slice instead of a channel? if the slice becomes full, we can always suffer the allocation to grow it.

I'm hesitant to stop using the channel because the synchronization is much easier.. We could also do a blocking send, so that we wait for some jobs to finish? We already call EnqueueJob without holding the mutex.
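
For concreteness, a sketch of the two options, reusing cm.jobsCh from the snippet above (the surrounding job variable and logger field are assumptions):

```go
// Option 1: non-blocking send (r1 behavior) — drops the job when the
// buffer is full.
select {
case cm.jobsCh <- job:
default:
	// Something is terribly wrong... Just drop the job.
	cm.opts.Logger.Infof("cleanup jobs queue full")
}

// Option 2: blocking send — the caller stalls until the background
// goroutine frees a slot, so no job is ever dropped.
cm.jobsCh <- job
```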


compaction.go line 2636 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think this is a behavior change. Previously, with deletion pacing enabled,waitForOngoing would determine whether or not we should wait for our turn to call doDeleteObsoleteFiles, but that function wouldn't wait for the files to physically be deleted.

But it would wait for any deletion that was just queued (e.g. from maybeScheduleObsoleteTableDeletion) to physically delete files.. If we accept that maybeScheduleObsoleteTableDeletion can run at any time, then we should treat both cases the same..

One could argue that the waiting was just an artifact of the poor design (no queue) and we really don't want to wait anywhere.. But I think we want some kind of backpressure - with a low pacing limit, the deletions could become the bottleneck and we don't want them to fall behind indefinitely..


compaction.go line 3488 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I didn't follow this. why do we wait for the cleaner but only if d.mu.disableFileDeletions <= 1? couldn't another goroutine have just incremented disableFileDeletions and dropped the mu to wait for the cleaner. Then when the next goroutine increments, it avoids waiting. Shouldn't they receive the same treatment?

Nice catch, fixed.
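
A hedged sketch of the fixed shape (matching the Unlock-then-deferred-Lock snippet above; the function body is an assumption, not the exact diff): every caller now drops d.mu and waits, regardless of the counter value.

```go
// disableFileDeletions is called with d.mu held. It pauses new deletion
// jobs and waits out any in-flight ones; note the unusual order of
// Unlock followed by a deferred Lock, which releases d.mu only for the
// duration of the wait.
func (d *DB) disableFileDeletions() {
	d.mu.disableFileDeletions++
	d.mu.Unlock()
	defer d.mu.Lock()
	d.cleanupManager.Wait()
}
```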


compaction.go line 3569 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this function no longer does I/O directly, right? Should we update this comment?

Done.


compaction.go line 3583 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

looking at this existing code, is there any reason for this files array? couldn't we populate a single filesToDelete slice up above, and then down here perform all the same logic we do within the below for loop?

relatedly, I think this capacity for the filesToDelete slice wasn't the intended capacity. len(files) is always 4.

We'd have to do the "conversion" from fileInfo to obsoleteFile in each place (or at least somehow associate the fileType with each entry), and in the loop we'd have to delete recycled logs from the slice. Not sure it ends up better so I'll leave it as is. I fixed the capacity.


compaction.go line 2118 at r2 (raw file):

	}

	d.deleteObsoleteFiles(jobID, true /* waitForCompletion */)

Note that I switched this call to wait so that the comment below makes sense.. Not that it made sense with the previous code though (with pacing enabled)..


event_listener_test.go line 115 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

It looks d.mu is held here which is prohibited when calling cleanupManager.Wait(), right?

Done.

@RaduBerinde RaduBerinde requested a review from a team June 21, 2023 16:49
Collaborator

@jbowens jbowens left a comment


Looks good to me, just have an open question of what to do to avoid backpressuring compactions.

Reviewed 3 of 5 files at r3, all commit messages.
Reviewable status: 12 of 17 files reviewed, 3 unresolved discussions (waiting on @RaduBerinde)


cleaner.go line 120 at r1 (raw file):

Previously, RaduBerinde wrote…

I'm hesitant to stop using the channel because the synchronization is much easier.. We could also do a blocking send, so that we wait for some jobs to finish? We already call EnqueueJob without holding the mutex.

Yeah, I think a blocking send is a good option.


compaction.go line 2636 at r1 (raw file):

Previously, RaduBerinde wrote…

But it would wait for any deletion that was just queued (e.g. from maybeScheduleObsoleteTableDeletion) to physically delete files.. If we accept that maybeScheduleObsoleteTableDeletion can run at any time, then we should treat both cases the same..

One could argue that the waiting was just an artifact of the poor design (no queue) and we really don't want to wait anywhere.. But I think we want some kind of backpressure - with a low pacing limit, the deletions could become the bottleneck and we don't want them to fall behind indefinitely..

If the pacing is limiting the rate of deletions, I think the disk-space utilization heuristic will at least prevent us from exhausting the storage. The deletion pacing exists to protect foreground write latency by avoiding sudden bursts of deletions. I don't think we actually want to backpressure compactions which can degrade foreground read performance.

IMO our current deletion pacing is a little lacking, because it doesn't adapt to disk throughput. We want to smooth the deletions over time but not outright limit them. We could consider adapting the rate according to the deletions requested over a recent long window, so we allow the rate to increase if we're deleting more files, but we still smooth it. But I think that's for future work.
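
As a rough illustration of that future direction (a sketch using golang.org/x/time/rate; deletionPacer and its methods are hypothetical, not Pebble's API):

```go
package main

import (
	"context"

	"golang.org/x/time/rate"
)

// deletionPacer smooths deletions by pacing them in bytes rather than
// issuing them in bursts.
type deletionPacer struct {
	limiter *rate.Limiter
}

func newDeletionPacer(bytesPerSec float64, burstBytes int) *deletionPacer {
	return &deletionPacer{
		limiter: rate.NewLimiter(rate.Limit(bytesPerSec), burstBytes),
	}
}

// pace blocks until the limiter grants size bytes; size must not exceed
// the configured burst.
func (p *deletionPacer) pace(ctx context.Context, size int) error {
	return p.limiter.WaitN(ctx, size)
}

// adapt adjusts the rate from bytes requested over a recent window, so
// sustained demand raises the rate (smoothing, not outright limiting).
func (p *deletionPacer) adapt(bytesOverWindow, windowSecs float64) {
	p.limiter.SetLimit(rate.Limit(bytesOverWindow / windowSecs))
}
```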


compaction.go line 3583 at r1 (raw file):

Previously, RaduBerinde wrote…

We'd have to do the "conversion" from fileInfo to obsoleteFile in each place (or at least somehow associate the fileType with each entry), and in the loop we'd have to delete recycled logs from the slice. Not sure it ends up better so I'll leave it as is. I fixed the capacity.

Got it

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: 12 of 17 files reviewed, 3 unresolved discussions (waiting on @jbowens)


cleaner.go line 120 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Yeah, I think a blocking send is a good option.

Done.


compaction.go line 2636 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

If the pacing is limiting the rate of deletions, I think the disk-space utilization heuristic will at least prevent us from exhausting the storage. The deletion pacing exists to protect foreground write latency by avoiding sudden bursts of deletions. I don't think we actually want to backpressure compactions which can degrade foreground read performance.

IMO our current deletion pacing is a little lacking, because it doesn't adapt to disk throughput. We want to smooth the deletions over time but not outright limit them. We could consider adapting the rate according to the deletions requested over a recent long window, so we allow the rate to increase if we're deleting more files, but we still smooth it. But I think that's for future work.

Ok, I can try to switch to not wait in the compaction and ingest code paths. I'll see how I can make the tests deterministic..

Should an explicit db.Compact() call wait for cleanup?

@RaduBerinde
Member Author

compaction.go line 2636 at r1 (raw file):

Previously, RaduBerinde wrote…

Ok, I can try to switch to not wait in the compaction and ingest code paths. I'll see how I can make the tests deterministic..

Should an explicit db.Compact() call wait for cleanup?

Hm, I guess maybe not, because we may be compacting a tiny range but there might be background cleanup across the entire key space

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: 6 of 19 files reviewed, 3 unresolved discussions (waiting on @jbowens)


compaction.go line 2636 at r1 (raw file):

Previously, RaduBerinde wrote…

Hm, I guess maybe not, because we may be compacting a tiny range but there might be background cleanup across the entire key space

Ok, I think I got it working. For one test I had to use an "always wait" hack but for others waiting for cleanup after Compact calls and such was sufficient.

Collaborator

@jbowens jbowens left a comment


:lgtm:

Reviewable status: 6 of 19 files reviewed, 2 unresolved discussions (waiting on @RaduBerinde)


cleaner.go line 120 at r1 (raw file):

Previously, RaduBerinde wrote…

Done.

Thanks! #2662 validates that this is the right move, because our fixed deletion pacing rate can cause us to fall behind unboundedly on file deletions.

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: 6 of 19 files reviewed, 2 unresolved discussions (waiting on @jbowens)


compaction.go line 2118 at r2 (raw file):

Previously, RaduBerinde wrote…

Note that I switched this call to wait so that the comment below makes sense.. Not that it made sense with the previous code though (with pacing enabled)..

I'm still not very sure about this wait here. Maybe it doesn't matter much anyway? AFAICT we only use Flush() for testing in CRDB

Collaborator

@jbowens jbowens left a comment


Reviewable status: 6 of 19 files reviewed, 2 unresolved discussions (waiting on @RaduBerinde)


compaction.go line 2118 at r2 (raw file):

Previously, RaduBerinde wrote…

I'm still not very sure about this wait here. Maybe it doesn't matter much anyway? AFAICT we only use Flush() for testing in CRDB

Thanks for calling this out again. This function will get called for non-explicit flushes when the memtable becomes full too. Waiting will prevent new flushes from scheduling. I don't think we do want to wait here.

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: 6 of 19 files reviewed, 2 unresolved discussions (waiting on @jbowens)


cleaner.go line 120 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Thanks! #2662 validates that this is the right move, because our fixed deletion pacing rate can cause us to fall behind unboundedly on file deletions.

Makes sense.. The only remaining place where we wait is during Open().. I don't think we want to wait there either, do we?

@RaduBerinde
Member Author

I removed the wait flag altogether; we now never wait for cleanup except for some tests.

Collaborator

@jbowens jbowens left a comment


:lgtm:

Reviewed 6 of 11 files at r4, 7 of 7 files at r5, all commit messages.
Reviewable status: 19 of 22 files reviewed, 1 unresolved discussion (waiting on @RaduBerinde)

@RaduBerinde
Copy link
Member Author

TFTR!

@RaduBerinde RaduBerinde merged commit 74b3421 into cockroachdb:master Jun 26, 2023
@RaduBerinde RaduBerinde deleted the cleanup-manager branch June 26, 2023 19:32
itsbilal added a commit to itsbilal/pebble that referenced this pull request Jul 11, 2023
Before cockroachdb#2641, we had a conditional to explicitly schedule deletion of
obsolete files in db.Close(). We took it out as part of the
deletion refactor as the comment on it made it seem no longer necessary.
However, the code for readState.unrefLocked(), which we call earlier in Close(),
explicitly states that it does not schedule obsolete file deletion and
that it expects the caller to schedule that. Since we're calling
readState.unrefLocked() here, we should be doing said deletion.

Fixes cockroachdb#2708.
itsbilal added a commit that referenced this pull request Jul 11, 2023
Successfully merging this pull request may close these issues.

db: TestSharedObjectDeletePacing is flaky