
db: refactor obsolete file deletion code #2641

Merged

RaduBerinde merged 1 commit into cockroachdb:master from cleanup-manager on Jun 26, 2023

Conversation

RaduBerinde
Member

@RaduBerinde RaduBerinde commented Jun 16, 2023

The code around obsolete file deletion is very convoluted and there is
one known race (described in #2630).

This change reworks the deletion code as follows:

  • we add a cleanup manager which has a background goroutine.
  • all file and object deletions happen in the manager's background
    goroutine.
  • callers can enqueue cleanup jobs with the manager.
  • tests can wait for outstanding cleanup jobs to finish (see the sketch below).
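
For concreteness, here is a minimal sketch of that manager shape in Go. All names here (`cleanupManager`, `cleanupJob`, `EnqueueJob`, `Wait`) are illustrative assumptions, not the PR's actual code:

```go
package cleanup

import "sync"

// cleanupJob is a unit of deferred deletion work (file or object removal).
type cleanupJob func()

// cleanupManager owns a single background goroutine that performs all
// deletions; callers enqueue jobs, and tests can Wait() for the queue to
// drain.
type cleanupManager struct {
	jobsCh chan cleanupJob

	mu struct {
		sync.Mutex
		cond      sync.Cond
		queued    int // jobs enqueued so far
		completed int // jobs finished so far
	}
}

func newCleanupManager() *cleanupManager {
	cm := &cleanupManager{jobsCh: make(chan cleanupJob, 1000)}
	cm.mu.cond.L = &cm.mu.Mutex
	go cm.mainLoop()
	return cm
}

func (cm *cleanupManager) mainLoop() {
	for job := range cm.jobsCh {
		job() // perform the deletion outside any DB mutex
		cm.mu.Lock()
		cm.mu.completed++
		cm.mu.cond.Broadcast()
		cm.mu.Unlock()
	}
}

// EnqueueJob queues a deletion; the send blocks if the buffer is full,
// which backpressures callers instead of dropping jobs.
func (cm *cleanupManager) EnqueueJob(job cleanupJob) {
	cm.mu.Lock()
	cm.mu.queued++
	cm.mu.Unlock()
	cm.jobsCh <- job
}

// Wait blocks until every job enqueued before the call has completed.
// It must not be called while holding locks that jobs may need.
func (cm *cleanupManager) Wait() {
	cm.mu.Lock()
	defer cm.mu.Unlock()
	n := cm.mu.queued
	for cm.mu.completed < n {
		cm.mu.cond.Wait()
	}
}

// Close drains any queued jobs and stops the background goroutine.
func (cm *cleanupManager) Close() {
	close(cm.jobsCh)
	cm.Wait()
}
```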

We also add a missing call to deleteObsoleteObjects from the ingest
path (which now can cause files to be removed through excising).

Finally, we fix an incorrect Remove call in the error path of
Checkpoint() (which was missing a PathJoin).
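
To illustrate the bug class (a hedged sketch, not the exact diff; destDir and fileName are hypothetical names, and fs stands in for a vfs.FS-style filesystem with Remove and PathJoin):

```go
// Buggy error path: removes fileName relative to the working directory.
_ = fs.Remove(fileName)

// Fixed: the file lives inside the checkpoint directory, so join first.
_ = fs.Remove(fs.PathJoin(destDir, fileName))
```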

Fixes #2630.

@RaduBerinde RaduBerinde requested review from jbowens and a team June 16, 2023 02:30
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@jbowens jbowens left a comment


Looking good!

Reviewed 11 of 17 files at r1, all commit messages.
Reviewable status: 11 of 17 files reviewed, 8 unresolved discussions (waiting on @RaduBerinde)


cleaner.go line 50 at r1 (raw file):

// In practice, we should rarely have more than a couple of jobs (in most cases
// we Wait() after queueing a job).
const jobsChLen = 10000

is this maybe too big? With 8-byte words, 80KB?
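(For scale: assuming each buffered slot is a single pointer-sized job reference, 10,000 slots reserve 10,000 × 8 B ≈ 80 KB for the channel buffer alone, before any jobs are queued.)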


cleaner.go line 103 at r1 (raw file):

// Note that logs in obsoleteFiles might be recycled instead of deleted.
//
// Returns a channel that is closed when the job completes.

Looks like this comment is out-of-date.


cleaner.go line 120 at r1 (raw file):

		}
		// Something is terribly wrong... Just drop the job.
		cm.opts.Logger.Infof("cleanup jobs queue full")

this makes me nervous even though it shouldn't happen. since enqueues are already mutex-protected, should we use a preallocated slice instead of a channel? if the slice becomes full, we can always suffer the allocation to grow it.


compaction.go line 2636 at r1 (raw file):

		d.updateTableStatsLocked(ve.NewFiles)
	}
	d.deleteObsoleteFiles(jobID, true /* waitForCompletion */)

I think this is a behavior change. Previously, with deletion pacing enabled, waitForOngoing would determine whether or not we should wait for our turn to call doDeleteObsoleteFiles, but that function wouldn't wait for the files to physically be deleted.


compaction.go line 3488 at r1 (raw file):

	d.mu.Unlock()
	defer d.mu.Lock()
	d.cleanupManager.Wait()

I didn't follow this. why do we wait for the cleaner but only if d.mu.disableFileDeletions <= 1? couldn't another goroutine have just incremented disableFileDeletions and dropped the mu to wait for the cleaner? Then when the next goroutine increments, it avoids waiting. Shouldn't they receive the same treatment?


compaction.go line 3569 at r1 (raw file):

	// Release d.mu while doing I/O
	// Note the unusual order: Unlock and then Lock.

this function no longer does I/O directly, right? Should we update this comment?


compaction.go line 3583 at r1 (raw file):

	}
	_, noRecycle := d.opts.Cleaner.(base.NeedsFileContents)
	filesToDelete := make([]obsoleteFile, 0, len(files))

looking at this existing code, is there any reason for this files array? couldn't we populate a single filesToDelete slice up above, and then down here perform all the same logic we do within the below for loop?

relatedly, I think this capacity for the filesToDelete slice wasn't the intended capacity. len(files) is always 4.


event_listener_test.go line 115 at r1 (raw file):

				defer d.mu.Unlock()
				d.enableFileDeletions()
				d.TestOnlyWaitForCleaning()

It looks like d.mu is held here, which is prohibited when calling cleanupManager.Wait(), right?

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: 10 of 17 files reviewed, 9 unresolved discussions (waiting on @jbowens)


cleaner.go line 50 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

is this maybe too big? With 8-byte words, 80KB?

Reduced to 1000


cleaner.go line 103 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Looks like this comment is out-of-date.

Done.


cleaner.go line 120 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this makes me nervous even though it shouldn't happen. since enqueues are already mutex-protected, should we use a preallocated slice instead of a channel? if the slice becomes full, we can always suffer the allocation to grow it.

I'm hesitant to stop using the channel because the synchronization is much easier.. We could also do a blocking send, so that we wait for some jobs to finish? We already call EnqueueJob without holding the mutex.
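
For concreteness, a sketch of the two options, reusing cm.jobsCh from the snippet above (the surrounding job variable and logger field are assumptions):

```go
// Option 1: non-blocking send (r1 behavior) — drops the job when the
// buffer is full.
select {
case cm.jobsCh <- job:
default:
	// Something is terribly wrong... Just drop the job.
	cm.opts.Logger.Infof("cleanup jobs queue full")
}

// Option 2: blocking send — the caller stalls until the background
// goroutine frees a slot, so no job is ever dropped.
cm.jobsCh <- job
```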


compaction.go line 2636 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think this is a behavior change. Previously, with deletion pacing enabled,waitForOngoing would determine whether or not we should wait for our turn to call doDeleteObsoleteFiles, but that function wouldn't wait for the files to physically be deleted.

But it would wait for any deletion that was just queued (e.g. from maybeScheduleObsoleteTableDeletion) to physically delete files.. If we accept that maybeScheduleObsoleteTableDeletion can run at any time, then we should treat both cases the same..

One could argue that the waiting was just an artifact of the poor design (no queue) and we really don't want to wait anywhere.. But I think we want some kind of backpressure - with a low pacing limit, the deletions could become the bottleneck and we don't want them to fall behind indefinitely..


compaction.go line 3488 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I didn't follow this. why do we wait for the cleaner but only if d.mu.disableFileDeletions <= 1? couldn't another goroutine have just incremented disableFileDeletions and dropped the mu to wait for the cleaner. Then when the next goroutine increments, it avoids waiting. Shouldn't they receive the same treatment?

Nice catch, fixed.
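
A hedged sketch of the fixed shape (matching the Unlock-then-deferred-Lock snippet above; the function body is an assumption, not the exact diff): every caller now drops d.mu and waits, regardless of the counter value.

```go
// disableFileDeletions is called with d.mu held. It pauses new deletion
// jobs and waits out any in-flight ones; note the unusual order of
// Unlock followed by a deferred Lock, which releases d.mu only for the
// duration of the wait.
func (d *DB) disableFileDeletions() {
	d.mu.disableFileDeletions++
	d.mu.Unlock()
	defer d.mu.Lock()
	d.cleanupManager.Wait()
}
```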


compaction.go line 3569 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this function no longer does I/O directly, right? Should we update this comment?

Done.


compaction.go line 3583 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

looking at this existing code, is there any reason for this files array? couldn't we populate a single filesToDelete slice up above, and then down here perform all the same logic we do within the below for loop?

relatedly, I think this capacity for the filesToDelete slice wasn't the intended capacity. len(files) is always 4.

We'd have to do the "conversion" from fileInfo to obsoleteFile in each place (or at least somehow associate the fileType with each entry), and in the loop we'd have to delete recycled logs from the slice. Not sure it ends up better so I'll leave it as is. I fixed the capacity.


compaction.go line 2118 at r2 (raw file):

	}

	d.deleteObsoleteFiles(jobID, true /* waitForCompletion */)

Note that I switched this call to wait so that the comment below makes sense.. Not that it made sense with the previous code though (with pacing enabled)..


event_listener_test.go line 115 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

It looks d.mu is held here which is prohibited when calling cleanupManager.Wait(), right?

Done.

@RaduBerinde RaduBerinde requested a review from a team June 21, 2023 16:49
Collaborator

@jbowens jbowens left a comment


Looks good to me, just have an open question of what to do to avoid backpressuring compactions.

Reviewed 3 of 5 files at r3, all commit messages.
Reviewable status: 12 of 17 files reviewed, 3 unresolved discussions (waiting on @RaduBerinde)


cleaner.go line 120 at r1 (raw file):

Previously, RaduBerinde wrote…

I'm hesitant to stop using the channel because the synchronization is much easier.. We could also do a blocking send, so that we wait for some jobs to finish? We already call EnqueueJob without holding the mutex.

Yeah, I think a blocking send is a good option.


compaction.go line 2636 at r1 (raw file):

Previously, RaduBerinde wrote…

But it would wait for any deletion that was just queued (e.g. from maybeScheduleObsoleteTableDeletion) to physically delete files.. If we accept that maybeScheduleObsoleteTableDeletion can run at any time, then we should treat both cases the same..

One could argue that the waiting was just an artifact of the poor design (no queue) and we really don't want to wait anywhere.. But I think we want some kind of backpressure - with a low pacing limit, the deletions could become the bottleneck and we don't want them to fall behind indefinitely..

If the pacing is limiting the rate of deletions, I think the disk-space utilization heuristic will at least prevent us from exhausting the storage. The deletion pacing exists to protect foreground write latency by avoiding sudden bursts of deletions. I don't think we actually want to backpressure compactions which can degrade foreground read performance.

IMO our current deletion pacing is a little lacking, because it doesn't adapt to disk throughput. We want to smooth the deletions over time but not outright limit them. We could consider adapting the rate according to the deletions requested over a recent long window, so we allow the rate to increase if we're deleting more files, but we still smooth it. But I think that's for future work.
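
As a rough illustration of that future direction (a sketch using golang.org/x/time/rate; deletionPacer and its methods are hypothetical, not Pebble's API):

```go
package main

import (
	"context"

	"golang.org/x/time/rate"
)

// deletionPacer smooths deletions by pacing them in bytes rather than
// issuing them in bursts.
type deletionPacer struct {
	limiter *rate.Limiter
}

func newDeletionPacer(bytesPerSec float64, burstBytes int) *deletionPacer {
	return &deletionPacer{
		limiter: rate.NewLimiter(rate.Limit(bytesPerSec), burstBytes),
	}
}

// pace blocks until the limiter grants size bytes; size must not exceed
// the configured burst.
func (p *deletionPacer) pace(ctx context.Context, size int) error {
	return p.limiter.WaitN(ctx, size)
}

// adapt adjusts the rate from bytes requested over a recent window, so
// sustained demand raises the rate (smoothing, not outright limiting).
func (p *deletionPacer) adapt(bytesOverWindow, windowSecs float64) {
	p.limiter.SetLimit(rate.Limit(bytesOverWindow / windowSecs))
}
```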


compaction.go line 3583 at r1 (raw file):

Previously, RaduBerinde wrote…

We'd have to do the "conversion" from fileInfo to obsoleteFile in each place (or at least somehow associate the fileType with each entry), and in the loop we'd have to delete recycled logs from the slice. Not sure it ends up better so I'll leave it as is. I fixed the capacity.

Got it

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: 12 of 17 files reviewed, 3 unresolved discussions (waiting on @jbowens)


cleaner.go line 120 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Yeah, I think a blocking send is a good option.

Done.


compaction.go line 2636 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

If the pacing is limiting the rate of deletions, I think the disk-space utilization heuristic will at least prevent us from exhausting the storage. The deletion pacing exists to protect foreground write latency by avoiding sudden bursts of deletions. I don't think we actually want to backpressure compactions which can degrade foreground read performance.

IMO our current deletion pacing is a little lacking, because it doesn't adapt to disk throughput. We want to smooth the deletions over time but not outright limit them. We could consider adapting the rate according to the deletions requested over a recent long window, so we allow the rate to increase if we're deleting more files, but we still smooth it. But I think that's for future work.

Ok, I can try to switch to not wait in the compaction and ingest code paths. I'll see how I can make the tests deterministic..

Should an explicit db.Compact() call wait for cleanup?

@RaduBerinde
Member Author

compaction.go line 2636 at r1 (raw file):

Previously, RaduBerinde wrote…

Ok, I can try to switch to not wait in the compaction and ingest code paths. I'll see how I can make the tests deterministic..

Should an explicit db.Compact() call wait for cleanup?

Hm, I guess maybe not, because we may be compacting a tiny range but there might be background cleanup across the entire key space

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: 6 of 19 files reviewed, 3 unresolved discussions (waiting on @jbowens)


compaction.go line 2636 at r1 (raw file):

Previously, RaduBerinde wrote…

Hm, I guess maybe not, because we may be compacting a tiny range but there might be background cleanup across the entire key space

Ok, I think I got it working. For one test I had to use an "always wait" hack but for others waiting for cleanup after Compact calls and such was sufficient.

Collaborator

@jbowens jbowens left a comment


:lgtm:

Reviewable status: 6 of 19 files reviewed, 2 unresolved discussions (waiting on @RaduBerinde)


cleaner.go line 120 at r1 (raw file):

Previously, RaduBerinde wrote…

Done.

Thanks! #2662 validates that this is the right move, because our fixed deletion pacing rate can cause us to fall behind unboundedly on file deletions.

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: 6 of 19 files reviewed, 2 unresolved discussions (waiting on @jbowens)


compaction.go line 2118 at r2 (raw file):

Previously, RaduBerinde wrote…

Note that I switched this call to wait so that the comment below makes sense.. Not that it made sense with the previous code though (with pacing enabled)..

I'm still not very sure about this wait here. Maybe it doesn't matter much anyway? AFAICT we only use Flush() for testing in CRDB

Collaborator

@jbowens jbowens left a comment


Reviewable status: 6 of 19 files reviewed, 2 unresolved discussions (waiting on @RaduBerinde)


compaction.go line 2118 at r2 (raw file):

Previously, RaduBerinde wrote…

I'm still not very sure about this wait here. Maybe it doesn't matter much anyway? AFAICT we only use Flush() for testing in CRDB

Thanks for calling this out again. This function will get called for non-explicit flushes when the memtable becomes full too. Waiting will prevent new flushes from scheduling. I don't think we do want to wait here.

Member Author

@RaduBerinde RaduBerinde left a comment


Reviewable status: 6 of 19 files reviewed, 2 unresolved discussions (waiting on @jbowens)


cleaner.go line 120 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Thanks! #2662 validates that this is the right move, because our fixed deletion pacing rate can cause us to fall behind unboundedly on file deletions.

Makes sense.. The only remaining place where we wait is during Open().. I don't think we want to wait there either, do we?

@RaduBerinde
Member Author

I removed the wait flag altogether; we now never wait for cleanup except for some tests.

Collaborator

@jbowens jbowens left a comment


:lgtm:

Reviewed 6 of 11 files at r4, 7 of 7 files at r5, all commit messages.
Reviewable status: 19 of 22 files reviewed, 1 unresolved discussion (waiting on @RaduBerinde)

@RaduBerinde
Copy link
Member Author

TFTR!

@RaduBerinde RaduBerinde merged commit 74b3421 into cockroachdb:master Jun 26, 2023
@RaduBerinde RaduBerinde deleted the cleanup-manager branch June 26, 2023 19:32
itsbilal added a commit to itsbilal/pebble that referenced this pull request Jul 11, 2023
Before cockroachdb#2641, we had a conditional to explicitly schedule deletion of
obsolete files in db.Close(). We took it out as part of the
deletion refactor as the comment on it made it seem no longer necessary.
However, the code for readState.unrefLocked(), which we call earlier in Close(),
explicitly states that it does not schedule obsolete file deletion and
that it expects the caller to schedule that. Since we're calling
readState.unrefLocked() here, we should be doing said deletion.

Fixes cockroachdb#2708.
itsbilal added a commit that referenced this pull request Jul 11, 2023
Successfully merging this pull request may close these issues.

db: TestSharedObjectDeletePacing is flaky