Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: TestSharedObjectDeletePacing is flaky #2630

Closed
nicktrav opened this issue Jun 12, 2023 · 8 comments · Fixed by #2641
Closed

db: TestSharedObjectDeletePacing is flaky #2630

nicktrav opened this issue Jun 12, 2023 · 8 comments · Fixed by #2641

Comments

@nicktrav
Copy link
Contributor

Seen here: https://github.com/cockroachdb/pebble/actions/runs/5247409884/jobs/9477540951

Reproduces on Linux, locally:

$ go test -run TestSharedObjectDeletePacing -count 1 -exec 'stress ' -timeout 0 -test.v .
...
=== RUN   TestSharedObjectDeletePacing
2023/06/12 16:20:38 shared storage configured; no creatorID yet
2023/06/12 16:20:38 shared storage creatorID set to 1
    compaction_test.go:3979: compaction timed out, possibly due to incorrect deletion pacing
--- FAIL: TestSharedObjectDeletePacing (60.00s)
FAIL


ERROR: exit status 1

1m10s: 8200 runs so far, 14 failures (0.17%)
@nicktrav
Copy link
Contributor Author

cc: @RaduBerinde

@RaduBerinde RaduBerinde self-assigned this Jun 13, 2023
@RaduBerinde
Copy link
Member

@jbowens In this test I call Set + Compact for a bunch of keys, then I DeleteRange + Compact a subset of those keys, expecting some sstables to be deleted. Sometimes the compaction doesn't result in any obsolete tables. Do I have the wrong expectations about what Compact() is supposed to do? Can you think off the top of your head what am I missing?

for i := 1; i <= numKeys; i++ {

@jbowens
Copy link
Collaborator

jbowens commented Jun 13, 2023

The test looks reasonable to me.

The comparison of the number of objects broadly has the potential to be brittle, because the number of sstables and where they're split is very timing dependent. Range tombstones can be fragmented into several sstables and may linger in the LSM if they're move-compacted. But no particular scenario comes to mind that would actually preserve numKeys-2 sstables.

An alternative structure could add a TableDeleted event listener to ensure that tables are being deleted (and maybe verify in the provider that they're gone):

		EventListener: &EventListener{
			TableDeleted: func(info TableDeleteInfo) {
				// ...
			},
		},

@RaduBerinde
Copy link
Member

Yes, I can make it less brittle by comparing the before and after filenums. Unfortunately, every now and then the compaction doesn't delete any tables (so it's not a matter of having a brittle check); I'll try to figure out why.

@RaduBerinde
Copy link
Member

It looks like the compaction does delete the files from the new version but the files themselves don't end up being deleted. Something must be holding on to the previous version perhaps..

@RaduBerinde
Copy link
Member

It looks like collectTableStats is running at the same time with the compaction. That causes the readState to have an outstanding reference when it is unrefed by the compaction. When the table stats collection finishes, we do readState.unref and that runs maybeScheduleObsoleteTableDeletion. But in the latter acquireCleaningTurn fails because the compaction is doing deleteObsoleteFiles (even though it has nothing to delete).

It seems that we have a bug here.. Any suggestions on how to fix? In maybeScheduleObsoleteTableDeletion, we could move acquireCleaningTurn into the goroutine and set waitForOngoing=true?

@jbowens
Copy link
Collaborator

jbowens commented Jun 13, 2023

My mind melts every time I try to wrap my head around maybeScheduleObsoleteTableDeletion, deleteObsoleteFiles/doDeleteObsoleteFiles, acquireCleaningTurn/releaseCleaningTurn, enableFileDeletions/disableFileDeletions, paceAndDeleteObsoleteFiles and how they all interact.

Not that we need to do it now, but maybe we should consider a dedicated cleaning goroutine.

@RaduBerinde
Copy link
Member

I agree - I'll give it a shot.

RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Jun 16, 2023
The code around obsolete file deletion is very convoluted and there is
one known race (described in cockroachdb#2630).

This change reworks the deletion code as follows:
 - we add a cleanup manager which has a background goroutine.
 - all file and object deletions happen in the manager's background
   goroutine.
 - callers can enqueue cleanup jobs with the manager
 - callers can wait for outstanding cleanup jobs to finish

We also add a missing call to `deleteObsoleteObjects` from the ingest
path (which now can cause files to be removed through excising).

Finally, we fix an incorrect `Remove` call in the error path of
`Checkpoint()` (which was missing a `PathJoin`).

Fixes cockroachdb#2630.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Jun 17, 2023
The code around obsolete file deletion is very convoluted and there is
one known race (described in cockroachdb#2630).

This change reworks the deletion code as follows:
 - we add a cleanup manager which has a background goroutine.
 - all file and object deletions happen in the manager's background
   goroutine.
 - callers can enqueue cleanup jobs with the manager
 - callers can wait for outstanding cleanup jobs to finish

We also add a missing call to `deleteObsoleteObjects` from the ingest
path (which now can cause files to be removed through excising).

Finally, we fix an incorrect `Remove` call in the error path of
`Checkpoint()` (which was missing a `PathJoin`).

Fixes cockroachdb#2630.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Jun 19, 2023
The code around obsolete file deletion is very convoluted and there is
one known race (described in cockroachdb#2630).

This change reworks the deletion code as follows:
 - we add a cleanup manager which has a background goroutine.
 - all file and object deletions happen in the manager's background
   goroutine.
 - callers can enqueue cleanup jobs with the manager
 - callers can wait for outstanding cleanup jobs to finish

We also add a missing call to `deleteObsoleteObjects` from the ingest
path (which now can cause files to be removed through excising).

Finally, we fix an incorrect `Remove` call in the error path of
`Checkpoint()` (which was missing a `PathJoin`).

Fixes cockroachdb#2630.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Jun 21, 2023
The code around obsolete file deletion is very convoluted and there is
one known race (described in cockroachdb#2630).

This change reworks the deletion code as follows:
 - we add a cleanup manager which has a background goroutine.
 - all file and object deletions happen in the manager's background
   goroutine.
 - callers can enqueue cleanup jobs with the manager
 - callers can wait for outstanding cleanup jobs to finish

We also add a missing call to `deleteObsoleteObjects` from the ingest
path (which now can cause files to be removed through excising).

Finally, we fix an incorrect `Remove` call in the error path of
`Checkpoint()` (which was missing a `PathJoin`).

Fixes cockroachdb#2630.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Jun 22, 2023
The code around obsolete file deletion is very convoluted and there is
one known race (described in cockroachdb#2630).

This change reworks the deletion code as follows:
 - we add a cleanup manager which has a background goroutine.
 - all file and object deletions happen in the manager's background
   goroutine.
 - callers can enqueue cleanup jobs with the manager
 - callers can wait for outstanding cleanup jobs to finish

We also add a missing call to `deleteObsoleteObjects` from the ingest
path (which now can cause files to be removed through excising).

Finally, we fix an incorrect `Remove` call in the error path of
`Checkpoint()` (which was missing a `PathJoin`).

Fixes cockroachdb#2630.
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Jun 22, 2023
The code around obsolete file deletion is very convoluted and there is
one known race (described in cockroachdb#2630).

This change reworks the deletion code as follows:
 - we add a cleanup manager which has a background goroutine.
 - all file and object deletions happen in the manager's background
   goroutine.
 - callers can enqueue cleanup jobs with the manager
 - tests can wait for outstanding cleanup jobs to finish

We also add a missing call to `deleteObsoleteObjects` from the ingest
path (which now can cause files to be removed through excising).

Finally, we fix an incorrect `Remove` call in the error path of
`Checkpoint()` (which was missing a `PathJoin`).

Fixes cockroachdb#2630.
RaduBerinde added a commit that referenced this issue Jun 26, 2023
The code around obsolete file deletion is very convoluted and there is
one known race (described in #2630).

This change reworks the deletion code as follows:
 - we add a cleanup manager which has a background goroutine.
 - all file and object deletions happen in the manager's background
   goroutine.
 - callers can enqueue cleanup jobs with the manager
 - tests can wait for outstanding cleanup jobs to finish

We also add a missing call to `deleteObsoleteObjects` from the ingest
path (which now can cause files to be removed through excising).

Finally, we fix an incorrect `Remove` call in the error path of
`Checkpoint()` (which was missing a `PathJoin`).

Fixes #2630.
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants