
acceptance: TestDockerCLI_test_demo_multitenant shutdown synchronization #110645

Closed
rafiss opened this issue Sep 14, 2023 · 3 comments
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). T-storage Storage Team

Comments


rafiss commented Sep 14, 2023

Failed here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Ci_Tests_Acceptance/11765319?buildTab=artifacts#

The PR (#110595) only changes the text of a SQL error message, so it is unrelated to this failure.

I230913 19:30:19.875967 16349 1@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6401  the server is terminating due to a fatal error (see the STORAGE channel for details)
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402  000410.sst:
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 +orig err: open ‹000410.sst›: file does not exist
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 +list err: open ‹./›: file does not exist
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !goroutine 16349 [running]:
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !runtime/debug.Stack()
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	GOROOT/src/runtime/debug/stack.go:24 +0x65
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(_, {{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}}, ...})
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/cockroach/pkg/util/log/clog.go:274 +0xb8
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal({0x793d508, 0xc00ad84ea0}, 0x3, 0x4, 0x3, 0x0?, {0x6364e88, 0x1d}, {0xc00a95c270, 0x3, ...})
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:109 +0x5a5
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(...)
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:39
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/cockroach/pkg/util/log.loggerStorage.FatalfDepth(...)
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/cockroach/bazel-out/k8-fastbuild/bin/pkg/util/log/log_channels_generated.go:2223
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/cockroach/pkg/storage.pebbleLogger.Fatalf({{0x793d508?, 0xc00ad84ea0?}, 0x0?}, {0x6364e88?, 0x46a9a7?}, {0xc00a95c270?, 0x57a5ee0?, 0xc006482801?})
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/cockroach/pkg/storage/pebble.go:712 +0x56
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/pebble/internal/base.MustExist({0x79b6600, 0xc008385920}, {0xc00e788060, 0xa}, {0x7f93acebfb50, 0xc00a50a918}, {0x78fde60, 0xc00a95c1b0})
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/internal/base/external/com_github_cockroachdb_pebble/internal/base/filenames.go:172 +0x283
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/pebble/objstorage/objstorageprovider.(*provider).vfsOpenForReading(0xc010bc9800, {0x2?, 0x0?}, 0x0?, {0x0?}, {0x0?})
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/objstorage/objstorageprovider/external/com_github_cockroachdb_pebble/objstorage/objstorageprovider/vfs.go:30 +0x145
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/pebble/objstorage/objstorageprovider.(*provider).OpenForReading(0xc010bc9800, {0x793d498, 0xc0000bc070}, 0x5f0fcc0?, {0x93ab949108?}, {0x80?})
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/objstorage/objstorageprovider/external/com_github_cockroachdb_pebble/objstorage/objstorageprovider/provider.go:225 +0x35a
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/pebble.(*tableCacheValue).load(0xc00a95c0c0, {{0xc006482cc8?}, 0xee3b6f?, 0xc00a95c120?}, 0xc005806790, 0xc0047695f8)
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:1056 +0xa3
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/pebble.(*tableCacheShard).findNode.func3({0x793d508, 0xc00a95c120})
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:816 +0x3c
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !runtime/pprof.Do({0x793d498?, 0xc0000bc058?}, {{0xc000620780?, 0xc006482b98?, 0xc006482dd8?}}, 0xc006482db0)
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	GOROOT/src/runtime/pprof/runtime.go:44 +0xa3
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc005806790, 0xc010bc8f00, 0xc0047695f8)
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:815 +0x4c5
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/pebble.(*tableCacheContainer).withReader(0x100000000003c?, {0xc00d4f5f30?}, 0xc006482ee0)
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:199 +0x6d
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/pebble.(*DB).loadTableStats(0xc006483080?, 0x3e48?, 0xc00e0f7380?, {0xf?})
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_stats.go:283 +0xe5
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/pebble.(*DB).scanReadStateTableStats(0xc00fc2c000, 0xc0066328d0, {0xc00afdd4c0?, 0x0?, 0xc00afdd370?})
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_stats.go:255 +0x3b5
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !github.com/cockroachdb/pebble.(*DB).collectTableStats(0xc00fc2c000)
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_stats.go:110 +0x245
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !created by github.com/cockroachdb/pebble.(*DB).maybeCollectTableStatsLocked
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_stats.go:46 +0xcc

Jira issue: CRDB-31541

@rafiss rafiss added C-test-failure Broken test (automatically or manually discovered). branch-master Failures and bugs on the master branch. T-storage Storage Team labels Sep 14, 2023
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Sep 14, 2023

jbowens commented Sep 14, 2023

I230913 19:30:19.870819 16350 3@pebble/event.go:699 ⋮ [n?,s?,pebble] 6388  [JOB 3] compacting(default) L0 [000413 000405 000410 000406 000407 000408 000409 000415] (30KB) Score=300.00 + L6 [000404] (281KB) Score=0.00; OverlappingRatio: Single 9.20, Multi 0.00
I230913 19:30:19.874899 16230 3@pebble/event.go:703 ⋮ [n4,s4,pebble] 6389  [JOB 362] compacted(default) L0 [000413 000405 000410 000406 000407 000408 000409] (26KB) Score=200.00 + L6 [000404] (281KB) Score=0.00 -> L6 [000414] (299KB), in 0.1s (0.1s total), output rate 3.5MB/s
I230913 19:30:19.875187 11465 3@pebble/event.go:735 ⋮ [n4,s4,pebble] 6390  [JOB 362] sstable deleted 000404
I230913 19:30:19.875254 11465 3@pebble/event.go:735 ⋮ [n4,s4,pebble] 6391  [JOB 362] sstable deleted 000405
I230913 19:30:19.875281 11465 3@pebble/event.go:735 ⋮ [n4,s4,pebble] 6392  [JOB 362] sstable deleted 000406
I230913 19:30:19.875293 11465 3@pebble/event.go:735 ⋮ [n4,s4,pebble] 6393  [JOB 362] sstable deleted 000407
I230913 19:30:19.875305 11465 3@pebble/event.go:735 ⋮ [n4,s4,pebble] 6394  [JOB 362] sstable deleted 000408
I230913 19:30:19.875317 11465 3@pebble/event.go:735 ⋮ [n4,s4,pebble] 6395  [JOB 362] sstable deleted 000409
I230913 19:30:19.875328 11465 3@pebble/event.go:735 ⋮ [n4,s4,pebble] 6396  [JOB 362] sstable deleted 000410
I230913 19:30:19.875339 11465 3@pebble/event.go:735 ⋮ [n4,s4,pebble] 6397  [JOB 362] sstable deleted 000413
I230913 19:30:19.875554 15724 server/server_controller_orchestration.go:251 ⋮ [T1,Vsystem,n4] 6398  server controller shutting down ungracefully
I230913 19:30:19.875596 15724 server/server_controller_orchestration.go:262 ⋮ [T1,Vsystem,n4] 6399  waiting for tenant servers to report stopped
W230913 19:30:19.875676 15724 server/server_sql.go:1705 ⋮ [T1,Vsystem,n4] 6400  server shutdown without a prior graceful drain
I230913 19:30:19.875967 16349 1@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6401  the server is terminating due to a fatal error (see the STORAGE channel for details)
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402  000410.sst:
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 +orig err: open ‹000410.sst›: file does not exist
F230913 19:30:19.875998 16349 3@pebble/internal/base/filenames.go:172 ⋮ [n?,s?,pebble] 6402 +list err: open ‹./›: file does not exist

Is it true that this test always creates in-memory Engines? If so, it looks like there's a race condition in file reference counting within Pebble, allowing the file 000410.sst to be deleted while the table stats collector still holds a reference on a readState that contains the file. This is still a little difficult to believe; we'd expect much more widespread failures. Maybe the race is somehow specific to the initial version loaded from the manifest? In this instance, the table stats collector is still performing its initial scans of the database tables.

The list err: open ./: file does not exist secondary error is bizarre and indicates that after the file was found to not exist, Pebble tried to list the store directory and found that the entire store directory did not exist. Did something remove the store directory or are the paths incorrect?
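The refcounting invariant in question can be sketched as follows. This is a hypothetical, simplified stand-in for Pebble's per-sstable metadata (the names `fileMeta`, `ref`, and `unref` are illustrative, not Pebble's actual types): a file may only become obsolete, and hence deletable, once every holder — the current version and any readState — has released its reference.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// fileMeta is an illustrative stand-in for per-sstable metadata:
// the file may only be deleted once its reference count drops to zero.
type fileMeta struct {
	num  int
	refs atomic.Int32
}

func (f *fileMeta) ref() { f.refs.Add(1) }

// unref reports whether the file just became obsolete (refcount hit
// zero), at which point it is eligible for deletion.
func (f *fileMeta) unref() bool { return f.refs.Add(-1) == 0 }

func main() {
	f := &fileMeta{num: 410}
	f.ref() // reference held by the current version
	f.ref() // reference held by a readState (e.g. the stats collector's)

	// Dropping the version's reference alone must not delete the file...
	if f.unref() {
		panic("deleted while a readState still references it")
	}
	// ...only once the readState also releases its reference is
	// deletion safe. The suspected race is a deletion that happens
	// despite an outstanding readState reference.
	if f.unref() {
		fmt.Printf("%06d.sst obsolete; safe to delete\n", f.num)
	}
}
```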

jbowens added a commit to jbowens/pebble that referenced this issue Sep 14, 2023
Double check the file reference counts before attempting to find/create a table
cache node for a file. Once a file's reference count falls to zero, the file
becomes obsolete and may be deleted at any moment.

Today, if a race breaks this invariant and we attempt to load a file with a
nonpositive reference count, it's relatively unlikely to manifest as an
error. Typically tables remain open in the table cache, allowing the table
cache to serve the request even if the file is no longer linked into the data
directory. Additionally, even if it's not in the table cache presently,
deletion of obsolete files may be delayed due to deletion pacing, hiding the
race.

This commit preemptively asserts on the file reference counts. I opted for not
restricting this invariant check to invariants builds because it's cheap
relative to a table cache lookup, and it's a particularly tricky form of
corruption to debug otherwise.

Informs cockroachdb/cockroach#110645.

jbowens commented Sep 14, 2023

I230913 19:30:19.870819 16350 3@pebble/event.go:699 ⋮ [n?,s?,pebble] 6388  [JOB 3] compacting(default) L0 [000413 000405 000410 000406 000407 000408 000409 000415] (30KB) Score=300.00 + L6 [000404] (281KB) Score=0.00; OverlappingRatio: Single 9.20, Multi 0.00
I230913 19:30:19.874899 16230 3@pebble/event.go:703 ⋮ [n4,s4,pebble] 6389  [JOB 362] compacted(default) L0 [000413 000405 000410 000406 000407 000408 000409] (26KB) Score=200.00 + L6 [000404] (281KB) Score=0.00 -> L6 [000414] (299KB), in 0.1s (0.1s total), output rate 3.5MB/s
I230913 19:30:19.875187 11465 3@pebble/event.go:735 ⋮ [n4,s4,pebble] 6390  [JOB 362] sstabl

I missed this initially: I'm pretty sure this test is opening the same store multiple times. In these logs we see a high-numbered compaction job 362 running at the same time as a low-numbered job 3 compaction, and most of the constituent files are the same.

jbowens added a commit to jbowens/pebble that referenced this issue Sep 14, 2023
Previously MemFS's Lock method did not implement mutual exclusion, under the
assumption that databases are given separate processes which do not share
memory. This commit adapts Lock to enforce mutual exclusion, preventing
accidental concurrent use of the same FS from more than one Pebble instance.
It's believed that cockroachdb/cockroach#110645 is the result of Cockroach
acceptance tests opening a MemFS store twice, with one Pebble instance deleting
files from beneath another Pebble instance.
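The idea behind that fix can be sketched with an in-process lock. This is an illustrative `memLock` type, not Pebble's actual `vfs.MemFS` implementation: an in-memory filesystem tracks whether the store's lock is held so that a second Open fails fast, rather than two Pebble instances silently sharing one store and deleting files out from under each other.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// memLock is an illustrative sketch of an in-memory store lock: since
// both "processes" share the same address space, mutual exclusion must
// be enforced in-process rather than by the OS's file locking.
type memLock struct {
	held atomic.Bool
}

// Lock acquires the store lock, failing if another instance holds it.
// The returned func releases the lock.
func (l *memLock) Lock() (release func(), err error) {
	if !l.held.CompareAndSwap(false, true) {
		return nil, fmt.Errorf("lock already held by another instance")
	}
	return func() { l.held.Store(false) }, nil
}

func main() {
	var l memLock
	release, err := l.Lock()
	fmt.Println("first open:", err) // first open: <nil>
	_, err = l.Lock()
	fmt.Println("second open fails:", err != nil) // second open fails: true
	release()
	_, err = l.Lock()
	fmt.Println("reopen after close:", err) // reopen after close: <nil>
}
```

With a real on-disk store the OS file lock provides this exclusion; the bug was that the in-memory FS skipped the equivalent check.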
@jbowens jbowens self-assigned this Sep 14, 2023
jbowens added a commit to cockroachdb/pebble that referenced this issue Sep 15, 2023
jbowens added a commit to cockroachdb/pebble that referenced this issue Sep 15, 2023
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 15, 2023
```
6d6570bf vfs: enforce mutual exclusion in MemFS.Lock
336c9979 metamorphic: vary sstable compression algorithm
c91e8796 db: double check file reference counts when loading file
22fbb69a scripts: generate code coverage with invariants build tag
529d256a db: use invalidating iterators with low probability under invariants
bb9d6ab6 internal/invalidating: propagate lazy value fetching
9dbff72c internal/invalidating: trash trailers with 0xff
2689f0d2 internal/invalidating: move invalidating iter
96978427 db: add explicit levelIter pointer
20e07e1f db: avoid type conversion during iterator read sampling
```

Informs cockroachdb#110645.
Epic: none
Release note: none
@jbowens jbowens changed the title pebble: file does not exist error caused fatal error [TestDockerCLI_test_demo_multitenant] acceptance: TestDockerCLI_test_demo_multitenant shutdown synchronization Sep 15, 2023
craig bot pushed a commit that referenced this issue Sep 15, 2023
110651: kvnemesis: add support for shared lock r=miraradeva a=arulajmani

This patch teaches KVNemesis to generate batches that acquire shared locks, using Get, Scan, and ReverseScan requests.

Closes #100173
Release note: None

110696: release: do not bump version if already bumped r=celiala a=rail

Previously, if the version was bumped manually, the automation would try to bump it again, write the same content to the file, and fail to commit the changes.

This PR skips the version bump if it has already been done. Additionally, some of the commit messages were adjusted.

Fixes: RE-468
Release note: None

110723: go.mod: bump Pebble to 6d6570bf1e25 r=RaduBerinde a=jbowens

```
6d6570bf vfs: enforce mutual exclusion in MemFS.Lock
336c9979 metamorphic: vary sstable compression algorithm
c91e8796 db: double check file reference counts when loading file
22fbb69a scripts: generate code coverage with invariants build tag
529d256a db: use invalidating iterators with low probability under invariants
bb9d6ab6 internal/invalidating: propagate lazy value fetching
9dbff72c internal/invalidating: trash trailers with 0xff
2689f0d2 internal/invalidating: move invalidating iter
96978427 db: add explicit levelIter pointer
20e07e1f db: avoid type conversion during iterator read sampling
```

Informs #110645.
Epic: none
Release note: none

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
@jbowens
Copy link
Collaborator

jbowens commented Sep 18, 2023

Closing as a duplicate of #110748

@jbowens jbowens closed this as not planned (duplicate) Sep 18, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Sep 18, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024