-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: CheckSSTConflicts
computes incorrect stats for SST range keys
#94141
Comments
Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Previously, we were returning an obscure error around MVCCValueLenAndIsTombstone not being supported for interleaved intents if we encountered an intent in the engine. This change fixes that to return a more appropriate error. This change also fixes a case where the engine iterator would unexpectedly land on a key < the sst iterator, violating an invariant in this function. Fixes parts of cockroachdb#94141. Release note: None Epic: none
94395: storage: handle intents under ingested range key in CheckSSTConflicts r=erikgrinaker a=itsbilal Previously, we were returning an obscure error around MVCCValueLenAndIsTombstone not being supported for interleaved intents if we encountered an intent in the engine. This change fixes that to return a more appropriate error. This change also fixes a case where the engine iterator would unexpectedly land on a key < the sst iterator, violating an invariant in this function. Fixes parts of #94141. Release note: None Epic: None Co-authored-by: Bilal Akhtar <[email protected]>
Note that the first two parts of this issue are fixed by #94395. This issue is now solely about the stats divergence picked up by kvnemesis. |
Previously, we were returning an obscure error around MVCCValueLenAndIsTombstone not being supported for interleaved intents if we encountered an intent in the engine. This change fixes that to return a more appropriate error. This change also fixes a case where the engine iterator would unexpectedly land on a key < the sst iterator, violating an invariant in this function. Fixes parts of cockroachdb#94141. Release note: None Epic: none
Previously, CheckSSTConflicts would improperly call UnsafeKey on an exhausted sstIter. This could result in undefined, inconsistent behavior as UnsafeKey may point to arbitrary data once the iterator is exhausted. Informs cockroachdb#94141. Epic: None Release note: None
Previously, CheckSSTConflicts would improperly call UnsafeKey on an exhausted sstIter. This could result in undefined, inconsistent behavior as UnsafeKey may point to arbitrary data once the iterator is exhausted. Informs cockroachdb#94141. Epic: None Release note: None
96109: metrics: introduce CounterFloat64 and use for tenant RUs r=RaduBerinde,tbg a=dhartunian Previously, we were prevented from using float64s directly in metric counters since counters were limited to ints. This led to the use of Gauges in situations where Counters would be preferable since we didn't have code to help manage a monotonically increasing float64 value. This commit introduces some helpers around atomically adding float64s together and increasing one monotonically. Those primitives are composed into a `CounterFloat64` type further used to construct `AggCounterFloat64` which can be used in place of `AggCounter` to track per-tenant metrics. The two `GaugeFloat64` types used for tenant `totalRU` and `totalKVRU` metrics are replaced with the new `CounterFloat64` type to properly reflect the fact that these are monotonically increasing values. This helps Prometheus when scraping these metrics to correctly account for missing data if necessary. Resolves #68291 Epic: CRDB0-14536 Release note: None 96685: storage/pebbleiter: mangle unsafe buffers during positioning r=jbowens a=jbowens **storage/pebbleiter: mangle unsafe buffers between positioning methods** Randomly clobber the key and value buffers in order to ensure lifetimes are respected. This commit extends the `pebbleiter.assertionIter` type used in `crdb_test` builds that sits between pkg/storage and the pebble.Iterator type. It now will randomly zero the buffers used to return the previous iterator position's key and value, to ensure code isn't improperly relying on the stability of the these keys. **storage: fix UnsafeKey() usage of invalid iterator** Previously, CheckSSTConflicts would improperly call UnsafeKey on an exhausted sstIter. This could result in undefined, inconsistent behavior as UnsafeKey may point to arbitrary data once the iterator is exhausted. **kv/kvserver/spanset: fix read of invalid iterator's UnsafeKey** Previously, the spanset.MVCCIterator implementation would retrieve the contained iterator's UsafeKey() on an invalid iterator in Next, NextKey and Prev. This retrieval was harmless because it always checked the validity of the iterator before using it, but recent assertions in test builds prevent this usage. Epic: None Close #86657. Informs #94141. Release note: None 96832: ui: txn insights details fixes r=maryliag a=xinhaoz See individual commits. <img width="1115" alt="image" src="https://user-images.githubusercontent.com/20136951/217676969-94de02ba-1499-43c3-a542-ece3e5829a6c.png"> <img width="1911" alt="image" src="https://user-images.githubusercontent.com/20136951/217677020-c0a20035-f80b-472d-943d-a2bfd0c92b13.png"> 96893: roachtest: test with sqlalchemy 2.0 r=rafiss a=rafiss fixes #96880 backports for #96859 and #96856 Release note: None Co-authored-by: David Hartunian <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Xin Hao Zhang <[email protected]> Co-authored-by: Rafi Shamim <[email protected]>
CheckSSTConflicts
does not handle SST range keys correctlyCheckSSTConflicts
computes incorrect stats for SST range keys
98426: storage: Fix CheckSSTConflicts stats calculations r=erikgrinaker a=itsbilal Previously we were missing some cases of range key overlaps with other range keys, or were too eagerly stepping over engine keys that did not conflict with sstable keys, resulting in incorrect stats. This change adds more targeted test cases to TestEvalAddSSTable to test for those previously-unaccounted cases, and fixes them in CheckSSTConflicts. Informs #94140 and #98408. Fixes #94141. Epic: none Release note: None Co-authored-by: Bilal Akhtar <[email protected]>
98472: changefeedccl: enable TestChangefeedPropagatesTerminalError r=samiskin a=samiskin Fixes: #95057 The test TestChangefeedPropagatesTerminalError no longer seems to flake after running it under stress on a GCE worker for a while, so this change re-enables it. Release note: None 98475: kvnemesis: disable `AddSSTable` range keys r=erikgrinaker a=erikgrinaker These trigger MVCC stats bugs in `CheckSSTConflicts`. Touches #94141. Touches #98473. Touches #94876. Epic: none Release note: None Co-authored-by: Shiranka Miskin <[email protected]> Co-authored-by: Erik Grinaker <[email protected]>
A few additional fixes around CheckSSTConflicts, stats calculations, and Next()ing logic, caught by kvnemesis. Hopefully the last of its kind. Fixes cockroachdb#94141. Fixes cockroachdb#98473. Informs cockroachdb#94876. Epic: none Release note: None
@erikgrinaker, @itsbilal — given we're playing a whack-a-mole with CheckSSTConflicts bugs here, how do you feel about the risk to 23.1? An unexpected MVCCStats divergence is not treated as a fatal error by the consistency checker, correct? Is there any reason to conservatively mark stats as estimate in error-prone conditions (eg, like the SST contains range keys). |
Correct, they'll simply get corrected by the consistency checker on the next run, with a log message saying as much. We have some heuristics here and there which do rely on MVCC stats, but I don't believe any of this is critical -- for example, the MVCC GC queue will check MVCC stats to determine whether ranges should be aggressively GCed (when a range is fully deleted by an MVCC range tombstone). cockroach/pkg/kv/kvserver/mvcc_gc_queue.go Lines 530 to 539 in 736a67e
I'm not particularly worried about an occasional MVCC stats error in the wild, and I believe 23.1 will only ingest SSTs with MVCC range tombstones during C2C replication where |
98368: multiregionccl,server: use cached sqlliveness.Reader, deflake ColdStartLatencyTest r=ajwerner a=ajwerner #### multitenantccl: deflake ColdStartLatencyTest This test was flakey due to the closed timestamp sometimes not leading far for global tables due to overload, and due to a cached liveness reader not being used in distsql. The former was fixed in previous commits. The latter is fixed here. Fixes: #96334 #### sql: use CachedReader for uses with sqlinstance and the sql builtins The CachedReader won't block, which in multi-region clusters is good. It will mean that in some cases, it'll state that a sessions is alive when it most certainly is not. Currently, nobody needs synchronous semantics. This is a major part of fixing the TestColdStartLatency as sometimes distsql planning would block. That's not acceptable -- the idea that query physical planning can need to wait for a cross-region RPC is unacceptable. #### sqlliveness: re-arrange APIs to clarify when the API was blocking By "default" the implementation of Reader was blocking and there was a method to get a handle to a non-blocking CachedReader(). This asymmetry does not aid understandability. The non-blocking reader came later, but it is generally the more desirable interface. Release note: None 98519: storage: More CheckSSTConflicts fixes r=erikgrinaker a=itsbilal A few additional fixes around CheckSSTConflicts, stats calculations, and Next()ing logic, caught by kvnemesis. Hopefully the last of its kind. Also re-enable kvnemesis testing for range keys in AddSSTable, reverting #98475. Fixes #94141. Fixes #98473. Informs #94876. Epic: none Release note: None 98567: backupccl: use correct version gate for restore checkpointing r=adityamaru a=msbutler PR #97862 introduced a subtle bug which allowed the new restore checkpointing policy to take effect before the 23_1 migrations occured. This patch ensures the new policy only takes effect after all migrations occur. Release note: None Epic: None Co-authored-by: ajwerner <[email protected]> Co-authored-by: Bilal Akhtar <[email protected]> Co-authored-by: Michael Butler <[email protected]>
In #94140, I'm adding kvnemesis test coverage of
AddSSTable
ingestion of MVCC range keys. This found a number of problems:Conflicting intents are not handled correctly, errors with
checking for key collisions: not at MVCC value
. Fixed in storage: handle intents under ingested range key in CheckSSTConflicts #94395The SST and engine iterators get out of sync, errors with
checking for key collisions: expected engine iter to be ahead of sst iter
. Fixed in storage: handle intents under ingested range key in CheckSSTConflicts #94395Incorrect MVCC stats calculations. Not fixed yet
To reproduce, check out the branch in #94140 and run e.g.:
Jira issue: CRDB-22707
Epic CRDB-20465
The text was updated successfully, but these errors were encountered: