storage: unexpected GC queue activity immediately after DROP #20554
Yeah, as suspected: we get a large score, but don't get to GC anything. That's unexpected, as I thought this problem had already been addressed. Taking a look.
Yikes, so here's a problem that I just verified in a unit test. Assume that at timestamp 0, you write a key-value pair that is 1MB in total. Then, 1000 seconds later, you delete it (i.e. write a tombstone at timestamp 1000E9 nanoseconds). What should the GCBytesAge be at that point? The answer is zero, as the data has been deleted for 0 seconds. But, problematically, we actually compute 1000 seconds' worth of age (1MB × 1000s), because the shadowed value accrues age from its own timestamp rather than from the deletion. This explains the behavior I'm seeing: the restore I'm running has all of its data written sometime in February this year. As it gets deleted (MVCC tombstoned), it should start out with a GCBytesAge of zero, but instead it immediately picks up many months' worth of age, which triggers the GC queue even though nothing can actually be collected until the TTL has passed.
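To make the arithmetic concrete, here is a minimal Go sketch (a hypothetical helper, not the actual MVCCStats code) contrasting the age the old semantics assign with the age one would expect, for the example above:

```go
package main

import "fmt"

const mb = 1 << 20

// gcBytesAgeAt returns the GCBytesAge (in byte-seconds) contributed by a 1MB
// value written at writeSec and deleted at deleteSec, evaluated at nowSec.
// Illustrative sketch only, not the real MVCCStats code.
func gcBytesAgeAt(writeSec, deleteSec, nowSec int64, fromOwnTimestamp bool) int64 {
	var from int64
	if fromOwnTimestamp {
		from = writeSec // old (incorrect) semantics: age accrues from the value's own timestamp
	} else {
		from = deleteSec // expected semantics: age accrues only once the value is non-live
	}
	if nowSec < from {
		return 0
	}
	return mb * (nowSec - from)
}

func main() {
	// Value written at t=0s, deleted at t=1000s, evaluated right at the deletion.
	fmt.Println(gcBytesAgeAt(0, 1000, 1000, true))  // old: 1048576000 byte-seconds
	fmt.Println(gcBytesAgeAt(0, 1000, 1000, false)) // expected: 0
}
```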
The stats are fixable, but it's not clear how to best roll out that fix (we'd need to force full stats recomputations on everything). I'll whip up a fix and then we can worry about how it would actually be retrofitted to existing data.
I spent the last couple of hours trying to fix the stats computation, and while I think I got it mostly right, there's an unfortunate complication regarding the two extra implementations of stats keeping (the iterator-based recomputation, which exists once in Go and once in C++). To illustrate this, consider the case in which one writes a value (150 bytes of key and value in total) and later deletes it with a tombstone (10 bytes).
In the MVCC layer, the stats would compute as follows: when the tombstone is written, the previous value is read as part of the write, so its 150 bytes (plus the tombstone's 10) can be charged to GCBytesAge from the deletion timestamp onward.
The iterator will instead visit the keys in reverse order, seeing first the tombstone (total size 10) and then the value (total size 150). But it has to augment GCBytesAge when it sees the tombstone, and it hasn't seen the value yet (which carries the bulk of the weight). Generalizing this to the case in which there are long chains of deletions and recreations, it becomes obvious that the iteration needs to be in reverse order for the computation to be possible (at which point you can build up an MVCCStats organically). The property that made the old computation work with forward iteration was that it needed only the key at hand to determine its ultimate impact on GCBytesAge, but that computation was incorrect. For extra fun, a copy of that code exists in C++ as well.
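For illustration, here is a minimal sketch (hypothetical types, not the actual `ComputeStats` code) of the bookkeeping for a single key's version history when visited in chronological order, under the intended semantics: a version starts aging when the next write shadows it, and a tombstone ages from its own timestamp:

```go
package main

import "fmt"

// version is a simplified MVCC version for illustration; the real code
// operates on encoded keys and MVCCMetadata.
type version struct {
	tsSec     int64 // timestamp in seconds
	sizeBytes int64 // key+value size
	tombstone bool
}

// gcBytesAge accumulates the age contribution (at nowSec) for one key's
// version history visited oldest-first. A version becomes non-live, and
// starts aging, when the next write shadows it; a tombstone ages from its
// own timestamp. Sketch only.
func gcBytesAge(oldestFirst []version, nowSec int64) int64 {
	var age int64
	for i, v := range oldestFirst {
		if v.tombstone {
			age += v.sizeBytes * (nowSec - v.tsSec)
			continue
		}
		if i+1 < len(oldestFirst) {
			// Shadowed by the next (newer) write; ages from that point on.
			age += v.sizeBytes * (nowSec - oldestFirst[i+1].tsSec)
		}
		// Otherwise this is the newest version and still live: no age.
	}
	return age
}

func main() {
	history := []version{
		{tsSec: 0, sizeBytes: 150},                    // the value
		{tsSec: 1000, sizeBytes: 10, tombstone: true}, // the tombstone that deletes it
	}
	fmt.Println(gcBytesAge(history, 1000)) // 0: nothing has been non-live for any time yet
	fmt.Println(gcBytesAge(history, 1100)) // (150+10)*100 = 16000 byte-seconds
}
```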
I've made some progress rewriting the Go iterator version, but it'll take a few more days of hunkering down and adding commentary. There's also some complexity that leaks into
@tschottdorf In your rewrite, you need to call |
@petermattis I'm aware of these limitations and am equally concerned about them. Let's chat next week. I'm not thrilled about a slice as we couldn't bound its size. Have to think some more about this. |
I forgot to write this earlier: I think we can still do the forward iteration without the need for a slice. This is hopefully less bad than I think (don't mark my words just yet, though). |
The good news is that I managed to convince myself that fixing the stats computation itself is doable. Instead, a snag I've hit is with the incremental updates of stats during writes. To see what the problem is, think about aborting an intent in today's code. The code that does this naively sees only the intent's meta record, but that is not enough to reconstruct the previous stats. This is easy to see if the intent is a deletion: there is no information about the previous value in the meta (it's just a tombstone), yet the live bytes already reflect the previous value as having been deleted (NB: if we changed this, committing would be harder instead). Once I make the change that ties GCBytesAge accrual to the timestamp of the shadowing write, this kind of incremental update has to be revisited as well. On top of fixing this, we also need to discuss whether or how to fix the incorrect stats out there. My hope is that we can make do with a workaround in the GC queue: if the GC cycle finishes without having gotten to delete anything, (in-memory) mark the replica as "non-gc'able" for the next
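As a sketch of the point about aborting a deletion intent (all names hypothetical; this is not the real MVCCStats update code): the meta record alone doesn't carry the previous version's size, yet exactly that information is needed to restore the live bytes and to roll back any age the shadowed value accrued:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for MVCCStats bookkeeping.
type stats struct {
	liveBytes  int64
	gcBytesAge int64 // byte-seconds
}

// versionInfo is what has to be read back from below the intent: the
// previous version's size (and, in the full bookkeeping, its timestamp).
type versionInfo struct {
	sizeBytes int64
	tsSec     int64
}

// abortDeletionIntent undoes the stats effect of a deletion intent written
// at intentTSSec on top of prev, evaluated at nowSec. Sketch only.
func abortDeletionIntent(ms *stats, prev versionInfo, intentTSSec, nowSec int64) {
	// The previous value becomes live again...
	ms.liveBytes += prev.sizeBytes
	// ...and the age it accrued since the intent shadowed it is rolled back.
	ms.gcBytesAge -= prev.sizeBytes * (nowSec - intentTSSec)
}

func main() {
	// A 150-byte value deleted (intent) at t=1000s; the intent is aborted at t=1100s.
	ms := stats{liveBytes: 0, gcBytesAge: 150 * 100}
	abortDeletionIntent(&ms, versionInfo{sizeBytes: 150, tsSec: 0}, 1000, 1100)
	fmt.Printf("%+v\n", ms) // {liveBytes:150 gcBytesAge:0}
}
```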
Actually it seems rather reasonable that the GC queue would compute the associated stats mismatch while it's going through all the keys and versions anyway, and then tack on a diff to apply. We can compute this efficiently by refactoring |
To address cockroachdb#20554, we'll need to use this in more places. Release note: None
Found while investigating cockroachdb#20554 (though this does not at all fix the issue discovered there; that will be a separate PR). We were incorrectly updating GCBytesAge when moving an intent. This change raises the question of what to do about the divergent stats that exist in real-world clusters. As part of addressing cockroachdb#20554, we'll need a mechanism to correct the stats anyway, and so I will defer its introduction. You'll want to view this diff with `?w=1` (insensitive to whitespace changes). Release note: None.
Found while investigating cockroachdb#20554 (though this does not at all fix the issue discovered there; that will be a separate PR). We were incorrectly updating GCBytesAge when moving an intent. The old code was pretty broken (and, with hindsight, it still is after this commit, as the various child commits expose). Its main problem was that it failed to account for the `GCBytesAge` difference that would result from moving the intent, due to the incorrect assumption that the size of the intent would remain the same. The code was also somewhat opaque, and an attempt has been made to improve its legibility. This change raises the question of what to do about the divergent stats that exist in real-world clusters. As part of addressing cockroachdb#20554, we'll need a mechanism to correct the stats anyway, and so I will defer its introduction. You'll want to view this diff with `?w=1` (insensitive to whitespace changes). Release note: None.
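A toy illustration of the accounting bug described above (hypothetical names and a simplified size model; the real code operates on encoded MVCC keys and metadata): when an intent is rewritten at a new timestamp, its on-disk footprint can change, and the stats delta is the difference of the two footprints rather than zero:

```go
package main

import "fmt"

// intentFootprint is a toy model of how many bytes an intent's versioned
// key+value contribute to the stats.
type intentFootprint struct {
	keyBytes int64 // encoded key size (includes the timestamp)
	valBytes int64
}

func (f intentFootprint) total() int64 { return f.keyBytes + f.valBytes }

// moveIntentDelta returns the change in key+value bytes when an intent is
// rewritten at a new timestamp. The buggy assumption called out in the
// commit message amounts to returning 0 here unconditionally.
func moveIntentDelta(before, after intentFootprint) int64 {
	return after.total() - before.total()
}

func main() {
	before := intentFootprint{keyBytes: 30, valBytes: 150}
	after := intentFootprint{keyBytes: 32, valBytes: 150} // timestamp encoding grew by 2 bytes
	fmt.Println(moveIntentDelta(before, after))            // 2, not 0
}
```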
RaftTombstoneKey was accidentally made a replicated key when it was first introduced, a problem we first realized existed when it was [included in snapshots]. At the time, we included workarounds to skip this key in various places (snapshot application, consistency checker), but of course we have failed to insert further hacks of the same kind elsewhere since (the one prompting this PR being the stats recomputation on splits, which I'm looking into as part of cockroachdb#20181 -- unfortunately this commit doesn't seem to pertain to that problem). It feels sloppy that we didn't follow through back then, but luckily the damage appears to be limited; it is likely that the replicated existence of this key results in MVCCStats SysBytes inconsistencies, but as it happens, these stats are [already] [very] [inconsistent]. This commit does a few things:
- renames the old tombstone key to `RaftIncorrectLegacyTombstoneKey`
- introduces a (correctly unreplicated) `RaftTombstoneKey`
- introduces a migration: once activated, only the new tombstone is written, but both tombstones are checked. Additionally, as the node restarts, all legacy tombstones are replaced by correct non-legacy ones.
- when applying a snapshot, forcibly deletes any legacy tombstone contained within (even before the cluster version is bumped). This prevents new legacy tombstones from trickling in from other nodes.
`RaftIncorrectLegacyTombstoneKey` can be purged from the codebase in binaries post v2.1, as at that point all peers have booted with a version that runs the migration. Thus, post v2.1, the replica consistency checker can stop skipping the legacy tombstone key. Fixes cockroachdb#12154. Release note: None
[included in snapshots]: cockroachdb#12131
[already]: cockroachdb#20554
[very]: cockroachdb#20996
[inconsistent]: cockroachdb#21070
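A rough sketch of the dual-read during the migration window (hypothetical types; the tie-break via the larger replica ID is my assumption, not something stated in the commit message): both the new and the legacy tombstone are consulted, and whichever is more restrictive wins:

```go
package main

import "fmt"

// tombstone is a stand-in for the Raft tombstone payload.
type tombstone struct {
	nextReplicaID int
	found         bool
}

// effectiveTombstone checks both the new (unreplicated) tombstone and the
// legacy (incorrectly replicated) one and returns the one to enforce.
// Sketch only; the real migration lives in the replica lifecycle code.
func effectiveTombstone(newTS, legacyTS tombstone) (tombstone, bool) {
	switch {
	case newTS.found && legacyTS.found:
		if legacyTS.nextReplicaID > newTS.nextReplicaID {
			return legacyTS, true
		}
		return newTS, true
	case newTS.found:
		return newTS, true
	case legacyTS.found:
		return legacyTS, true
	default:
		return tombstone{}, false
	}
}

func main() {
	got, ok := effectiveTombstone(
		tombstone{nextReplicaID: 7, found: true},
		tombstone{nextReplicaID: 5, found: true},
	)
	fmt.Println(got.nextReplicaID, ok) // 7 true
}
```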
A number of bugs in our MVCCStats logic have been fixed recently (see for example cockroachdb#20996) and others are still present (cockroachdb#20554). For both, and also potentially for future bugs or deliberate adjustments of the computations, we need a mechanism to recompute the stats in order to purge incorrect numbers from the system over time. Such a mechanism is introduced here. It consists of two main components:
- A new RPC `AdjustStats`, which applies to a single Range and computes the difference between the persisted stats and the recomputed ones; it can "inject" a suitable delta into the stats (thus fixing the divergence) or do a "dry run".
- A trigger in the consistency checker that runs on the coordinating node (the lease holder). The consistency checker already recomputes the stats, and it can compare them against the persisted stats and judge whether there is a divergence. If there is one, naively one may hope to just insert the newly computed diff into the range, but this is not ideal due to concerns about double application and racing consistency checks. Instead, we use `AdjustStats` (which, of course, was introduced for this purpose), which strikes a balance between efficiency and correctness.
Updates cockroachdb#20554. Release note (general change): added a mechanism to recompute range stats on demand.
The semantics for computing GCBytesAge were incorrect and are fixed in this commit. Prior to this commit, a non-live write would accrue GCBytesAge from its own timestamp on. That is, if you wrote two versions of a key at 1s and 2s, then when the older version is replaced (at 2s) it would start out with one second of age (from 1s to 2s). However, the key *really* became non-live at 2s, and should have had an age of zero. By extension, imagine a table with lots of writes all dating back to early 2017, and assume that today (early 2018) all these writes are deleted (i.e. a tombstone is placed on top of them). Prior to this commit, each key would immediately get assigned an age of `(early 2018) - (early 2017)`, i.e. a very large number. Yet, the GC queue could only garbage collect them after `(early 2018) + TTL`, so by default 25 hours after the deletion. We use GCBytesAge to trigger the GC queue, so that would cause the GC queue to run without ever getting to remove anything, for the duration of the TTL. This was a big problem bound to be noticed by users. This commit changes the semantics to what the GC queue (and the layman) expects:
1. when a version is shadowed, it becomes non-live at that point and also starts accruing GCBytesAge from that point on.
2. deletion tombstones are an exception: they accrue age from their own timestamp on. This makes sense because a tombstone can be deleted whenever it's older than the TTL (as opposed to a value, which can only be deleted once it's been *shadowed* for longer than the TTL).
This work started out by updating `ComputeStatsGo` to have the desired semantics, fixing up existing tests, and then stress testing `TestMVCCStatsRandomized` with short history lengths to discover failure modes, which were then transcribed into small unit tests. When no more such failures were discoverable, the resulting logic in the various incremental MVCCStats update helpers was simplified and documented, and `ComputeStats` updated accordingly. In turn, `TestMVCCStatsBasic` was removed: it was notoriously hard to read and maintain, and does not add any coverage at this point. The recomputation of the stats in existing clusters is addressed in cockroachdb#21345. Fixes cockroachdb#20554.
A number of bugs in our MVCCStats logic have been fixed recently (see for example cockroachdb#20996) and others are still present (cockroachdb#20554). For both, and also potentially for future bugs or deliberate adjustments of the computations, we need a mechanism to recompute the stats in order to purge incorrect numbers from the system over time. Such a mechanism is introduced here. It consists of two main components:
- A new RPC `RecomputeStats`, which applies to a single Range and computes the difference between the persisted stats and the recomputed ones; it can "inject" a suitable delta into the stats (thus fixing the divergence) or do a "dry run".
- A trigger in the consistency checker that runs on the coordinating node (the lease holder). The consistency checker already recomputes the stats, and it can compare them against the persisted stats and judge whether there is a divergence. If there is one, naively one may hope to just insert the newly computed diff into the range, but this is not ideal due to concerns about double application and racing consistency checks. Instead, we use `RecomputeStats` (which, of course, was introduced for this purpose), which strikes a balance between efficiency and correctness.
Updates cockroachdb#20554. Release note (general change): added a mechanism to recompute range stats on demand.
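A minimal sketch of the delta logic behind such a `RecomputeStats`-style request (hypothetical stats type; not the actual RPC implementation): recompute the stats from the data, diff them against the persisted ones, and either report the diff (dry run) or fold it into the persisted value:

```go
package main

import "fmt"

// mvccStats is a stand-in with only the fields needed for the sketch.
type mvccStats struct {
	LiveBytes  int64
	GCBytesAge int64
}

func (ms *mvccStats) add(o mvccStats)      { ms.LiveBytes += o.LiveBytes; ms.GCBytesAge += o.GCBytesAge }
func (ms *mvccStats) subtract(o mvccStats) { ms.LiveBytes -= o.LiveBytes; ms.GCBytesAge -= o.GCBytesAge }

// recomputeStats returns the delta between the freshly recomputed stats and
// the persisted ones, and applies it unless this is a dry run. Sketch only.
func recomputeStats(persisted *mvccStats, recomputed mvccStats, dryRun bool) mvccStats {
	delta := recomputed
	delta.subtract(*persisted)
	if !dryRun {
		persisted.add(delta) // persisted now matches the recomputed stats
	}
	return delta
}

func main() {
	persisted := mvccStats{LiveBytes: 100, GCBytesAge: 1 << 30} // wildly inflated age
	recomputed := mvccStats{LiveBytes: 100, GCBytesAge: 1 << 10} // what a scan of the data finds
	delta := recomputeStats(&persisted, recomputed, false)
	fmt.Printf("delta=%+v persisted=%+v\n", delta, persisted)
}
```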
The semantics for computing GCBytesAge were incorrect and are fixed in this commit. Prior to this commit, a non-live write would accrue GCBytesAge from its own timestamp on. That is, if you wrote two versions of a key at 1s and 2s, then when the older version is replaced (at 2s) it would start out with one second of age (from 1s to 2s). However, the key *really* became non-live at 2s, and should have had an age of zero. By extension, imagine a table with lots of writes all dating back to early 2017, and assume that today (early 2018) all these writes are deleted (i.e. a tombstone is placed on top of them). Prior to this commit, each key would immediately get assigned an age of `(early 2018) - (early 2017)`, i.e. a very large number. Yet, the GC queue could only garbage collect them after `(early 2018) + TTL`, so by default 25 hours after the deletion. We use GCBytesAge to trigger the GC queue, so that would cause the GC queue to run without ever getting to remove anything, for the duration of the TTL. This was a big problem bound to be noticed by users. This commit changes the semantics to what the GC queue (and the layman) expects:
1. when a version is shadowed, it becomes non-live at that point and also starts accruing GCBytesAge from that point on.
2. deletion tombstones are an exception: they accrue age from their own timestamp on. This makes sense because a tombstone can be deleted whenever it's older than the TTL (as opposed to a value, which can only be deleted once it's been *shadowed* for longer than the TTL).
This work started out by updating `ComputeStatsGo` to have the desired semantics, fixing up existing tests, and then stress testing `TestMVCCStatsRandomized` with short history lengths to discover failure modes, which were then transcribed into small unit tests. When no more such failures were discoverable, the resulting logic in the various incremental MVCCStats update helpers was simplified and documented, and `ComputeStats` updated accordingly. In turn, `TestMVCCStatsBasic` was removed: it was notoriously hard to read and maintain, and does not add any coverage at this point. The recomputation of the stats in existing clusters is addressed in cockroachdb#21345. Fixes cockroachdb#20554. Release note (bug fix): fix a problem that could cause spurious GC activity, in particular after dropping a table.
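To spell out rules 1 and 2 above, here is a small hypothetical helper (not the actual GC heuristics) that computes when a version first becomes collectable under the new semantics, using the early-2017/early-2018 example with the default 25h TTL:

```go
package main

import (
	"fmt"
	"time"
)

// gcEligibleAt sketches the two rules from the commit message: a shadowed
// value becomes collectable once it has been shadowed for longer than the
// TTL, while a deletion tombstone only needs to be older than the TTL itself.
func gcEligibleAt(ownTS, shadowTS time.Time, isTombstone bool, ttl time.Duration) time.Time {
	if isTombstone {
		return ownTS.Add(ttl)
	}
	return shadowTS.Add(ttl)
}

func main() {
	ttl := 25 * time.Hour
	written := time.Date(2017, 2, 1, 0, 0, 0, 0, time.UTC)  // "early 2017"
	deleted := time.Date(2018, 1, 15, 0, 0, 0, 0, time.UTC) // "early 2018"

	// The old value: collectable 25h after the deletion that shadowed it,
	// not 25h after it was originally written.
	fmt.Println(gcEligibleAt(written, deleted, false, ttl))
	// The tombstone itself: collectable 25h after its own timestamp.
	fmt.Println(gcEligibleAt(deleted, time.Time{}, true, ttl))
}
```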
Experimentation notes. I'm running single-node release-1.1 with a tpch.lineitem
(SF 1) table restore. Without changing the TTL, I dropped this table last night.
The "live bytes" fell to ~zero within 30 minutes (i.e., it took 30 minutes for
all keys to be deleted, but not cleared yet) while on disk we're now using 1.7GB
instead of 1.3GB (makes sense since we wrote lots of MVCC tombstones).
What stuck out was that while this was going on, I saw lots of unexpected GC runs
that didn't get to delete any data. I initially thought those must have been
triggered by the "intent age" (which spikes as the range deletion puts down many,
many intents that are only cleaned up after the commit; they're likely visible for
too long and get the replica queued). But what speaks against this theory is
that all night, GC was running in circles, apparently always triggered but never
successful at reducing the score. This strikes me as quite odd and needs more
investigation.
This morning, I changed the TTL to 100s and am seeing steady GC queue activity,
each run clearing out a whole range and making steady progress. Annoyingly, the
consistency checker is also running all the time, which can't help performance.
The GC queue took around 18 minutes to clean up ~1.3GB worth of on-disk data,
which seems OK. After the run, the data directory stabilized at 200-300MB, which
drops to 8MB after an offline compaction.
RocksDB seems to be running compactions, since the data directory (at the time
of writing) has dropped to 613MB, and within a minute more to 419MB (with some
jitter). Logging output is quiet and memory usage is stable, though I'm sometimes
seeing 25 GC runs logged in the runtime stats, which I think is higher than I'm
used to seeing (the GC queue is not allocation-efficient, so that makes some sense
to me).
Running the experiment again to look specifically into the first part.