storage: unexpected GC queue activity immediately after DROP #20554

Closed
tbg opened this issue Dec 7, 2017 · 10 comments · Fixed by #21407
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

tbg commented Dec 7, 2017

Experimentation notes. I'm running single-node release-1.1 with a tpch.lineitem
(SF 1) table restore. Without changing the TTL, I dropped this table last night.

The "live bytes" fell to ~zero within 30 minutes (i.e., it took 30 minutes for
all keys to be deleted, but not cleared yet) while on disk we're now using 1.7GB
instead of 1.3GB (makes sense since we wrote lots of MVCC tombstones).

What stuck out is that while this was going on, I saw lots of unexpected GC runs
that didn't get to delete any data. I initially thought those must have been
triggered by the "intent age" (which spikes as the range deletion puts down many
intents that are only cleaned up after commit; they're likely visible for too
long and get the replica queued). But what speaks against this theory is that GC
ran in circles all night, apparently always triggered but never successful at
reducing the score. This strikes me as quite odd and needs more investigation.

This morning, I changed the TTL to 100s and am seeing steady GC queue activity,
each run clearing out a whole range and making steady progress. Annoyingly, the
consistency checker is also running all the time, which can't help performance.
The GC queue took around 18 minutes to clean up ~1.3GB worth of on-disk data,
which seems OK. After the run, the data directory stabilized at 200-300MB, which
drops to 8MB after an offline compaction.

RocksDB seems to be running compactions, since the data directory (at the time
of writing) has dropped to 613MB and, within another minute, to 419MB (with some
jitter). Logging output is quiet and memory usage is stable, though I'm sometimes
seeing 25 GC runs logged in the runtime stats, which I think is higher than I'm
used to seeing (the GC queue is not allocation-efficient, so that makes some
sense to me).

Running the experiment again to look specifically into the first part.

tbg added the C-bug label Dec 7, 2017
tbg self-assigned this Dec 7, 2017
tbg commented Dec 7, 2017

Yeah, as suspected - we get a large score, but don't get to GC anything. That's unexpected, as I thought this problem had already been addressed. Taking a look.


2017/12/07 11:54:33.525546 | 1.576784 | gc
-- | -- | --
11:54:33.525700 | .   154 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] processing replica with score queue=true with 284.59/fuzz(1.24)=229.84=valScaleScore(229.84)*deadFrac(1.00)+intentScore(0.00) likely last GC: 6.848607359s ago, 35 MiB non-live, curr. age 689 TiB*s, min exp. reduction: 686 TiB*s
11:54:33.525814 | .   114 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] iterating through range
11:54:35.101439 | 1.575551 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] processing abort cache
11:54:35.101484 | .    45 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] assembled GC keys, now proceeding to GC; stats so far {Now:1512665673.525563410,0 Policy:{TTLSeconds:90000} NumKeysAffected:0 IntentsConsidered:0 IntentTxns:0 TransactionSpanTotal:0 TransactionSpanGCAborted:0 TransactionSpanGCCommitted:0 TransactionSpanGCPending:0 TxnSpanGCThreshold:1512662073.525563410,0 AbortSpanTotal:0 AbortSpanConsidered:0 AbortSpanGCNum:0 PushTxn:0 ResolveTotal:0 ResolveSuccess:0 Threshold:1512575673.525563410,0}
11:54:35.101490 | .     6 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] sending batch 1 of 1
11:54:35.101496 | .     6 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] read-write path
11:54:35.101502 | .     5 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] command queue
11:54:35.101542 | .    41 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] applied timestamp cache
11:54:35.101569 | .    26 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] done with GC evaluation for 0 keys at 0.00 keys/sec. Deleted 0 versions
11:54:35.101621 | .    53 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] evaluated request
11:54:35.101631 | .    10 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] acquired {raft,replica}mu
11:54:35.102016 | .   385 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] applying command
11:54:35.102292 | .   276 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] pushing up to 0 transactions (concurrency 25)
11:54:35.102295 | .     3 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] resolving 0 intents
11:54:35.102314 | .    19 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] MVCC stats after GC: {ContainsEstimates:false LastUpdateNanos:1512665522449220018 IntentAge:0 GCBytesAge:757067559959454 LiveBytes:0 LiveCount:0 KeyBytes:9386345 KeyCount:253685 ValBytes:27212282 ValCount:507370 IntentBytes:0 IntentCount:0 SysBytes:545 SysCount:9}
11:54:35.102398 | .    84 | ... [gc,n1,s1,r34/1:/Table/51/1/22142{3716…-5804…}] GC score after GC: queue=true with 284.59/fuzz(1.24)=229.84=valScaleScore(229.84)*deadFrac(1.00)+intentScore(0.00) likely last GC: 1.576751492s ago, 35 MiB non-live, curr. age 689 TiB*s, min exp. reduction: 686 TiB*s


tbg commented Dec 7, 2017

Yikes, so here's a problem that I just verified in a unit test. Assume that at timestamp 0 you write a key-value pair that is 1MB in total. Then, 1000 seconds later, you delete it (i.e., write a tombstone on top of it). What should the GCBytesAge be at that point?

The answer is zero, as the data has been deleted for zero seconds. But, problematically, we actually compute a GCBytesAge of 1MB*1000s.

This explains the behavior I'm seeing: the restore I'm running has all of its data written sometime in February this year. As it gets deleted (MVCC tombstoned), its GCBytesAge should start at zero and then accrue at a rate of the data's size per second. Instead, it starts out with the roughly ten months since the original writes already baked in. As a result, the GC queue immediately jumps into action, thinking it'll get to delete stuff, but it won't for another TTLSeconds.
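To make this concrete, here is a minimal sketch in plain byte·seconds of the 1MB example above (`buggyGCBytesAge` and `fixedGCBytesAge` are hypothetical helpers for illustration, not CockroachDB code):

```go
package main

import "fmt"

const (
	valueBytes int64 = 1 << 20 // the 1MB value written at t=0s
	writeTS    int64 = 0       // seconds
	deleteTS   int64 = 1000    // tombstone written 1000 seconds later
)

// buggyGCBytesAge ages the shadowed value from its *own* timestamp on: the
// moment it is deleted, it already carries 1000 seconds' worth of age.
func buggyGCBytesAge(now int64) int64 { return valueBytes * (now - writeTS) }

// fixedGCBytesAge ages the value only from the moment it became non-live,
// i.e. from the tombstone's timestamp on: zero at the time of the deletion,
// then growing by 1MB*s per second.
func fixedGCBytesAge(now int64) int64 { return valueBytes * (now - deleteTS) }

func main() {
	fmt.Println("buggy age right after the delete:", buggyGCBytesAge(deleteTS)) // 1MB * 1000s
	fmt.Println("fixed age right after the delete:", fixedGCBytesAge(deleteTS)) // 0
}
```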

tbg commented Dec 7, 2017

The stats are fixable, but it's not clear how best to roll out that fix (we'd need to force full stats recomputations on everything). I'll whip up a fix and then we can worry about how to retrofit it onto existing data.

tbg commented Dec 7, 2017

I spent the last couple of hours trying to fix the stats computation, and while I think I got it mostly right, there's an unfortunate complication regarding the two extra implementations of stats keeping in (Iterator).ComputeStats. These try to be smart by simplifying the computation for the case of a forward iteration through all KV pairs and versions, but that smartness doesn't quite work for the fixed computation.

To illustrate this, consider the case in which one

  • writes a value at t=0s
  • deletes it at t=1s
  • at t=2s calls iter.ComputeStats.

In the MVCC layer, the stats would compute as follows:

  • after the first write: live bytes = (say) 150, key bytes = (say) 125, val bytes = 25
  • after the delete: live bytes = 0, key bytes = 125+10, val bytes = 25 (the 10 bytes are the tombstone key, for which we only count the timestamp suffix)
  • aging that stat to t=2s gives us 1s*(key+val-live) = 1s*160 = 160

The iterator will instead visit the versions in reverse chronological order, seeing first the tombstone (total size 10) and then the value (total size 150). But it has to augment GCBytesAge when it sees the tombstone, and it hasn't seen the value yet (which carries the bulk of the weight). Generalizing this to the case of long chains of deletions and re-creations, it becomes clear that the iteration would need to run in the opposite direction (oldest version first) for the computation to be possible (at which point you can build up an MVCCStats organically).
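To make that concrete, here's a minimal sketch of the oldest-first ("organic") accumulation, using the numbers from the example above (the `version` struct and `accumulate` helper are invented for illustration, not CockroachDB code):

```go
package main

import "fmt"

type version struct {
	ts        int64 // wall time, in seconds
	keyBytes  int64
	valBytes  int64
	tombstone bool
}

// accumulate walks one key's versions oldest-first and returns the GCBytesAge
// at time now: a value accrues age from the moment the next version shadows
// it, a tombstone from its own timestamp, and the newest live value not at all.
func accumulate(oldestFirst []version, now int64) int64 {
	var age int64
	for i, v := range oldestFirst {
		size := v.keyBytes + v.valBytes
		switch {
		case v.tombstone:
			age += size * (now - v.ts)
		case i+1 < len(oldestFirst):
			age += size * (now - oldestFirst[i+1].ts)
		default:
			// newest version and not a tombstone: live, no age.
		}
	}
	return age
}

func main() {
	vs := []version{
		{ts: 0, keyBytes: 125, valBytes: 25},                // write at t=0s
		{ts: 1, keyBytes: 10, valBytes: 0, tombstone: true}, // delete at t=1s
	}
	fmt.Println(accumulate(vs, 2)) // 150*(2-1) + 10*(2-1) = 160
}
```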

The property that made the old computation work with forward iteration was that it needed only the local key to determine its ultimate impact on GCBytesAge. But that computation was incorrect.

For extra fun, a copy of that code exists in C++ as well.

tbg commented Dec 7, 2017

I've made some progress rewriting the Go iterator version, but it'll take a few more days of hunkering down and adding commentary. There's also some complexity that leaks into ccl code, which previously passed a SimpleIterator (an interface that can't iterate backwards and is now too narrow).

@petermattis
Collaborator

@tschottdorf In your rewrite, you need to call Iterator.Prev()? That's problematic, not only because SimpleIterator can't iterate backwards, but because backwards iteration is slower and we want/need ComputeStats to be fast. If I understand you correctly, I think it would be better to maintain stats about the sizes of versions and tombstones in a single forward pass (I think this could be stored in a slice). But since I haven't seen the code, perhaps I'm misunderstanding something significant. Also, I don't understand how we'll keep MVCCStats incrementally updated correctly.

tbg commented Dec 8, 2017

@petermattis I'm aware of these limitations and am equally concerned about them. Let's chat next week. I'm not thrilled about a slice as we couldn't bound its size. Have to think some more about this.

tbg commented Dec 9, 2017

I forgot to write this earlier: I think we can still do the forward iteration without the need for a slice. This is hopefully less bad than it sounds (don't hold me to that just yet, though).

tbg commented Dec 12, 2017

The good news is that I managed to convince myself that fixing ComputeStatsGo is actually rather straightforward while keeping the current code mostly intact. All that's needed is to track the wall time of the nearest deletion at a higher timestamp and to use that to update GCBytesAge (as opposed to using the version's own timestamp), with an exception for deletion tombstones themselves, which continue to use their own timestamp.
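Sketched out (with invented types; the real ComputeStatsGo also has to deal with intents, meta records, and system keys), the direction looks roughly like this: iterate newest-first, as a forward iterator does, and remember the timestamp of the entry that made the data below it non-live.

```go
package main

import "fmt"

type version struct {
	ts        int64 // wall time, in seconds
	keyBytes  int64
	valBytes  int64
	tombstone bool
}

// gcBytesAge visits one key's versions newest-first (the order a forward
// iterator yields them). Instead of a version's own timestamp, it accrues age
// from the timestamp of the entry above it (the one that shadowed it); only
// tombstones keep using their own timestamp.
func gcBytesAge(newestFirst []version, now int64) int64 {
	var age int64
	var shadowTS int64 // timestamp of the previously visited (shadowing) version
	for i, v := range newestFirst {
		size := v.keyBytes + v.valBytes
		switch {
		case v.tombstone:
			age += size * (now - v.ts)
		case i == 0:
			// newest version and not a tombstone: live, contributes no age.
		default:
			age += size * (now - shadowTS)
		}
		shadowTS = v.ts
	}
	return age
}

func main() {
	vs := []version{
		{ts: 1, keyBytes: 10, tombstone: true}, // the tombstone, seen first
		{ts: 0, keyBytes: 125, valBytes: 25},   // the shadowed value
	}
	fmt.Println(gcBytesAge(vs, 2)) // 10*(2-1) + 150*(2-1) = 160, matching the example above
}
```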

Instead, a snag I've hit is with the incremental updates of stats during writes. To see the problem, consider aborting an intent in today's code. The code that does this naively sees only the intent's meta record, but that is not enough to reconstruct the previous stats. This is easy to see when the intent is a deletion: the meta carries no information about the previous value (it's just a tombstone), yet the live bytes already reflect the previous value as having been deleted (NB: if we changed this, committing would become harder instead).

Once I make the change that ties the GCBytesAge to the intent's timestamp (and not the shadowed/deleted version's), I have a similar problem: when an intent gets pushed (or commits at a higher timestamp), I have to read the previous version to collect its stats.

On top of fixing this, we also need to discuss whether and how to fix the incorrect stats already out there. My hope is that we can make do with a workaround in the GC queue: if a GC cycle finishes without having gotten to delete anything, mark the replica (in memory) as "non-GC'able" for the next TTLSeconds; after that time, the deletion will actually succeed. This will never really restore the correct stats, though, since we'll subtract the new computation from a GCBytesAge that was accumulated with the old one, so a base amount of overcounting remains (basically forever). Thanks to stats commutativity, we could recompute from a snapshot and send a delta, but that needs its own queue, etc., so it's more of a long shot. I guess I'll whip up the fix first.
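The in-memory workaround could be as simple as the following sketch (invented names; nothing like this exists in the GC queue today):

```go
package gcqueue

import (
	"sync"
	"time"
)

// gcBackoff remembers, per range, when the next GC attempt can be expected to
// actually delete something, so the queue can skip ranges whose score is
// inflated by the stats bug.
type gcBackoff struct {
	mu        sync.Mutex
	skipUntil map[int64]time.Time
}

// reportEmptyCycle records that a GC run on rangeID finished without deleting
// anything; the range is skipped for one TTL.
func (b *gcBackoff) reportEmptyCycle(rangeID int64, ttl time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.skipUntil == nil {
		b.skipUntil = make(map[int64]time.Time)
	}
	b.skipUntil[rangeID] = time.Now().Add(ttl)
}

// shouldSkip reports whether rangeID should be skipped for now.
func (b *gcBackoff) shouldSkip(rangeID int64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().Before(b.skipUntil[rangeID])
}
```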

tbg commented Dec 12, 2017

> Thanks to stats commutativity, we could recompute from a snapshot and send a delta.

Actually, it seems rather reasonable for the GC queue to compute the associated stats mismatch while it's going through all the keys and versions anyway, and then tack on a diff to apply. We can compute this efficiently by refactoring ComputeStatsGo to expose an interface that accepts key/value pairs fed in from an iterator in forward order (as opposed to iterating itself), so that we can simply "show" it all the KV pairs.
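The interface could look roughly like this (a sketch with invented names and a simplified stats struct, not the actual refactor):

```go
package engine

// MVCCStats stands in for the real stats struct, which has many more fields.
type MVCCStats struct {
	LiveBytes, KeyBytes, ValBytes, GCBytesAge int64
}

// StatsAccumulator is fed key/value pairs in forward iteration order by a
// caller that is already walking the range (e.g. the GC queue), so the stats
// can be recomputed without a second pass over the data.
type StatsAccumulator interface {
	// Add shows the accumulator one encoded MVCC key/value pair; pairs must
	// arrive in the same order a forward iterator would yield them.
	Add(key, value []byte) error
	// Stats finalizes the computation as of nowNanos.
	Stats(nowNanos int64) MVCCStats
}
```

The GC queue, which iterates anyway, would feed each pair to the accumulator and could then compare the result against the persisted stats to derive the diff it tacks on.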

tbg added a commit to tbg/cockroach that referenced this issue Dec 21, 2017
To address cockroachdb#20554, we'll need to use this in more places.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Dec 21, 2017
Found while investigating cockroachdb#20554 (though this does not at all fix the
issue discovered there; that will be a separate PR).

We were incorrectly updating GCBytesAge when moving an intent.

This change raises the question of what to do about the divergent stats
that exist in real-world clusters. As part of addressing cockroachdb#20554, we'll
need a mechanism to correct the stats anyway, and so I will defer its
introduction.

You'll want to view this diff with `?w=1` (insensitive to whitespace changes).

Release note: None.
tbg added a commit to tbg/cockroach that referenced this issue Dec 24, 2017
Found while investigating cockroachdb#20554 (though this does not at all fix the
issue discovered there; that will be a separate PR).

We were incorrectly updating GCBytesAge when moving an intent. The old code was
pretty broken (and, with hindsight, still is after this commit, as the various
child commits expose). Its main problem was that it failed to account for the
`GCBytesAge` difference that results from moving the intent, due to the
incorrect assumption that the size of the intent would stay the same. The code
was also somewhat opaque, and an attempt has been made to improve its
legibility.

This change raises the question of what to do about the divergent stats that
exist in real-world clusters. As part of addressing cockroachdb#20554, we'll need a
mechanism to correct the stats anyway, and so I will defer its introduction.

You'll want to view this diff with `?w=1` (insensitive to whitespace changes).

Release note: None.
tbg added a commit to tbg/cockroach that referenced this issue Jan 8, 2018
RaftTombstoneKey was accidentally made a replicated key when it was first
introduced, a problem we first realized existed when it was [included in
snapshots].

At the time, we included workarounds to skip this key in various places
(snapshot application, consistency checker), but of course we have since failed
to insert further hacks of the same kind elsewhere (the one prompting this PR
being the stats recomputation on splits, which I'm looking into as part of
cockroachdb#20181 -- unfortunately this commit doesn't seem to pertain to that
problem).

It feels sloppy that we didn't follow through back then, but luckily the damage
appears to be limited; it is likely that the replicated existence of this key
results in MVCCStats SysBytes inconsistencies, but as it happens, these stats
are [already] [very] [inconsistent].

This commit does a few things:

- renames the old tombstone key to `RaftIncorrectLegacyTombstoneKey`
- introduces a (correctly unreplicated) `RaftTombstoneKey`
- introduces a migration. Once activated, only the new tombstone is written,
  but both tombstones are checked. Additionally, as the node restarts, all
  legacy tombstones are replaced by correct non-legacy ones.
- when applying a snapshot, forcibly delete any legacy tombstone contained
  within (even before the cluster version is bumped). This prevents new
  legacy tombstones from trickling in from other nodes.

`RaftIncorrectLegacyTombstoneKey` can be purged from the codebase in binaries
post v2.1, as at that point all peers have booted with a version that runs the
migration.

Thus, post v2.1, the replica consistency checker can stop skipping the legacy
tombstone key.

Fixes cockroachdb#12154.

Release note: None

[included in snapshots]: cockroachdb#12131
[already]: cockroachdb#20554
[very]: cockroachdb#20996
[inconsistent]: cockroachdb#21070
tbg added a commit to tbg/cockroach that referenced this issue Jan 9, 2018
A number of bugs in our MVCCStats logic have been fixed recently (see for example
cockroachdb#20996) and others are still present (cockroachdb#20554). For both, and also potentially for
future bugs or deliberate adjustments of the computations, we need a mechanism
to recompute the stats in order to purge incorrect numbers from the system over
time. Such a mechanism is introduced here.

It consists of two main components:

- A new RPC `AdjustStats`, which applies to a single Range and computes the difference
  between the persisted stats and the recomputed ones; it can "inject" a suitable delta
  into the stats (thus fixing the divergence) or do a "dry run".
- A trigger in the consistency checker that runs on the coordinating node (the lease
  holder). The consistency checker already recomputes the stats, and it can compare
  them against the persisted stats and judge whether there is a divergence. If there
  is one, naively one may hope to just insert the newly computed diff into the range,
  but this is not ideal due to concerns about double application and racing consistency
  checks. Instead, use `AdjustStats` (which, of course, was introduced for this purpose)
  which strikes a balance between efficiency and correctness.

Updates cockroachdb#20554.

Release note (general change): added a mechanism to recompute range stats on demand.
tbg added a commit to tbg/cockroach that referenced this issue Jan 12, 2018
A number of bugs in our MVCCStats logic have been fixed recently (see for example
cockroachdb#20996) and others are still present (cockroachdb#20554). For both, and also potentially for
future bugs or deliberate adjustments of the computations, we need a mechanism
to recompute the stats in order to purge incorrect numbers from the system over
time. Such a mechanism is introduced here.

It consists of two main components:

- A new RPC `RecomputeStats`, which applies to a single Range and
computes the difference between the persisted stats and the recomputed
ones; it can "inject" a suitable delta into the stats (thus fixing the
divergence) or do a "dry run".
- A trigger in the consistency checker that runs on the coordinating
node (the lease holder). The consistency checker already recomputes the
stats, and it can compare them against the persisted stats and judge
whether there is a divergence. If there is one, naively one may hope to
just insert the newly computed diff into the range, but this is not
ideal due to concerns about double application and racing consistency
checks. Instead, use `RecomputeStats` (which, of course, was introduced
for this purpose) which strikes a balance between efficiency and
correctness.

Updates cockroachdb#20554.

Release note (general change): added a mechanism to recompute range
stats on demand.
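For reference, the delta mechanics amount to roughly the following sketch (simplified, hypothetical types; the real MVCCStats and RPC plumbing carry much more):

```go
package stats

// MVCCStats is a simplified stand-in for the real struct.
type MVCCStats struct {
	LiveBytes, KeyBytes, ValBytes, GCBytesAge int64
}

// Subtract removes o from s, field by field.
func (s *MVCCStats) Subtract(o MVCCStats) {
	s.LiveBytes -= o.LiveBytes
	s.KeyBytes -= o.KeyBytes
	s.ValBytes -= o.ValBytes
	s.GCBytesAge -= o.GCBytesAge
}

// recomputeDelta returns the adjustment that, added to the persisted stats,
// brings them in line with the freshly recomputed ones. With dryRun set, the
// delta is only reported, not applied.
func recomputeDelta(persisted, recomputed MVCCStats, dryRun bool) (delta MVCCStats, apply bool) {
	delta = recomputed
	delta.Subtract(persisted)
	return delta, !dryRun && delta != (MVCCStats{})
}
```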
tbg added a commit to tbg/cockroach that referenced this issue Jan 16, 2018
The semantics for computing GCBytesAge were incorrect and are fixed in
this commit. Prior to this commit, a non-live write would accrue
GCBytesAge from its own timestamp on. That is, if you wrote two versions
of a key at 1s and 2s, then when the older version is replaced (at 2s)
it would start out with one second of age (from 1s to 2s). However, the
key *really* became non-live at 2s, and should have had an age of zero.

By extension, imagine a table with lots of writes all dating back to
early 2017, and assume that today (early 2018) all these writes are
deleted (i.e. a tombstone placed on top of them). Prior to this commit,
each key would immediately get assigned an age of `(early 2018) - (early
2017)`, i.e. a very large number. Yet, the GC queue could only garbage
collect them after `(early 2018) + TTL`, so by default 25 hours after
the deletion. We use GCBytesAge to trigger the GC queue, so that would
cause the GC queue to run without ever getting to remove anything, for
the TTL. This was a big problem bound to be noticed by users.

This commit changes the semantics to what the GCQueue (and the layman)
expects:

1. when a version is shadowed, it becomes non-live at that point and
   also starts accruing GCBytesAge from that point on.
2. deletion tombstones are an exception: They accrue age from their
   own timestamp on. This makes sense because a tombstone can be
   deleted whenever it's older than the TTL (as opposed to a value,
   which can only be deleted once it's been *shadowed* for longer
   than the TTL).

This work started out by updating `ComputeStatsGo` to have the desired
semantics, fixing up existing tests, and then stress testing
`TestMVCCStatsRandomized` with short history lengths to discover failure
modes which were then transcribed into small unit tests. When no more
such failures were discoverable, the resulting logic in the various
incremental MVCCStats update helpers was simplified and documented, and
`ComputeStats` updated accordingly. In turn, `TestMVCCStatsBasic` was
removed: it was notoriously hard to read and maintain, and does not add
any coverage at this point.

The recomputation of the stats in existing clusters is addressed in
cockroachdb#21345.

Fixes cockroachdb#20554.

Release note (bug fix): fix a problem that could cause spurious GC
activity, in particular after dropping a table.
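For reference, the GC-eligibility consequence of those two rules can be spelled out as a tiny hypothetical helper (not code from the commit):

```go
package gc

import "time"

// gcEligible mirrors the two cases above: a shadowed value may be collected
// once it has been shadowed for longer than the TTL; a tombstone may be
// collected once its own timestamp is older than the TTL.
func gcEligible(now, ownTS, shadowedAt time.Time, ttl time.Duration, isTombstone bool) bool {
	if isTombstone {
		return now.Sub(ownTS) > ttl
	}
	return now.Sub(shadowedAt) > ttl
}
```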