storage: unexpected GC queue activity immediately after DROP #20554
Yeah, as suspected: we get a large score, but don't get to GC anything. That's unexpected, as I thought this problem had already been addressed. Taking a look.
Yikes, so here's a problem that I just verified in a unit test. Assume that at timestamp 0, you write a key-value pair that is 1MB in total. Then, 1000 seconds later, you delete it (i.e. write a tombstone at timestamp 1000E9 nanoseconds). What should the GCBytesAge be at that point? The answer is zero, as the data has been deleted for 0 seconds. But, problematically, we actually compute 1000 seconds' worth of age (1MB × 1000s), because the shadowed value accrues age from its own timestamp rather than from the deletion. This explains the behavior I'm seeing: the restore I'm running has all of its data written sometime in February this year. As it gets deleted (MVCC tombstoned), it should start out with a GCBytesAge of zero, but instead it immediately picks up many months' worth of age, which triggers the GC queue even though nothing can actually be collected until the TTL has passed.
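To make the arithmetic concrete, here is a minimal Go sketch (a hypothetical helper, not the actual MVCCStats code) contrasting the age the old semantics assign with the age one would expect, for the example above:

```go
package main

import "fmt"

const mb = 1 << 20

// gcBytesAgeAt returns the GCBytesAge (in byte-seconds) contributed by a 1MB
// value written at writeSec and deleted at deleteSec, evaluated at nowSec.
// Illustrative sketch only, not the real MVCCStats code.
func gcBytesAgeAt(writeSec, deleteSec, nowSec int64, fromOwnTimestamp bool) int64 {
	var from int64
	if fromOwnTimestamp {
		from = writeSec // old (incorrect) semantics: age accrues from the value's own timestamp
	} else {
		from = deleteSec // expected semantics: age accrues only once the value is non-live
	}
	if nowSec < from {
		return 0
	}
	return mb * (nowSec - from)
}

func main() {
	// Value written at t=0s, deleted at t=1000s, evaluated right at the deletion.
	fmt.Println(gcBytesAgeAt(0, 1000, 1000, true))  // old: 1048576000 byte-seconds
	fmt.Println(gcBytesAgeAt(0, 1000, 1000, false)) // expected: 0
}
```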
The stats are fixable, but it's not clear how to best roll out that fix (we'd need to force full stats recomputations on everything). I'll whip up a fix and then we can worry about how it would actually be retrofitted to existing data.
I spent the last couple of hours trying to fix the stats computation, and while I think I got it mostly right, there's an unfortunate complication regarding the two extra implementations of stats keeping (the iterator-based recomputation, which exists once in Go and once in C++). To illustrate this, consider the case in which one writes a value (150 bytes of key and value in total) and later deletes it with a tombstone (10 bytes).
In the MVCC layer, the stats would compute as follows: when the tombstone is written, the previous value is read as part of the write, so its 150 bytes (plus the tombstone's 10) can be charged to GCBytesAge from the deletion timestamp onward.
The iterator will instead visit the keys in reverse order, seeing first the tombstone (total size 10) and then the value (total size 150). But it has to augment GCBytesAge when it sees the tombstone, and it hasn't seen the value yet (which carries the bulk of the weight). Generalizing this to the case in which there are long chains of deletions and recreations, it becomes obvious that the iteration needs to be in reverse order for the computation to be possible (at which point you can build up an MVCCStats organically). The property that made the old computation work with forward iteration was that it needed only the key at hand to determine its ultimate impact on GCBytesAge, but that computation was incorrect. For extra fun, a copy of that code exists in C++ as well.
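For illustration, here is a minimal sketch (hypothetical types, not the actual `ComputeStats` code) of the bookkeeping for a single key's version history when visited in chronological order, under the intended semantics: a version starts aging when the next write shadows it, and a tombstone ages from its own timestamp:

```go
package main

import "fmt"

// version is a simplified MVCC version for illustration; the real code
// operates on encoded keys and MVCCMetadata.
type version struct {
	tsSec     int64 // timestamp in seconds
	sizeBytes int64 // key+value size
	tombstone bool
}

// gcBytesAge accumulates the age contribution (at nowSec) for one key's
// version history visited oldest-first. A version becomes non-live, and
// starts aging, when the next write shadows it; a tombstone ages from its
// own timestamp. Sketch only.
func gcBytesAge(oldestFirst []version, nowSec int64) int64 {
	var age int64
	for i, v := range oldestFirst {
		if v.tombstone {
			age += v.sizeBytes * (nowSec - v.tsSec)
			continue
		}
		if i+1 < len(oldestFirst) {
			// Shadowed by the next (newer) write; ages from that point on.
			age += v.sizeBytes * (nowSec - oldestFirst[i+1].tsSec)
		}
		// Otherwise this is the newest version and still live: no age.
	}
	return age
}

func main() {
	history := []version{
		{tsSec: 0, sizeBytes: 150},                    // the value
		{tsSec: 1000, sizeBytes: 10, tombstone: true}, // the tombstone that deletes it
	}
	fmt.Println(gcBytesAge(history, 1000)) // 0: nothing has been non-live for any time yet
	fmt.Println(gcBytesAge(history, 1100)) // (150+10)*100 = 16000 byte-seconds
}
```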
I've made some progress rewriting the Go iterator version, but it'll take a few more days of hunkering down and adding commentary. There's also some complexity that leaks into
@tschottdorf In your rewrite, you need to call |
@petermattis I'm aware of these limitations and am equally concerned about them. Let's chat next week. I'm not thrilled about a slice as we couldn't bound its size. Have to think some more about this. |
I forgot to write this earlier: I think we can still do the forward iteration without the need for a slice. This is hopefully less bad than I think (don't mark my words just yet, though). |
The good news is that I managed to convince myself that fixing the stats computation itself is doable. Instead, a snag I've hit is with the incremental updates of stats during writes. To see what the problem is, think about aborting an intent in today's code. The code that does this naively sees only the intent's meta record, but that is not enough to reconstruct the previous stats. This is easy to see if the intent is a deletion: there is no information about the previous value in the meta (it's just a tombstone), yet the live bytes already reflect the previous value as having been deleted (NB: if we changed this, committing would be harder instead). Once I make the change that ties GCBytesAge accrual to the timestamp of the shadowing write, this kind of incremental update has to be revisited as well. On top of fixing this, we also need to discuss whether or how to fix the incorrect stats out there. My hope is that we can make do with a workaround in the GC queue: if the GC cycle finishes without having gotten to delete anything, (in-memory) mark the replica as "non-gc'able" for the next
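As a sketch of the point about aborting a deletion intent (all names hypothetical; this is not the real MVCCStats update code): the meta record alone doesn't carry the previous version's size, yet exactly that information is needed to restore the live bytes and to roll back any age the shadowed value accrued:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for MVCCStats bookkeeping.
type stats struct {
	liveBytes  int64
	gcBytesAge int64 // byte-seconds
}

// versionInfo is what has to be read back from below the intent: the
// previous version's size (and, in the full bookkeeping, its timestamp).
type versionInfo struct {
	sizeBytes int64
	tsSec     int64
}

// abortDeletionIntent undoes the stats effect of a deletion intent written
// at intentTSSec on top of prev, evaluated at nowSec. Sketch only.
func abortDeletionIntent(ms *stats, prev versionInfo, intentTSSec, nowSec int64) {
	// The previous value becomes live again...
	ms.liveBytes += prev.sizeBytes
	// ...and the age it accrued since the intent shadowed it is rolled back.
	ms.gcBytesAge -= prev.sizeBytes * (nowSec - intentTSSec)
}

func main() {
	// A 150-byte value deleted (intent) at t=1000s; the intent is aborted at t=1100s.
	ms := stats{liveBytes: 0, gcBytesAge: 150 * 100}
	abortDeletionIntent(&ms, versionInfo{sizeBytes: 150, tsSec: 0}, 1000, 1100)
	fmt.Printf("%+v\n", ms) // {liveBytes:150 gcBytesAge:0}
}
```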
Actually it seems rather reasonable that the GC queue would compute the associated stats mismatch while it's going through all the keys and versions anyway, and then tack on a diff to apply. We can compute this efficiently by refactoring |
To address cockroachdb#20554, we'll need to use this in more places. Release note: None
Found while investigating cockroachdb#20554 (though this does not at all fix the issue discovered there; that will be a separate PR). We were incorrectly updating GCBytesAge when moving an intent. This change raises the question of what to do about the divergent stats that exist in real-world clusters. As part of addressing cockroachdb#20554, we'll need a mechanism to correct the stats anyway, and so I will defer its introduction. You'll want to view this diff with `?w=1` (insensitive to whitespace changes). Release note: None.
Found while investigating cockroachdb#20554 (though this does not at all fix the issue discovered there; that will be a separate PR). We were incorrectly updating GCBytesAge when moving an intent. The old code was pretty broken (and, with hindsight, it still is after this commit, as the various child commits expose). Its main problem was that it failed to account for the `GCBytesAge` difference that would result from moving the intent, due to the incorrect assumption that the size of the intent would remain the same. The code was also somewhat opaque, and an attempt has been made to improve its legibility. This change raises the question of what to do about the divergent stats that exist in real-world clusters. As part of addressing cockroachdb#20554, we'll need a mechanism to correct the stats anyway, and so I will defer its introduction. You'll want to view this diff with `?w=1` (insensitive to whitespace changes). Release note: None.
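A toy illustration of the accounting bug described above (hypothetical names and a simplified size model; the real code operates on encoded MVCC keys and metadata): when an intent is rewritten at a new timestamp, its on-disk footprint can change, and the stats delta is the difference of the two footprints rather than zero:

```go
package main

import "fmt"

// intentFootprint is a toy model of how many bytes an intent's versioned
// key+value contribute to the stats.
type intentFootprint struct {
	keyBytes int64 // encoded key size (includes the timestamp)
	valBytes int64
}

func (f intentFootprint) total() int64 { return f.keyBytes + f.valBytes }

// moveIntentDelta returns the change in key+value bytes when an intent is
// rewritten at a new timestamp. The buggy assumption called out in the
// commit message amounts to returning 0 here unconditionally.
func moveIntentDelta(before, after intentFootprint) int64 {
	return after.total() - before.total()
}

func main() {
	before := intentFootprint{keyBytes: 30, valBytes: 150}
	after := intentFootprint{keyBytes: 32, valBytes: 150} // timestamp encoding grew by 2 bytes
	fmt.Println(moveIntentDelta(before, after))            // 2, not 0
}
```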
RaftTombstoneKey was accidentally made a replicated key when it was first introduced, a problem we first realized existed when it was [included in snapshots]. At the time, we included workarounds to skip this key in various places (snapshot application, consistency checker), but of course we have failed to insert further hacks of the same kind elsewhere since (the one prompting this PR being the stats recomputation on splits, which I'm looking into as part of cockroachdb#20181 -- unfortunately this commit doesn't seem to pertain to that problem). It feels sloppy that we didn't follow through back then, but luckily the damage appears to be limited; it is likely that the replicated existence of this key results in MVCCStats SysBytes inconsistencies, but as it happens, these stats are [already] [very] [inconsistent]. This commit does a few things:
- renames the old tombstone key to `RaftIncorrectLegacyTombstoneKey`
- introduces a (correctly unreplicated) `RaftTombstoneKey`
- introduces a migration: once activated, only the new tombstone is written, but both tombstones are checked. Additionally, as the node restarts, all legacy tombstones are replaced by correct non-legacy ones.
- when applying a snapshot, forcibly deletes any legacy tombstone contained within (even before the cluster version is bumped). This prevents new legacy tombstones from trickling in from other nodes.
`RaftIncorrectLegacyTombstoneKey` can be purged from the codebase in binaries post v2.1, as at that point all peers have booted with a version that runs the migration. Thus, post v2.1, the replica consistency checker can stop skipping the legacy tombstone key. Fixes cockroachdb#12154. Release note: None
[included in snapshots]: cockroachdb#12131
[already]: cockroachdb#20554
[very]: cockroachdb#20996
[inconsistent]: cockroachdb#21070
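A rough sketch of the dual-read during the migration window (hypothetical types; the tie-break via the larger replica ID is my assumption, not something stated in the commit message): both the new and the legacy tombstone are consulted, and whichever is more restrictive wins:

```go
package main

import "fmt"

// tombstone is a stand-in for the Raft tombstone payload.
type tombstone struct {
	nextReplicaID int
	found         bool
}

// effectiveTombstone checks both the new (unreplicated) tombstone and the
// legacy (incorrectly replicated) one and returns the one to enforce.
// Sketch only; the real migration lives in the replica lifecycle code.
func effectiveTombstone(newTS, legacyTS tombstone) (tombstone, bool) {
	switch {
	case newTS.found && legacyTS.found:
		if legacyTS.nextReplicaID > newTS.nextReplicaID {
			return legacyTS, true
		}
		return newTS, true
	case newTS.found:
		return newTS, true
	case legacyTS.found:
		return legacyTS, true
	default:
		return tombstone{}, false
	}
}

func main() {
	got, ok := effectiveTombstone(
		tombstone{nextReplicaID: 7, found: true},
		tombstone{nextReplicaID: 5, found: true},
	)
	fmt.Println(got.nextReplicaID, ok) // 7 true
}
```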
A number of bugs in our MVCCStats logic have been fixed recently (see for example cockroachdb#20996) and others are still present (cockroachdb#20554). For both, and also potentially for future bugs or deliberate adjustments of the computations, we need a mechanism to recompute the stats in order to purge incorrect numbers from the system over time. Such a mechanism is introduced here. It consists of two main components:
- A new RPC `AdjustStats`, which applies to a single Range and computes the difference between the persisted stats and the recomputed ones; it can "inject" a suitable delta into the stats (thus fixing the divergence) or do a "dry run".
- A trigger in the consistency checker that runs on the coordinating node (the lease holder). The consistency checker already recomputes the stats, and it can compare them against the persisted stats and judge whether there is a divergence. If there is one, naively one may hope to just insert the newly computed diff into the range, but this is not ideal due to concerns about double application and racing consistency checks. Instead, we use `AdjustStats` (which, of course, was introduced for this purpose), which strikes a balance between efficiency and correctness.
Updates cockroachdb#20554. Release note (general change): added a mechanism to recompute range stats on demand.
The semantics for computing GCBytesAge were incorrect and are fixed in this commit. Prior to this commit, a non-live write would accrue GCBytesAge from its own timestamp on. That is, if you wrote two versions of a key at 1s and 2s, then when the older version is replaced (at 2s) it would start out with one second of age (from 1s to 2s). However, the key *really* became non-live at 2s, and should have had an age of zero. By extension, imagine a table with lots of writes all dating back to early 2017, and assume that today (early 2018) all these writes are deleted (i.e. a tombstone is placed on top of them). Prior to this commit, each key would immediately get assigned an age of `(early 2018) - (early 2017)`, i.e. a very large number. Yet, the GC queue could only garbage collect them after `(early 2018) + TTL`, so by default 25 hours after the deletion. We use GCBytesAge to trigger the GC queue, so that would cause the GC queue to run without ever getting to remove anything, for the duration of the TTL. This was a big problem bound to be noticed by users. This commit changes the semantics to what the GC queue (and the layman) expects:
1. when a version is shadowed, it becomes non-live at that point and also starts accruing GCBytesAge from that point on.
2. deletion tombstones are an exception: they accrue age from their own timestamp on. This makes sense because a tombstone can be deleted whenever it's older than the TTL (as opposed to a value, which can only be deleted once it's been *shadowed* for longer than the TTL).
This work started out by updating `ComputeStatsGo` to have the desired semantics, fixing up existing tests, and then stress testing `TestMVCCStatsRandomized` with short history lengths to discover failure modes, which were then transcribed into small unit tests. When no more such failures were discoverable, the resulting logic in the various incremental MVCCStats update helpers was simplified and documented, and `ComputeStats` updated accordingly. In turn, `TestMVCCStatsBasic` was removed: it was notoriously hard to read and maintain, and does not add any coverage at this point. The recomputation of the stats in existing clusters is addressed in cockroachdb#21345. Fixes cockroachdb#20554.
A number of bugs in our MVCCStats logic have been fixed recently (see for example cockroachdb#20996) and others are still present (cockroachdb#20554). For both, and also potentially for future bugs or deliberate adjustments of the computations, we need a mechanism to recompute the stats in order to purge incorrect numbers from the system over time. Such a mechanism is introduced here. It consists of two main components:
- A new RPC `RecomputeStats`, which applies to a single Range and computes the difference between the persisted stats and the recomputed ones; it can "inject" a suitable delta into the stats (thus fixing the divergence) or do a "dry run".
- A trigger in the consistency checker that runs on the coordinating node (the lease holder). The consistency checker already recomputes the stats, and it can compare them against the persisted stats and judge whether there is a divergence. If there is one, naively one may hope to just insert the newly computed diff into the range, but this is not ideal due to concerns about double application and racing consistency checks. Instead, we use `RecomputeStats` (which, of course, was introduced for this purpose), which strikes a balance between efficiency and correctness.
Updates cockroachdb#20554. Release note (general change): added a mechanism to recompute range stats on demand.
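A minimal sketch of the delta logic behind such a `RecomputeStats`-style request (hypothetical stats type; not the actual RPC implementation): recompute the stats from the data, diff them against the persisted ones, and either report the diff (dry run) or fold it into the persisted value:

```go
package main

import "fmt"

// mvccStats is a stand-in with only the fields needed for the sketch.
type mvccStats struct {
	LiveBytes  int64
	GCBytesAge int64
}

func (ms *mvccStats) add(o mvccStats)      { ms.LiveBytes += o.LiveBytes; ms.GCBytesAge += o.GCBytesAge }
func (ms *mvccStats) subtract(o mvccStats) { ms.LiveBytes -= o.LiveBytes; ms.GCBytesAge -= o.GCBytesAge }

// recomputeStats returns the delta between the freshly recomputed stats and
// the persisted ones, and applies it unless this is a dry run. Sketch only.
func recomputeStats(persisted *mvccStats, recomputed mvccStats, dryRun bool) mvccStats {
	delta := recomputed
	delta.subtract(*persisted)
	if !dryRun {
		persisted.add(delta) // persisted now matches the recomputed stats
	}
	return delta
}

func main() {
	persisted := mvccStats{LiveBytes: 100, GCBytesAge: 1 << 30} // wildly inflated age
	recomputed := mvccStats{LiveBytes: 100, GCBytesAge: 1 << 10} // what a scan of the data finds
	delta := recomputeStats(&persisted, recomputed, false)
	fmt.Printf("delta=%+v persisted=%+v\n", delta, persisted)
}
```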
The semantics for computing GCBytesAge were incorrect and are fixed in this commit. Prior to this commit, a non-live write would accrue GCBytesAge from its own timestamp on. That is, if you wrote two versions of a key at 1s and 2s, then when the older version is replaced (at 2s) it would start out with one second of age (from 1s to 2s). However, the key *really* became non-live at 2s, and should have had an age of zero. By extension, imagine a table with lots of writes all dating back to early 2017, and assume that today (early 2018) all these writes are deleted (i.e. a tombstone is placed on top of them). Prior to this commit, each key would immediately get assigned an age of `(early 2018) - (early 2017)`, i.e. a very large number. Yet, the GC queue could only garbage collect them after `(early 2018) + TTL`, so by default 25 hours after the deletion. We use GCBytesAge to trigger the GC queue, so that would cause the GC queue to run without ever getting to remove anything, for the duration of the TTL. This was a big problem bound to be noticed by users. This commit changes the semantics to what the GC queue (and the layman) expects:
1. when a version is shadowed, it becomes non-live at that point and also starts accruing GCBytesAge from that point on.
2. deletion tombstones are an exception: they accrue age from their own timestamp on. This makes sense because a tombstone can be deleted whenever it's older than the TTL (as opposed to a value, which can only be deleted once it's been *shadowed* for longer than the TTL).
This work started out by updating `ComputeStatsGo` to have the desired semantics, fixing up existing tests, and then stress testing `TestMVCCStatsRandomized` with short history lengths to discover failure modes, which were then transcribed into small unit tests. When no more such failures were discoverable, the resulting logic in the various incremental MVCCStats update helpers was simplified and documented, and `ComputeStats` updated accordingly. In turn, `TestMVCCStatsBasic` was removed: it was notoriously hard to read and maintain, and does not add any coverage at this point. The recomputation of the stats in existing clusters is addressed in cockroachdb#21345. Fixes cockroachdb#20554. Release note (bug fix): fix a problem that could cause spurious GC activity, in particular after dropping a table.
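To spell out rules 1 and 2 above, here is a small hypothetical helper (not the actual GC heuristics) that computes when a version first becomes collectable under the new semantics, using the early-2017/early-2018 example with the default 25h TTL:

```go
package main

import (
	"fmt"
	"time"
)

// gcEligibleAt sketches the two rules from the commit message: a shadowed
// value becomes collectable once it has been shadowed for longer than the
// TTL, while a deletion tombstone only needs to be older than the TTL itself.
func gcEligibleAt(ownTS, shadowTS time.Time, isTombstone bool, ttl time.Duration) time.Time {
	if isTombstone {
		return ownTS.Add(ttl)
	}
	return shadowTS.Add(ttl)
}

func main() {
	ttl := 25 * time.Hour
	written := time.Date(2017, 2, 1, 0, 0, 0, 0, time.UTC)  // "early 2017"
	deleted := time.Date(2018, 1, 15, 0, 0, 0, 0, time.UTC) // "early 2018"

	// The old value: collectable 25h after the deletion that shadowed it,
	// not 25h after it was originally written.
	fmt.Println(gcEligibleAt(written, deleted, false, ttl))
	// The tombstone itself: collectable 25h after its own timestamp.
	fmt.Println(gcEligibleAt(deleted, time.Time{}, true, ttl))
}
```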
Experimentation notes. I'm running single-node release-1.1 with a tpch.lineitem
(SF 1) table restore. Without changing the TTL, I dropped this table last night.
The "live bytes" fell to ~zero within 30 minutes (i.e., it took 30 minutes for
all keys to be deleted, but not cleared yet) while on disk we're now using 1.7GB
instead of 1.3GB (makes sense since we wrote lots of MVCC tombstones).
What stuck out was that while this was going on, I saw lots of unexpected GC runs
that didn't get to delete any data. I initially thought those must have been
triggered by the "intent age" (which spikes as the range deletion puts down many,
many intents that are only cleaned up after the commit; they're likely visible for
too long and get the replica queued). But what speaks against this theory is
that all night, GC was running in circles, apparently always triggered but never
successful at reducing the score. This strikes me as quite odd and needs more
investigation.
This morning, I changed the TTL to 100s and am seeing steady GC queue activity,
each run clearing out a whole range and making steady progress. Annoyingly, the
consistency checker is also running all the time, which can't help performance.
The GC queue took around 18 minutes to clean up ~1.3GB worth of on-disk data,
which seems OK. After the run, the data directory stabilized at 200-300MB, which
drops to 8MB after an offline compaction.
RocksDB seems to be running compactions, since the data directory (at the time
of writing) has dropped to 613MB, and within a minute more to 419MB (with some
jitter). Logging output is quiet and memory usage is stable, though I'm sometimes
seeing 25 GC runs logged in the runtime stats, which I think is higher than I'm
used to seeing (the GC queue is not allocation-efficient, so that makes some sense
to me).
Running the experiment again to look specifically into the first part.