Skip to content

Commit

Permalink
storage: fix (the semantics of) MVCCStats.GCBytesAge
Browse files Browse the repository at this point in the history
The semantics for computing GCBytesAge were incorrect and are fixed in
this commit. Prior to this commit, a non-live write would accrue
GCBytesAge from its own timestamp on. That is, if you wrote two versions
of a key at 1s and 2s, then when the older version is replaced (at 2s)
it would start out with one second of age (from 1s to 2s). However, the
key *really* became non-live at 2s, and should have had an age of zero.

By extension, imagine a table with lots of writes all dating back to
early 2017, and assume that today (early 2018) all these writes are
deleted (i.e. a tombstone placed on top of them). Prior to this commit,
each key would immediately get assigned an age of `(early 2018) - early
2017)`, i.e. a very large number. Yet, the GC queue could only garbage
collect them after (early 2018) + TTL`, so by default 25 hours after
the deletion. We use GCBytesAge to trigger the GC queue, so that would
cause the GC queue to run without ever getting to remove anything, for
the TTL. This was a big problem bound to be noticed by users.

This commit changes the semantics to what the GCQueue (and the layman)
expects:

1. when a version is shadowed, it becomes non-live at that point and
   also starts accruing GCBytesAge from that point on.
2. deletion tombstones are an exception: They accrue age from their
   own timestamp on. This makes sense because a tombstone can be
   deleted whenever it's older than the TTL (as opposed to a value,
   which can only be deleted once it's been *shadowed* for longer
   than the TTL).

This work started out by updating `ComputeStatsGo` to have the desired
semantics, fixing up existing tests, and then stress testing
`TestMVCCStatsRandomized` with short history lengths to discover failure
modes which were then transcribed into small unit tests. When no more
such failures were discoverable, the resulting logic in the various
incremental MVCCStats update helpers was simplified and documented, and
`ComputeStats` updated accordingly. In turn, `TestMVCCStatsBasic` was
removed: it was notoriously hard to read and maintain, and does not add
any coverage at this point.

The recomputation of the stats in existing clusters is addressed in
cockroachdb#21345.

Fixes cockroachdb#20554.

Release note (bug fix): fix a problem that could cause spurious GC
activity, in particular after dropping a table.
  • Loading branch information
tbg committed Jan 16, 2018
1 parent 7ce795a commit f274e6a
Show file tree
Hide file tree
Showing 4 changed files with 980 additions and 417 deletions.
24 changes: 19 additions & 5 deletions c-deps/libroach/db.cc
Original file line number Diff line number Diff line change
Expand Up @@ -2356,6 +2356,12 @@ MVCCStatsResult MVCCComputeStatsInternal(::rocksdb::Iterator* const iter_rep, DB
cockroach::storage::engine::enginepb::MVCCMetadata meta;
std::string prev_key;
bool first = false;
// NB: making this uninitialized triggers compiler warnings
// with `-Werror=maybe-uninitialized`. This warning seems like
// a false positive (changing the above line to `first=true`
// which results in equivalent code does not remove it either).
// An assertion has been placed where the compiler would warn.
int64_t accrue_gc_age_nanos = 0;

for (; iter_rep->Valid() && kComparator.Compare(iter_rep->key(), end_key) < 0; iter_rep->Next()) {
const rocksdb::Slice key = iter_rep->key();
Expand All @@ -2366,7 +2372,7 @@ MVCCStatsResult MVCCComputeStatsInternal(::rocksdb::Iterator* const iter_rep, DB
int32_t logical = 0;
if (!DecodeKey(key, &decoded_key, &wall_time, &logical)) {
stats.status = FmtStatus("unable to decode key");
break;
return stats;
}

const bool isSys = (rocksdb::Slice(decoded_key).compare(kLocalMax) < 0);
Expand All @@ -2391,7 +2397,7 @@ MVCCStatsResult MVCCComputeStatsInternal(::rocksdb::Iterator* const iter_rep, DB

if (!implicitMeta && !meta.ParseFromArray(value.data(), value.size())) {
stats.status = FmtStatus("unable to decode MVCCMetadata");
break;
return stats;
}

if (isSys) {
Expand Down Expand Up @@ -2435,15 +2441,23 @@ MVCCStatsResult MVCCComputeStatsInternal(::rocksdb::Iterator* const iter_rep, DB
if (meta.key_bytes() != kMVCCVersionTimestampSize) {
stats.status = FmtStatus("expected mvcc metadata key bytes to equal %d; got %d",
kMVCCVersionTimestampSize, int(meta.key_bytes()));
break;
return stats;
}
if (meta.val_bytes() != value.size()) {
stats.status = FmtStatus("expected mvcc metadata val bytes to equal %d; got %d",
int(value.size()), int(meta.val_bytes()));
break;
return stats;
}
accrue_gc_age_nanos = meta.timestamp().wall_time();
} else {
stats.gc_bytes_age += total_bytes * age_factor(wall_time, now_nanos);
bool is_tombstone = value.size() == 0;
if (is_tombstone) {
stats.gc_bytes_age += total_bytes * age_factor(wall_time, now_nanos);
} else {
assert(accrue_gc_age_nanos > 0);
stats.gc_bytes_age += total_bytes * age_factor(accrue_gc_age_nanos, now_nanos);
}
accrue_gc_age_nanos = wall_time;
}
stats.key_bytes += kMVCCVersionTimestampSize;
stats.val_bytes += value.size();
Expand Down
Loading

0 comments on commit f274e6a

Please sign in to comment.