libroach: timebound iterator metadata doesn't consider intents #28358

Closed
danhhz opened this issue Aug 7, 2018 · 84 comments · Fixed by #31290

@danhhz (Contributor) commented Aug 7, 2018

Note: I haven't confirmed this behavior in practice; it's currently theoretical from reading code. Holding off on the severity label until it's confirmed. The fix will need to be backported into 2.0.

When computing which timestamps are present in an sstable for the time bound iterator performance optimization, all keys without a timestamp are ignored, which includes intents. This means that certain (probably unlikely) compactions will result in a situation where a time bounded iterator reading from [t1,t2] will miss an intent at t2.

rocksdb::Status AddUserKey(const rocksdb::Slice& user_key, const rocksdb::Slice& value,
                           rocksdb::EntryType type, rocksdb::SequenceNumber seq,
                           uint64_t file_size) override {
  rocksdb::Slice unused;
  rocksdb::Slice ts;
  // Only keys with a timestamp suffix contribute to the bounds. Intent
  // (MVCCMetadata) keys and inline values have no timestamp suffix, so they
  // fall through without affecting ts_min_/ts_max_.
  if (SplitKey(user_key, &unused, &ts) && !ts.empty()) {
    ts.remove_prefix(1);  // The NUL prefix.
    if (ts_max_.empty() || ts.compare(ts_max_) > 0) {
      ts_max_.assign(ts.data(), ts.size());
    }
    if (ts_min_.empty() || ts.compare(ts_min_) < 0) {
      ts_min_.assign(ts.data(), ts.size());
    }
  }
  return rocksdb::Status::OK();
}

Backup uses timebounded iterators and does a read at a certain timestamp until all intents below that timestamp have been resolved one way or another. This bug means that backup could incorrectly proceed with an unresolved intent. If this intent is later rolled back, the backup would contain a value that was never committed.

CDC will have similar correctness issues.

Several fixes are possible with various performance tradeoffs. A partial list:

  • Include intents in the time bound metadata. This requires unmarshalling every mvcc metadata proto in the sst and is probably too expensive, especially for inline values like timeseries.
  • Disable time bound metadata for any sstable containing intents. This will cause the time bound iterator to always look in the file, so it will be correct but a performance pessimization (see the sketch after this list).
  • Some combination of the two based on how many intents are in the file or where in the keyspace it is.
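For illustration, the second option could live almost entirely inside the collector shown above. A minimal sketch, assuming a `saw_untimestamped_key_` member and placeholder property names (this is not the actual libroach code):

bool saw_untimestamped_key_ = false;  // hypothetical member, not in the real collector

rocksdb::Status AddUserKey(const rocksdb::Slice& user_key, const rocksdb::Slice& value,
                           rocksdb::EntryType type, rocksdb::SequenceNumber seq,
                           uint64_t file_size) override {
  rocksdb::Slice unused;
  rocksdb::Slice ts;
  if (SplitKey(user_key, &unused, &ts) && !ts.empty()) {
    ts.remove_prefix(1);  // The NUL prefix.
    if (ts_max_.empty() || ts.compare(ts_max_) > 0) {
      ts_max_.assign(ts.data(), ts.size());
    }
    if (ts_min_.empty() || ts.compare(ts_min_) < 0) {
      ts_min_.assign(ts.data(), ts.size());
    }
  } else {
    // No timestamp suffix: this could be an intent's MVCCMetadata key. It
    // could also be an inline value (e.g. timeseries), which is exactly the
    // performance pessimization described in the option above.
    saw_untimestamped_key_ = true;
  }
  return rocksdb::Status::OK();
}

rocksdb::Status Finish(rocksdb::UserCollectedProperties* properties) override {
  // Only publish time bounds if no key in this sstable could be an intent.
  // Without the properties, a time-bound iterator always reads the file.
  if (!saw_untimestamped_key_) {
    properties->insert({"example.ts.min", ts_min_});  // placeholder property names
    properties->insert({"example.ts.max", ts_max_});
  }
  return rocksdb::Status::OK();
}

The third option would have the same shape, just with a counter and a threshold (or a keyspace check) in place of the bool.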
@danhhz danhhz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery A-cdc Change Data Capture labels Aug 7, 2018
@danhhz danhhz added this to the 2.1 milestone Aug 7, 2018
@benesch (Contributor) commented Aug 8, 2018

Backup uses timebounded iterators and does a read at a certain timestamp until all intents below that timestamp have been resolved one way or another. This bug means that backup could incorrectly proceed with an unresolved intent. If this intent is later rolled back, the backup would contain a value that was never committed.

I don't think this exact scenario is possible. If the time bound iterator accidentally excludes a lingering intent from an aborted transaction, it's no problem, as the backup wasn't supposed to contain that intent anyway. The problem, as far as I can see, arises when the iterator excludes a lingering intent from a committed transaction; that intent needed to be promoted to a write, but the backup will not include that write.

@nvanbenschoten (Member) commented:

When an intent is laid down, we write two keys: the MVCCMetadata key at ts=0 that indicates that an intent exists and the provisional value at ts=value_ts that holds the intent's value. When we commit an intent, we actually just delete the MVCCMetadata key, effectively "revealing" the provisional value as committed.

I think the issue here is that if an intent's MVCCMetadata key is split in a different sst from its provisional value key, it's possible that a timebound iterator scans only over the sst with the provisional value key. In that case, the value will look committed even though it's not! I think the hazard arises when the intent is either aborted or pushed before being committed. If this happens, the backup would contain a value that was never committed or was committed at a later timestamp.
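To make the failure mode concrete, here is a toy, self-contained walkthrough (illustrative only, not CockroachDB code) of how the bounds computation from the issue body hides a straddling intent:

#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct Bounds {
  bool set = false;
  int min = 0;
  int max = 0;
  void Add(int ts) {
    if (!set) { min = max = ts; set = true; return; }
    if (ts < min) min = ts;
    if (ts > max) max = ts;
  }
};

// Mirrors the collector in the issue body: keys without a timestamp
// contribute nothing to the sstable's time bounds.
Bounds ComputeBounds(const std::vector<std::pair<std::string, int>>& sst) {
  Bounds b;
  for (const auto& kv : sst) {
    if (kv.second > 0) {
      b.Add(kv.second);
    }
  }
  return b;
}

int main() {
  const int kNoTimestamp = 0;  // stands in for an intent's MVCCMetadata key
  // After an unlucky compaction, the intent on "k" straddles two files:
  // sst1 ends with k's metadata key, sst2 begins with the provisional k@5.
  std::vector<std::pair<std::string, int>> sst1 = {{"a", 1}, {"k", kNoTimestamp}};
  std::vector<std::pair<std::string, int>> sst2 = {{"k", 5}, {"z", 6}};

  Bounds b1 = ComputeBounds(sst1);  // [1,1]: the metadata key left no trace
  Bounds b2 = ComputeBounds(sst2);  // [5,6]
  std::cout << "sst1 bounds: [" << b1.min << "," << b1.max << "]\n";
  std::cout << "sst2 bounds: [" << b2.min << "," << b2.max << "]\n";
  // A time-bound iterator reading (2,10] opens only sst2: it sees the
  // provisional value k@5 without the metadata key that marks it as an
  // intent, so the value incorrectly looks committed.
  return 0;
}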

@benesch (Contributor) commented Aug 8, 2018

Gah! My understanding was still subtly wrong. Thanks for the clarification, @nvanbenschoten.

There are now two other possible solutions, though I'm not sure how well RocksDB supports them. One is to simply disallow splitting a metadata key from its actual values. The other is to only look at the last key in an SST. If it's a metadata key, then we need to look in and see if there's an intent. Only in that case do we need to erase the timestamp bounds on the SST.

@benesch (Contributor) commented Aug 8, 2018

Also, no wonder we haven't seen this in practice.

@nstewart nstewart added the S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. label Sep 18, 2018
@benesch benesch added S-1 High impact: many users impacted, serious risk of high unavailability or data loss and removed S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. labels Oct 10, 2018
@benesch (Contributor) commented Oct 10, 2018

I'm upgrading this to S-1 to hopefully get more visibility. I think this needs to get fixed for v2.1. We haven't seen this in practice, but that's the nature of this bug: you would never notice when it happened.

@petermattis putting this on your radar. Is there anyone who can take a look at this?

@petermattis (Collaborator) commented:

One is to simply disallow splitting a metadata key from its actual values.

This isn't possible as far as I'm aware.

The other is to only look at the last key in an SST. If it's a metadata key, then we need to look in and see if there's an intent.

Are you thinking this would be done at compaction time?

Is there anyone who can take a look at this?

Hrmm, the number of folks familiar with the C++ code is too small.

@benesch (Contributor) commented Oct 11, 2018

Are you thinking this would be done at compaction time?

Yeah, at SST build time. I'm hoping there's some way to do this with the table properties collector interface.

@danhhz (Contributor, Author) commented Oct 11, 2018

Yeah, every time you see an intent, copy the serialized proto to a field in the table prop collector. Every time you see a non-intent, clear it. Then, when the finalize method is called, if the field is non-empty, deserialize the proto and process the timestamp.
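A rough sketch of that shape against the RocksDB table properties collector interface. Everything below is illustrative rather than the actual implementation: `last_meta_value_`, `UpdateBounds`, `EncodeTimestamp`, and the property names are assumed, and the generated `MVCCMetadata` C++ type/namespace is assumed to be available:

std::string last_meta_value_;  // hypothetical: serialized proto of the last candidate intent

rocksdb::Status AddUserKey(const rocksdb::Slice& user_key, const rocksdb::Slice& value,
                           rocksdb::EntryType type, rocksdb::SequenceNumber seq,
                           uint64_t file_size) override {
  rocksdb::Slice unused;
  rocksdb::Slice ts;
  if (SplitKey(user_key, &unused, &ts) && !ts.empty()) {
    // A versioned key follows any earlier metadata key, so that metadata
    // cannot be the last key in the file: forget it.
    last_meta_value_.clear();
    ts.remove_prefix(1);  // The NUL prefix.
    UpdateBounds(ts);     // hypothetical helper: the ts_min_/ts_max_ logic shown earlier
  } else {
    // The value under an intent's metadata key is a serialized MVCCMetadata proto.
    last_meta_value_.assign(value.data(), value.size());
  }
  return rocksdb::Status::OK();
}

rocksdb::Status Finish(rocksdb::UserCollectedProperties* properties) override {
  if (!last_meta_value_.empty()) {
    // The sstable ends in a possible intent whose provisional value may live
    // in the next sstable. Fold its timestamp into the bounds so a time-bound
    // iterator cannot skip this file while the intent is unresolved.
    cockroach::storage::engine::enginepb::MVCCMetadata meta;  // assumed type/namespace
    if (meta.ParseFromArray(last_meta_value_.data(),
                            static_cast<int>(last_meta_value_.size())) &&
        meta.has_timestamp()) {
      UpdateBounds(EncodeTimestamp(meta.timestamp()));  // hypothetical helper
    }
  }
  properties->insert({"example.ts.min", ts_min_});  // placeholder property names
  properties->insert({"example.ts.max", ts_max_});
  return rocksdb::Status::OK();
}

Only the last untimestamped key ever gets deserialized, which keeps the cost independent of how many intent or inline keys the file contains (the concern raised in the first bullet of the issue).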

@petermattis (Collaborator) commented:

I can take a crack at doing this, but I may require some assistance in setting up a test scenario that fails. Sounds like what we want is a situation where there are two sstables where the first contains the MVCC metadata key and the second contains the provisional key. I'm thinking I can create that scenario using ingested sstables. Then I perform a backup and the sstable which contains the provisional key should be included in the backup incorrectly, right?

@benesch (Contributor) commented Oct 11, 2018

I can take a crack at doing this

🎉

Sounds like what we want is a situation where there are two sstables where the first contains the MVCC metadata key and the second contains the provisional key. I'm thinking I can create that scenario using ingested sstables. Then I perform a backup and the sstable which contains the provisional key should be included in the backup incorrectly, right?

Yes, that's my understanding. The SST with the metadata key (SSTm) will need to have time bounds that are older than the SST with the provisional key. Then the backup needs to be incremental on top of a full backup that occurred after the timestamp of any key in SSTm. I think. Getting such a full backup is going to be tricky. I think it'll be easier if you just create an MVCCIncrementalIterator over a store with the desired SSTs and make sure that you don't see the provisional key.

@danhhz (Contributor, Author) commented Oct 11, 2018

Backups never include provisional keys, they retry until it's resolved one way or the other. I think the high-level issue here is that the timebound iterator would incorrectly skip the sstable, and the backup wouldn't know it had to wait for it to be resolved.

@danhhz (Contributor, Author) commented Oct 11, 2018

Oh, I misunderstood what you meant by provisional key. What @benesch said.

@petermattis (Collaborator) commented:

Yes, that's my understanding. The SST with the metadata key (SSTm) will need to have time bounds that are older than the SST with the provisional key. Then the backup needs to be incremental on top of a full backup that occurred after the timestamp of any key in SSTm. I think. Getting such a full backup is going to be tricky. I think it'll be easier if you just create an MVCCIncrementalIterator over a store with the desired SSTs and make sure that you don't see the provisional key.

So perhaps I should do this fully within storage/engine? Can one of you point me to the backup code which uses MVCCIncrementalIterator? Is this the code in ccl/storageccl/export.go?

@danhhz (Contributor, Author) commented Oct 11, 2018

Yeah, I think it's fine (and much simpler) to test at the storage/engine level. You found it:

iter := engineccl.NewMVCCIncrementalIterator(batch, engineccl.IterOptions{
    StartTime:  args.StartTime,
    EndTime:    h.Timestamp,
    UpperBound: args.EndKey,
})
defer iter.Close()
for iter.Seek(engine.MakeMVCCMetadataKey(args.Key)); ; iterFn(iter) {
    ok, err := iter.Valid()
    if err != nil {
        // The error may be a WriteIntentError. In which case, returning it will
        // cause this command to be retried.
        return result.Result{}, err
    }
    if !ok || iter.UnsafeKey().Key.Compare(args.EndKey) >= 0 {
        break
    }
    // Skip tombstone (len=0) records when startTime is zero
    // (non-incremental) and we're not exporting all versions.
    if skipTombstones && args.StartTime.IsEmpty() && len(iter.UnsafeValue()) == 0 {
        continue
    }
    if debugLog {
        // Calling log.V is more expensive than you'd think. Keep it out of
        // the hot path.
        v := roachpb.Value{RawBytes: iter.UnsafeValue()}
        log.Infof(ctx, "Export %s %s", iter.UnsafeKey(), v.PrettyPrint())
    }
    if err := rows.Count(iter.UnsafeKey().Key); err != nil {
        return result.Result{}, errors.Wrapf(err, "decoding %s", iter.UnsafeKey())
    }
    if err := sst.Add(engine.MVCCKeyValue{Key: iter.UnsafeKey(), Value: iter.UnsafeValue()}); err != nil {
        return result.Result{}, errors.Wrapf(err, "adding key %s", iter.UnsafeKey())
    }
}

petermattis added a commit to petermattis/cockroach that referenced this issue Oct 12, 2018
…bles

An intent which straddles an sstable can lead an incremental iterator to
incorrectly ignore an sstable. In order to fix this, when an intent
straddles an sstable (i.e. the metadata key is the last key in the
sstable) we need to include the intent's timestamp in the timestamp
bounds. We don't need to do this for interior intents because we'll
already be including the intent's timestamp as it is contained in the
next key following the intent. Add
`TestMVCCIncrementalIteratorIntentStraddlesSStables` which demonstrates
the problem.

Fixes cockroachdb#28358

Release note (bug fix): Fix a rare scenario where a backup could
incorrectly include a key for a transaction which was aborted.
@a-robinson (Contributor) commented:

Not sure it's workable

That's quite the understatement.

danhhz added a commit to danhhz/cockroach that referenced this issue Dec 11, 2018
For reasons described in cockroachdb#28358, a time-bound iterator will sometimes
see an unresolved intent where there is none. A normal iterator doesn't
have this problem, so we work around it in MVCCIncrementalIterator by
double checking all non-value keys. If the normal iterator has a
different value for the key, it's used instead. If the normal iterator
doesn't have that key, it's skipped.

This fixes both changefeeds and incremental backup.

Closes cockroachdb#32104
Closes cockroachdb#32799

Release note (bug fix): `CHANGEFEED`s and incremental `BACKUP`s no
longer indefinitely hang under an infrequent condition.
craig bot pushed a commit that referenced this issue Dec 13, 2018
32909: engineccl/mvcc: work around time-bound iterator bug r=bdarnell,benesch a=danhhz

For reasons described in #28358, a time-bound iterator will sometimes
see an unresolved intent where there is none. A normal iterator doesn't
have this problem, so we work around it in MVCCIncrementalIterator by
double checking all non-value keys. If the normal iterator has a
different value for the key, it's used instead. If the normal iterator
doesn't have that key, it's skipped.

This fixes both changefeeds and incremental backup.

Closes #32104
Closes #32799

Release note (bug fix): `CHANGEFEED`s and incremental `BACKUP`s no
longer indefinitely hang under an infrequent condition.

Co-authored-by: Daniel Harrison <[email protected]>
benesch added a commit to benesch/cockroach that referenced this issue Dec 18, 2018
This is a prototype of one of the proposed solutions to cockroachdb#28358, whereby
we never delete MVCCMetadata keys when resolving an intent. Instead, we
Put a new value where Timestamp != nil (this is how the SST gets
the right bounds), Txn == nil (this indicates that this is the
resolution of an intent and not a new intent), and RawBytes == nil (to
indicate that this is not an inline value).

This has the downside of a) being extremely subtle, and b) wasting space
because writing a key transactionally now results in the metadata being
present forever. (We could probably work around b with a compaction
filter.)

Touches cockroachdb#28358.

Release note: None
@knz knz unassigned benesch Jan 13, 2019
@danhhz (Contributor, Author) commented Feb 5, 2019

I think it's fair to say this was re-closed by the workaround merged in #32909.

@danhhz danhhz closed this as completed Feb 5, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 11, 2019
RangeFeed originally intended to use the time-bound iterator
performance optimization. However, these iterators have had correctness
issues in the past (cockroachdb#28358, cockroachdb#34819) and no one has
the time for the due diligence necessary to be confident in their
correctness going forward. Not using them causes the total time spent in
RangeFeed catchup on changefeed over tpcc-1000 to go from 40s -> 4853s,
which is quite large but still workable.

Closes cockroachdb#35122

Release note (enterprise change): In exchange for increased correctness
confidence, `CHANGEFEED`s using `changefeed.push.enabled` (the default)
now take slightly more resources on startup and range
rebalancing/splits.
craig bot pushed a commit that referenced this issue Mar 12, 2019
35470: rangefeed: stop using time-bound iterator for catchup scan r=tbg a=danhhz

RangeFeed originally intended to use the time-bound iterator
performance optimization. However, these iterators have had correctness
issues in the past (#28358, #34819) and no one has the time for the due
diligence necessary to be confident in their correctness going forward.
Not using them causes the total time spent in RangeFeed catchup on
changefeed over tpcc-1000 to go from 40s -> 4853s, which is quite large
but still workable.

Closes #35122

Release note (enterprise change): In exchange for increased correctness
confidence, `CHANGEFEED`s using `changefeed.push.enabled` (the default)
now take slightly more resources on startup and range
rebalancing/splits.

Co-authored-by: Daniel Harrison <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 13, 2019
…ents

Fixes cockroachdb#34819.

349ff61 introduced a workaround for cockroachdb#28358 into MVCCIncrementalIterator.
This workaround created a second (non-time-bound) iterator to verify
possibly-phantom MVCCMetadata keys during iteration.

We found in cockroachdb#34819 that it is necessary for correctness that sanityIter
be created before the original iter. This is because the provided Reader
that both iterators are created from may not be a consistent snapshot, so
the two iterators could end up observing different information. The hack
around sanityCheckMetadataKey only works properly if all possible
discrepancies between the two iterators lead to intents and values falling
outside of the timestamp range **from the time-bound iterator's perspective**.
This allows us to simply ignore discrepancies that we notice in advance().

This commit makes this change. It also adds a test that failed regularly
before the fix under stress and no longer fails after the fix.

Release note: None
craig bot pushed a commit that referenced this issue Mar 13, 2019
35300: sql: add checks after all referenced columns have been backfilled r=lucy-zhang a=lucy-zhang

Previously, if a column was added with a check constraint that also referenced
another column that was public, writes to that public column would erroneously
fail (and, in the worst case, result in a panic) while the column being added
was not yet public. With this change, the schema changer will now wait to add
the check constraint to the table descriptor until after all the columns that
were added in the same transaction have been backfilled.

A new optional field has been added to `ConstraintToValidate` for the check
constraint itself, so that it can be added to the table descriptor at the
correct step in the schema change process.

I ended up adding this field to the existing mutation instead of creating a new
type of mutation to add the constraint to the descriptor, since it ultimately
seemed to me that a mutation that simply adds a schema element in its backfill
step would be too inconsistent with what mutations are, especially since all
the mutations for a single transaction are basically executed at the same time.
To support NOT VALID in the future, we could add more flags to the protobuf to
indicate that either the addition of the constraint or the validation should be
skipped, so that they can be executed separately.

Fixes #35258, fixes #35193

Release note: None

35682: engineccl/mvcc: fix time-bound iterator's interaction with moving intents r=nvanbenschoten a=nvanbenschoten

Fixes #34819.

349ff61 introduced a workaround for #28358 into MVCCIncrementalIterator.
This workaround created a second (non-time-bound) iterator to verify
possibly-phantom MVCCMetadata keys during iteration.

We found in #34819 that it is necessary for correctness that sanityIter
be created before the original iter. This is because the provided Reader
that both iterators are created from may not be a consistent snapshot, so
the two iterators could end up observing different information. The hack
around sanityCheckMetadataKey only works properly if all possible
discrepancies between the two iterators lead to intents and values falling
outside of the timestamp range **from the time-bound iterator's perspective**.
This allows us to simply ignore discrepancies that we notice in advance().

This commit makes this change. It also adds a test that failed regularly
before the fix under stress and no longer fails after the fix.

Release note: None

35689: roachtest: add large node kv tests and batching kv tests r=nvanbenschoten a=nvanbenschoten

This commit adds support for running `kv` with a `--batch` parameter. It
then adds the following new roachtest configurations:
- kv0/enc=false/nodes=3/batch=16
- kv95/enc=false/nodes=3/batch=16
- kv0/enc=false/nodes=3/cpu=96
- kv95/enc=false/nodes=3/cpu=96
- kv50/enc=false/nodes=4/cpu=96/batch=64

The last test is currently skipped because of #34241. I confirmed that
it triggers the corresponding assertion on both AWS and GCE.

My request for more m5d.24xlarge quota just succeeded, but I may need to
request more quota for n1-highcpu-96 VMs for these to run nightly.

Release note: None

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 26, 2019
Time-bound iterators are a performance optimization that allows us to
entirely skip over sstables in RocksDB that don't have data relevant to
the time bounds in a request.

This can have a dramatic impact on performance, but we've seen a number
of extremely subtle and hard to detect correctness issues with this (see
issues cockroachdb#28358 cockroachdb#34819). As a result, we've decided to skip the
optimization everywhere that it isn't absolutely necessary for the
feature to work (leaving one place: poller-based changefeeds, which are
being phased out anyway). This will both give increased confidence in
correctness and eliminate any need to consider and investigate
time-bound iterators when/if someone hits a correctness bug.

This commit introduces the plumbing necessary for an individual
ExportRequest to control whether time-bound iterators are allowed or
disallowed.

A bool is introduced to the ExportRequest proto that explicitly allows
time-bound iterators. This means that in a rolling upgrade, it's
possible for an old changefeed-poller node to send a request without the
field set to a new node, which sees the unset field as false, disabling
the optimization. An alternative is to invert the semantics of the bool
(zero-value = enable, true = disable the optimization), but in case any
new uses of ExportRequest are introduced, I'd much prefer the zero-value
of this field be the safer default of disabled.

As part of the investigation for whether we could turn them off for
incremental BACKUP (cockroachdb#35671), I reran some of the initial measurements of
time-bound iterator impact on incremental backup. I installed tpcc-1000
on a 3 node n1-standard-16 cluster, then ran a full backup, then ran
load for 1 hour, noting the time T. With load running, I ran 6
incremental backups from T alternating between tbi and no-tbi:

    tbi incremental backup runtimes: 3m45s,3m56s,4m6s
    no-tbi incremental backup runtimes: 6m45s,6m27s,6m48s

Any impact on normal traffic (latencies etc) seemed to be in the noise.

Closes cockroachdb#35671.

Release note (enterprise change): In exchange for increased correctness
confidence, `BACKUP`s using `INCREMENTAL FROM` now take slightly longer.
craig bot pushed a commit that referenced this issue Mar 28, 2019
36191: storageccl: disable time-bound iteration optimization in BACKUP r=dt,tbg a=danhhz

Time-bound iterators are a performance optimization that allows us to
entirely skip over sstables in RocksDB that don't have data relevant to
the time bounds in a request.

This can have a dramatic impact on performance, but we've seen a number
of extremely subtle and hard to detect correctness issues with this (see
issues #28358 #34819). As a result, we've decided to skip the
optimization everywhere that it isn't absolutely necessary for the
feature to work (leaving one place: poller-based changefeeds, which are
being phased out anyway). This will both give increased confidence in
correctness and eliminate any need to consider and investigate
time-bound iterators when/if someone hits a correctness bug.

This commit introduces the plumbing necessary for an individual
ExportRequest to control whether time-bound iterators are allowed or
disallowed.

A bool is introduced to the ExportRequest proto that explicitly allows
time-bound iterators. This means that in a rolling upgrade, it's
possible for an old changefeed-poller node to send a request without the
field set to a new node, which sees the unset field as false, disabling
the optimization. An alternative is to invert the semantics of the bool
(zero-value = enable, true = disable the optimization), but in case any
new uses of ExportRequest are introduced, I'd much prefer the zero-value
of this field be the safer default of disabled.

As part of the investigation for whether we could turn them off for
incremental BACKUP (#35671), I reran some of the initial measurements of
time-bound iterator impact on incremental backup. I installed tpcc-1000
on a 3 node n1-standard-16 cluster, then ran a full backup, then ran
load for 1 hour, noting the time T. With load running, I ran 6
incremental backups from T alternating between tbi and no-tbi:

    tbi incremental backup runtimes: 3m45s,3m56s,4m6s
    no-tbi incremental backup runtimes: 6m45s,6m27s,6m48s

Any impact on normal traffic (latencies etc) seemed to be in the noise.

Closes #35671.

Release note (enterprise change): In exchange for increased correctness
confidence, `BACKUP`s using `INCREMENTAL FROM` now take slightly longer.

Co-authored-by: Daniel Harrison <[email protected]>