libroach: timebound iterator metadata doesn't consider intents #28358
I don't think this exact scenario is possible. If the time-bound iterator accidentally excludes a lingering intent from an aborted transaction, it's no problem, as the backup wasn't supposed to contain that intent anyway. The problem, as far as I can see, arises when the iterator excludes a lingering intent from a committed transaction; that intent needed to be promoted to a write, but the backup will not include that write.
When an intent is laid down, we write two keys: the MVCCMetadata key at ts=0 that indicates that an intent exists and the provisional value at ts=value_ts that holds the intent's value. When we commit an intent, we actually just delete the MVCCMetadata key, effectively "revealing" the provisional value as committed. I think the issue here is that if an intent's MVCCMetadata key is split into a different sst from its provisional value key, it's possible that a timebound iterator scans only over the sst with the provisional value key. In that case, the value will look committed even though it's not! I think the hazard arises when the intent is either aborted or pushed before being committed. If this happens, the backup would contain a value that was never committed or was committed at a later timestamp.
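To make that layout concrete, here's a minimal toy sketch; the `mvccKey` type and encoding are illustrative stand-ins, not CockroachDB's actual MVCCKey encoding:

```go
package main

import "fmt"

// mvccKey is a user key plus an optional timestamp; a zero timestamp
// marks an MVCCMetadata (intent) key, which sorts before all versions.
type mvccKey struct {
	key string
	ts  int64 // 0 means "metadata key"
}

func main() {
	// Laying down an intent on "a" at ts=10 writes two engine keys:
	intentMeta := mvccKey{key: "a", ts: 0}   // MVCCMetadata: "an intent exists"
	provisional := mvccKey{key: "a", ts: 10} // the intent's provisional value

	engine := map[mvccKey]string{
		intentMeta:  "<serialized MVCCMetadata>",
		provisional: "val",
	}

	// Committing the intent deletes only the metadata key, which
	// "reveals" the provisional value as a committed version.
	delete(engine, intentMeta)
	fmt.Println(engine) // map[{a 10}:val]: the value now looks committed
}
```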
Gah! My understanding was still subtly wrong. Thanks for the clarification, @nvanbenschoten. There are now two other solutions, though I'm not sure how well RocksDB supports them. One is to simply disallow splitting a metadata key from its actual values. The other is to only look at the last key in an SST. If it's a metadata key, then we need to look in and see if there's an intent. Only in that case do we need to erase the timestamp bounds on the SST.
Also, no wonder we haven't seen this in practice.
I'm upgrading this to S-1 to hopefully get more visibility. I think this needs to get fixed for v2.1. We haven't seen this in practice, but that's the nature of this bug: you would never notice when it happened. @petermattis putting this on your radar. Is there anyone who can take a look at this?
This isn't possible as far as I'm aware.
Are you thinking this would be done at compaction time?
Hrmm, the number of folks familiar with the C++ code is too small.
Yeah, at SST build time. I'm hoping there's some way to do this with the table properties collector interface.
Yeah, every time you see an intent, copy the serialized proto to a field in the table prop collector. Every time you see a non-intent, clear it. Then, when the finalize method is called, if the field is non-empty, deserialize the proto and process the timestamp.
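A sketch of that approach, in Go for readability (the real collector lives in libroach's C++ and would hang off RocksDB's table properties collector hooks; all names here are hypothetical):

```go
package main

import "fmt"

// collector is a toy table-properties collector implementing the approach
// above: remember the last-seen intent's timestamp, clear it whenever any
// other key follows, and at finish fold a trailing intent into the bounds.
type collector struct {
	min, max       int64
	haveBounds     bool
	trailingIntent *int64 // set iff the most recent key was intent metadata
}

func (c *collector) update(ts int64) {
	if !c.haveBounds || ts < c.min {
		c.min = ts
	}
	if !c.haveBounds || ts > c.max {
		c.max = ts
	}
	c.haveBounds = true
}

// addKey is called for every key in SST order. For an intent metadata key,
// intentTS is the timestamp decoded from the serialized MVCCMetadata proto.
func (c *collector) addKey(ts int64, isIntentMeta bool, intentTS int64) {
	if isIntentMeta {
		t := intentTS
		c.trailingIntent = &t
		return
	}
	// An interior intent needs no special handling: the very next key is
	// its provisional value, which carries the same timestamp.
	c.trailingIntent = nil
	c.update(ts)
}

func (c *collector) finish() {
	if c.trailingIntent != nil {
		c.update(*c.trailingIntent) // the SST ends in a bare intent
	}
}

func main() {
	c := &collector{}
	c.addKey(0, true, 10) // SST's last key: intent metadata for ts=10
	c.finish()
	fmt.Printf("bounds [%d, %d]\n", c.min, c.max) // bounds [10, 10]
}
```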
I can take a crack at doing this, but I may require some assistance in setting up a test scenario which fails. Sounds like what we want is a situation with two sstables: the first contains the MVCC metadata key and the second contains the provisional key. I'm thinking I can create that scenario using ingested sstables. Then I perform a backup, and the sstable which contains the provisional key should be included in the backup incorrectly, right?
🎉
Yes, that's my understanding. The SST with the metadata key (SSTm) will need to have time bounds that are older than the SST with the provisional key. Then the backup needs to be incremental on top of a full backup that occurred after the timestamp of any key in SSTm. I think. Getting such a full backup is going to be tricky. I think it'll be easier if you just create an MVCCIncrementalIterator over a store with the desired SSTs and make sure that you don't see the provisional key (a toy simulation of this scenario is sketched below).
Backups never include provisional keys; they retry until the intent is resolved one way or the other. I think the high-level issue here is that the timebound iterator would incorrectly skip the sstable, and the backup wouldn't know it had to wait for it to be resolved.
Oh, I misunderstood what you meant by provisional key. What @benesch said. |
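To make the hazard concrete, here's a self-contained toy simulation of the skipped sstable, simplified so that SSTm holds only the bare metadata key; none of this is CockroachDB's real code:

```go
package main

import "fmt"

type kv struct {
	key string
	ts  int64 // 0 = MVCCMetadata (intent) key, which carries no timestamp
}

// sst models an sstable plus the time bounds its property collector
// computed. The buggy collector ignores ts=0 keys entirely.
type sst struct {
	keys     []kv
	min, max int64
}

func buildSST(keys ...kv) sst {
	s := sst{keys: keys, min: 1 << 62, max: 0}
	for _, k := range keys {
		if k.ts == 0 {
			continue // BUG: intent metadata contributes nothing to the bounds
		}
		if k.ts < s.min {
			s.min = k.ts
		}
		if k.ts > s.max {
			s.max = k.ts
		}
	}
	return s
}

func main() {
	// The intent's two keys land in different SSTs: SSTm ends with the
	// bare metadata key, SSTv holds the provisional value at ts=10.
	sstm := buildSST(kv{"a", 0})  // bounds stay empty
	sstv := buildSST(kv{"a", 10}) // bounds [10, 10]

	// A time-bound scan over [5, 20] consults only SSTs whose bounds
	// overlap the interval, so it skips SSTm and its intent marker.
	for _, s := range []sst{sstm, sstv} {
		if s.max < 5 || s.min > 20 {
			fmt.Println("skip", s.keys, ": the intent is invisible")
			continue
		}
		fmt.Println("scan", s.keys)
	}
	// The provisional value in SSTv is now indistinguishable from a
	// committed write, so the backup would include it.
}
```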
So perhaps I should do this fully within |
Yeah, I think it's fine (and much simpler) to test at the storage/engine level. You found it:

cockroach/pkg/ccl/storageccl/export.go
Lines 164 to 200 in 62c976d
…bles

An intent which straddles an sstable boundary can lead an incremental iterator to incorrectly ignore that sstable. To fix this, when an intent straddles an sstable (i.e. the metadata key is the last key in the sstable), we need to include the intent's timestamp in the sstable's timestamp bounds. We don't need to do this for interior intents, because the intent's timestamp is already included via the next key following the intent (its provisional value).

Add `TestMVCCIncrementalIteratorIntentStraddlesSStables`, which demonstrates the problem.

Fixes cockroachdb#28358

Release note (bug fix): Fix a rare scenario where a backup could incorrectly include a key for a transaction which was aborted.
That's quite the understatement.
32909: engineccl/mvcc: work around time-bound iterator bug r=bdarnell,benesch a=danhhz

For reasons described in #28358, a time-bound iterator will sometimes see an unresolved intent where there is none. A normal iterator doesn't have this problem, so we work around it in MVCCIncrementalIterator by double-checking all non-value keys. If the normal iterator has a different value for the key, that value is used instead. If the normal iterator doesn't have the key at all, it's skipped.

This fixes both changefeeds and incremental backup.

Closes #32104
Closes #32799

Release note (bug fix): `CHANGEFEED`s and incremental `BACKUP`s no longer indefinitely hang under an infrequent condition.

Co-authored-by: Daniel Harrison <[email protected]>
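A toy sketch of the double-check; map lookups stand in for seeking a second, non-time-bound iterator, and the names are illustrative rather than the real MVCCIncrementalIterator API:

```go
package main

import "fmt"

// store stands in for an engine snapshot; a real implementation would seek
// a second (non-time-bound) RocksDB iterator rather than look up a map.
type store map[string]string

// sanityCheckMetadataKey mirrors the workaround's shape: if the normal view
// lacks the key, the time-bound iterator saw a phantom and we skip it; if
// it has a (possibly different) value, that value is used instead.
func sanityCheckMetadataKey(normal store, key string) (string, bool) {
	v, ok := normal[key]
	return v, ok
}

func main() {
	normal := store{"a": "resolved-meta"} // intent on "a" was already resolved
	// Keys the time-bound iterator surfaced as suspicious metadata keys:
	for _, key := range []string{"a", "b"} {
		if v, ok := sanityCheckMetadataKey(normal, key); ok {
			fmt.Println(key, "=> use the normal iterator's value:", v)
		} else {
			fmt.Println(key, "=> phantom intent; skip the key entirely")
		}
	}
}
```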
This is a prototype of one of the proposed solutions to cockroachdb#28358, whereby we never delete MVCCMetadata keys when resolving an intent. Instead, we Put a new value where Timestamp != nil (this is how the SST gets the right bounds), Txn == nil (this indicates that this is the resolution of an intent and not a new intent), and RawBytes == nil (to indicate that this is not an inline value). This has the downside of a) being extremely subtle, and b) wasting space, because writing a key transactionally now results in the metadata being present forever. (We could probably work around b with a compaction filter.)

Touches cockroachdb#28358.

Release note: None
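A toy sketch of that marker record; the structs are stand-ins for the real protos, with field names following the description above:

```go
package main

import "fmt"

type hlcTimestamp struct{ wallTime int64 }

type txnMeta struct{ id string }

type mvccMetadata struct {
	Timestamp *hlcTimestamp // non-nil gives the SST the right time bounds
	Txn       *txnMeta      // nil marks "resolved"; non-nil marks a live intent
	RawBytes  []byte        // nil: not an inline value
}

// resolveIntent sketches the prototype: rather than deleting the metadata
// key, overwrite it with a marker whose field combination is never written
// by any other path.
func resolveIntent(meta *mvccMetadata, commitTS hlcTimestamp) {
	meta.Timestamp = &commitTS
	meta.Txn = nil
	meta.RawBytes = nil
}

func main() {
	meta := &mvccMetadata{Timestamp: &hlcTimestamp{10}, Txn: &txnMeta{id: "txn1"}}
	resolveIntent(meta, hlcTimestamp{10})
	fmt.Printf("%+v\n", *meta) // the metadata key persists; SST bounds stay correct
}
```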
I think it's fair to say this was re-closed by the workaround merged in #32909.
35470: rangefeed: stop using time-bound iterator for catchup scan r=tbg a=danhhz

RangeFeed originally intended to use the time-bound iterator performance optimization. However, time-bound iterators have had correctness issues in the past (#28358, #34819) and no one has the time for the due diligence necessary to be confident in their correctness going forward. Not using them causes the total time spent in RangeFeed catchup on a changefeed over tpcc-1000 to go from 40s to 4853s, which is quite large but still workable.

Closes #35122

Release note (enterprise change): In exchange for increased correctness confidence, `CHANGEFEED`s using `changefeed.push.enabled` (the default) now take slightly more resources on startup and range rebalancing/splits.

Co-authored-by: Daniel Harrison <[email protected]>
35682: engineccl/mvcc: fix time-bound iterator's interaction with moving intents r=nvanbenschoten a=nvanbenschoten

Fixes #34819.

349ff61 introduced a workaround for #28358 into MVCCIncrementalIterator. This workaround created a second (non-time-bound) iterator to verify possibly-phantom MVCCMetadata keys during iteration.

We found in #34819 that it is necessary for correctness that sanityIter be created before the original iter. This is because the provided Reader that both iterators are created from may not be a consistent snapshot, so the two iterators could end up observing different information. The hack around sanityCheckMetadataKey only works properly if all possible discrepancies between the two iterators lead to intents and values falling outside of the timestamp range **from the time-bound iterator's perspective**. This allows us to simply ignore discrepancies that we notice in advance().

This commit makes this change. It also adds a test that failed regularly before the fix under stress and no longer fails after the fix.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
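A stub sketch of that ordering constraint; all types here are toys, and only the creation order is the point:

```go
package main

type iterator struct{ timeBound bool }

type reader struct{} // may NOT be a consistent snapshot of the engine

func (reader) newIterator(timeBound bool) *iterator { return &iterator{timeBound} }

type mvccIncrementalIterator struct{ iter, sanityIter *iterator }

func newMVCCIncrementalIterator(r reader) *mvccIncrementalIterator {
	// Created FIRST: if the two iterators end up observing different
	// engine states, the sanity iterator's state is the older one, so any
	// discrepancy shows up as intents and values outside the timestamp
	// range from the time-bound iterator's perspective, which advance()
	// can then safely ignore.
	sanityIter := r.newIterator(false /* timeBound */)
	iter := r.newIterator(true /* timeBound */)
	return &mvccIncrementalIterator{iter: iter, sanityIter: sanityIter}
}

func main() { _ = newMVCCIncrementalIterator(reader{}) }
```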
36191: storageccl: disable time-bound iteration optimization in BACKUP r=dt,tbg a=danhhz

Time-bound iterators are a performance optimization that allows us to entirely skip over sstables in RocksDB that don't have data relevant to the time bounds in a request. This can have a dramatic impact on performance, but we've seen a number of extremely subtle and hard-to-detect correctness issues with this (see issues #28358 and #34819). As a result, we've decided to skip the optimization everywhere that it isn't absolutely necessary for the feature to work (leaving one place: poller-based changefeeds, which are being phased out anyway). This will both give increased confidence in correctness and eliminate any need to consider and investigate time-bound iterators when/if someone hits a correctness bug.

This commit introduces the plumbing necessary for an individual ExportRequest to control whether time-bound iterators are allowed or disallowed. A bool is introduced to the ExportRequest proto that explicitly allows time-bound iterators. This means that in a rolling upgrade, it's possible for an old changefeed-poller node to send a request without the field set to a new node, which sees the unset field as false, disabling the optimization. An alternative is to invert the semantics of the bool (zero value = enable, true = disable the optimization), but in case any new uses of ExportRequest are introduced, I'd much prefer that the zero value of this field be the safer default of disabled.

As part of the investigation into whether we could turn them off for incremental BACKUP (#35671), I reran some of the initial measurements of time-bound iterator impact on incremental backup. I installed tpcc-1000 on a 3-node n1-standard-16 cluster, ran a full backup, then ran load for 1 hour, noting the time T. With load running, I ran 6 incremental backups from T, alternating between tbi and no-tbi:

tbi incremental backup runtimes: 3m45s, 3m56s, 4m6s
no-tbi incremental backup runtimes: 6m45s, 6m27s, 6m48s

Any impact on normal traffic (latencies etc.) seemed to be in the noise.

Closes #35671.

Release note (enterprise change): In exchange for increased correctness confidence, `BACKUP`s using `INCREMENTAL FROM` now take slightly longer.

Co-authored-by: Daniel Harrison <[email protected]>
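A sketch of the gating this describes; the field and option names are assumptions for illustration, not necessarily the exact proto/API names:

```go
package main

import "fmt"

// exportRequest stands in for the ExportRequest proto described above.
type exportRequest struct {
	// false (the proto zero value) disables the optimization, so requests
	// from old nodes that lack the field fall back to the safe path.
	EnableTimeBoundIteratorOptimization bool
}

// iterOptions stands in for the engine's iterator options; zero hints mean
// a plain iterator that scans every sstable.
type iterOptions struct {
	minTimestampHint, maxTimestampHint int64
}

func optionsFor(req exportRequest, startTS, endTS int64) iterOptions {
	if req.EnableTimeBoundIteratorOptimization {
		return iterOptions{minTimestampHint: startTS, maxTimestampHint: endTS}
	}
	return iterOptions{}
}

func main() {
	fmt.Println(optionsFor(exportRequest{}, 5, 20))                                          // {0 0}: safe default
	fmt.Println(optionsFor(exportRequest{EnableTimeBoundIteratorOptimization: true}, 5, 20)) // {5 20}
}
```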
Note: I haven't confirmed this behavior in practice; it's currently theoretical, from reading the code. Holding off on the severity label until it's confirmed. The fix will need to be backported to 2.0.
When computing which timestamps are present in an sstable for the time-bound iterator performance optimization, all keys without a timestamp are ignored, including intents. This means that certain (probably unlikely) compactions will result in a situation where a time-bound iterator reading from [t1, t2] will miss an intent at t2.

cockroach/c-deps/libroach/timebound.cc
Lines 38 to 53 in 8b684ff
Backup uses time-bound iterators and retries reads at a certain timestamp until all intents below that timestamp have been resolved one way or another. This bug means that a backup could incorrectly proceed while an intent below its read timestamp is still unresolved. If that intent is later rolled back, the backup would contain a value that was never committed.
CDC will have similar correctness issues.
Several fixes are possible, with various performance tradeoffs. A partial list: