ccl/changefeedccl: TestAvroEncoder failed under stress #34819
SHA: https://github.com/cockroachdb/cockroach/commits/10f8010fa5778e740c057905e2d7664b5fd5d647 Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1135873&tab=buildLog
|
Huh, guess this is real. I wonder what changed. |
SHA: https://github.com/cockroachdb/cockroach/commits/22119c85d2e7aca50f21f61408ae55207679569d Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1140564&tab=buildLog
|
This reproduces locally under stressrace pretty easily. I turned on verbose changefeed logging and found something odd. I assumed the bug was in the avro encoder stuff reusing some memory that it shouldn't, but I think I'm seeing a kv get written with a different timestamp than the one returned by `cluster_logical_timestamp()`. The test does:

```go
var ts2 string
require.NoError(t, crdb.ExecuteTx(ctx, db, nil /* txopts */, func(tx *gosql.Tx) error {
	return tx.QueryRow(
		`INSERT INTO foo VALUES (3, 'baz') RETURNING cluster_logical_timestamp()`,
	).Scan(&ts2)
}))
assertPayloadsAvro(t, reg, fooUpdated, []string{
	`foo: {"a":{"long":3}}->{"after":{"foo":{"a":{"long":3},"b":{"string":"baz"}}},` +
		`"updated":{"string":"` + ts2 + `"}}`,
})
```

With verbose changefeed logging on, the "expected" in the error is ts2, which comes back through SQL from `cluster_logical_timestamp()`. |
Okay, did another repro. I'm still confused why I'm not seeing this with rangefeed, so maybe it has something to do with the ExportRequest implementation (which notably uses a time-bound iterator). I'll try making the test use a normal iterator to double-check what's coming back from ExportRequest. |
I don't know much past what you've already said -- we shouldn't be seeing a kv with any timestamp other than the one returned from `cluster_logical_timestamp()`. If you can't reproduce this with rangefeed, then I think you're correct to start digging into the ExportRequest implementation. Is it possible that an ExportRequest is pushing an intent and causing the transaction to restart? It would be worth tracing (or just instrumenting) that code path. |
The plot thickens:
Smells like time-bound iterator, bummer :-/ |
We'll have to wait for more logging, but the fact that the sanity check iterator doesn't use a max timestamp hint, while the time-bound iterator (which the provisional value is pulled from) does, seems very suspicious to me. I don't have a full story that explains what you're seeing here, but it certainly feels like that could cause issues. If you feel up to it, I'd try reproducing with a max timestamp hint on the sanity check iterator as well. |
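For concreteness, a minimal sketch of that experiment, assuming the `engine.IterOptions` timestamp-hint fields from this era of the codebase; the function and parameter names here are approximations, not the actual tree:

```go
import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage/engine"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// newIterPair sketches the proposed experiment: give the sanity check
// iterator the same MaxTimestampHint the time-bound iterator already uses.
func newIterPair(reader engine.Reader, endKey roachpb.Key, start, end hlc.Timestamp) (iter, sanityIter engine.Iterator) {
	// The time-bound iterator passes sstable-pruning hints.
	iter = reader.NewIterator(engine.IterOptions{
		UpperBound:       endKey,
		MinTimestampHint: start,
		MaxTimestampHint: end,
	})
	// The sanity check iterator currently passes no hints; the experiment is
	// to give it a MaxTimestampHint too and see whether the repro changes.
	sanityIter = reader.NewIterator(engine.IterOptions{
		UpperBound:       endKey,
		MaxTimestampHint: end,
	})
	return iter, sanityIter
}
```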
I just got the following repro with additional logging in 19d76f7ea0bdae258808de3937b82cc3c7d8ad42
Sorry the debug messages I added are so haphazard and oddly ordered, but I'll walk through what happens here. Line 1: the TBI hits an intent. So at this point, there's a new provisional value (as seen by the sanity iter), which Nathan assures me means the old provisional value will have been deleted. But the TBI is still seeing some previous one at an older timestamp. Provisional value deletions do have a timestamp, so they should be reflected in TBI sstable metadata, which means the TBI shouldn't see this intent. I'm not sure yet exactly what is breaking down. |
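For context, here's roughly what the #28358 workaround under discussion does, per the commit message later in this thread; this is a hedged sketch with assumed method and field names, not the real code:

```go
// sanityCheckMetadataKey re-checks a possibly-phantom MVCCMetadata (intent)
// key surfaced by the time-bound iterator, using a non-time-bound iterator.
func (i *MVCCIncrementalIterator) sanityCheckMetadataKey() ([]byte, bool, error) {
	if i.sanityIter == nil {
		// Created lazily, i.e. after the time-bound iterator -- which, as the
		// rest of this thread works out, is exactly the problem: by now the
		// Reader can expose newer state than the TBI bound to.
		i.sanityIter = i.reader.NewIterator(engine.IterOptions{UpperBound: i.upperBound})
	}
	i.sanityIter.Seek(i.iter.UnsafeKey())
	if ok, err := i.sanityIter.Valid(); err != nil || !ok {
		return nil, false, err
	}
	if !i.sanityIter.UnsafeKey().Equal(i.iter.UnsafeKey()) {
		// The non-time-bound iterator doesn't see the key: treat the TBI's
		// version as a phantom and skip it.
		return nil, false, nil
	}
	return i.sanityIter.Value(), true, nil
}
```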
35470: rangefeed: stop using time-bound iterator for catchup scan r=tbg a=danhhz

RangeFeed originally intended to use the time-bound iterator performance optimization. However, they've had correctness issues in the past (#28358, #34819) and no one has the time for the due diligence necessary to be confident in their correctness going forward. Not using them causes the total time spent in RangeFeed catchup on changefeed over tpcc-1000 to go from 40s to 4853s, which is quite large but still workable.

Closes #35122

Release note (enterprise change): In exchange for increased correctness confidence, `CHANGEFEED`s using `changefeed.push.enabled` (the default) now take slightly more resources on startup and range rebalancing/splits.

Co-authored-by: Daniel Harrison <[email protected]>
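A hedged sketch of what that change amounts to in the catchup scan; the variable and option names are assumptions based on the commit message, not the actual diff:

```go
// Before (approximately): timestamp hints let RocksDB skip whole sstables,
// which is fast but depends on time-bound iterator correctness.
//
//	it := reader.NewIterator(engine.IterOptions{
//		UpperBound:       span.EndKey,
//		MinTimestampHint: catchupTimestamp,
//		MaxTimestampHint: hlc.MaxTimestamp,
//	})

// After: a plain iterator that visits every sstable overlapping the span,
// trading the 40s -> 4853s catchup cost noted above for confidence.
it := reader.NewIterator(engine.IterOptions{UpperBound: span.EndKey})
defer it.Close()
```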
Ok, I think I see what's going on. I'm running to a meeting so I'll write the explanation up later, but here's a hint - this fixes it:
|
The issue is that the Reader that both iterators are created from isn't guaranteed to be a consistent snapshot, so the two iterators can end up observing different information. The hack in sanityCheckMetadataKey only works if every discrepancy between them is one we can safely ignore. So what do we do about this? I see a few options. Whichever we choose, this will affect any user of MVCCIncrementalIterator. |
Yeah this fixes it, but the reason why is pretty subtle. By creating the sanityIter before the original iter, any discrepancy between the two iterators can only be an intent or value that falls outside the timestamp range from the time-bound iterator's perspective, which advance() can then safely ignore. |
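Here's the ordering constraint at the heart of the fix as a minimal sketch; the constructor shape and field names are approximations of MVCCIncrementalIterator, not the actual code:

```go
func NewMVCCIncrementalIterator(reader engine.Reader, opts IterOptions) *MVCCIncrementalIterator {
	// Create sanityIter BEFORE iter. The Reader is not guaranteed to be a
	// consistent snapshot, so the iterator created second may observe newer
	// state than the first. With this ordering, any discrepancy between the
	// two corresponds to an intent or value outside the TBI's
	// [StartTime, EndTime] bounds from the TBI's own perspective, which
	// advance() can safely ignore.
	sanityIter := reader.NewIterator(engine.IterOptions{UpperBound: opts.UpperBound})
	iter := reader.NewIterator(engine.IterOptions{
		UpperBound:       opts.UpperBound,
		MinTimestampHint: opts.StartTime,
		MaxTimestampHint: opts.EndTime,
	})
	return &MVCCIncrementalIterator{
		iter:       iter,
		sanityIter: sanityIter,
		startTime:  opts.StartTime,
		endTime:    opts.EndTime,
	}
}
```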
I'm still wrapping my head around this, but great find! I think I've been assuming that all iterators created from the same Reader observe a consistent snapshot, which apparently isn't guaranteed. Yes, this will affect all users of MVCCIncrementalIterator, which is currently both full and incremental backup as well as poller-based changefeeds and the initial scan and tabledesc history of any changefeed. There is a bug that makes full backup and changefeed initial scan use tbis, but it's an easy fix. I've also just proposed that incremental backup also stop using tbis (see #35671). tabledesc history polling also shouldn't need tbis. Which leaves only poller-based changefeeds. |
35300: sql: add checks after all referenced columns have been backfilled r=lucy-zhang a=lucy-zhang

Previously, if a column was added with a check constraint that also referenced another column that was public, writes to that public column would erroneously fail (and, in the worst case, result in a panic) while the column being added was not yet public. With this change, the schema changer will now wait to add the check constraint to the table descriptor until after all the columns that were added in the same transaction have been backfilled.

A new optional field has been added to `ConstraintToValidate` for the check constraint itself, so that it can be added to the table descriptor at the correct step in the schema change process. I ended up adding this field to the existing mutation instead of creating a new type of mutation to add the constraint to the descriptor, since it ultimately seemed to me that a mutation that simply adds a schema element in its backfill step would be too inconsistent with what mutations are, especially since all the mutations for a single transaction are basically executed at the same time. To support NOT VALID in the future, we could add more flags to the protobuf to indicate that either the addition of the constraint or the validation should be skipped, so that they can be executed separately.

Fixes #35258, fixes #35193

Release note: None

35682: engineccl/mvcc: fix time-bound iterator's interaction with moving intents r=nvanbenschoten a=nvanbenschoten

Fixes #34819.

349ff61 introduced a workaround for #28358 into MVCCIncrementalIterator. This workaround created a second (non-time-bound) iterator to verify possibly-phantom MVCCMetadata keys during iteration.

We found in #34819 that it is necessary for correctness that sanityIter be created before the original iter. This is because the provided Reader that both iterators are created from may not be a consistent snapshot, so the two iterators could end up observing different information. The hack around sanityCheckMetadataKey only works properly if all possible discrepancies between the two iterators lead to intents and values falling outside of the timestamp range **from the time-bound iterator's perspective**. This allows us to simply ignore discrepancies that we notice in advance().

This commit makes this change. It also adds a test that failed regularly before the fix under stress and no longer fails after the fix.

Release note: None

35689: roachtest: add large node kv tests and batching kv tests r=nvanbenschoten a=nvanbenschoten

This commit adds support for running `kv` with a `--batch` parameter. It then adds the following new roachtest configurations:

- kv0/enc=false/nodes=3/batch=16
- kv95/enc=false/nodes=3/batch=16
- kv0/enc=false/nodes=3/cpu=96
- kv95/enc=false/nodes=3/cpu=96
- kv50/enc=false/nodes=4/cpu=96/batch=64

The last test is currently skipped because of #34241. I confirmed that it triggers the corresponding assertion on both AWS and GCE. My request for more m5d.24xlarge quota just succeeded, but I may need to request more quota for n1-highcpu-96 VMs for these to run nightly.

Release note: None

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
36191: storageccl: disable time-bound iteration optimization in BACKUP r=dt,tbg a=danhhz

Time-bound iterators are a performance optimization that allows us to entirely skip over sstables in RocksDB that don't have data relevant to the time bounds in a request. This can have a dramatic impact on performance, but we've seen a number of extremely subtle and hard-to-detect correctness issues with this (see issues #28358 #34819). As a result, we've decided to skip the optimization everywhere that it isn't absolutely necessary for the feature to work (leaving one place: poller-based changefeeds, which are being phased out anyway). This will both give increased confidence in correctness as well as eliminate any need to consider and investigate time-bound iterators when/if someone hits a correctness bug.

This commit introduces the plumbing necessary for an individual ExportRequest to control whether time-bound iterators are allowed or disallowed. A bool is introduced to the ExportRequest proto that explicitly allows time-bound iterators. This means that in a rolling upgrade, it's possible for an old changefeed-poller node to send a request without the field set to a new node, which sees the unset field as false, disabling the optimization. An alternative is to invert the semantics of the bool (zero-value = enable, true = disable the optimization), but in case any new uses of ExportRequest are introduced, I'd much prefer the zero-value of this field be the safer default of disabled.

As part of the investigation for whether we could turn them off for incremental BACKUP (#35671), I reran some of the initial measurements of time-bound iterator impact on incremental backup. I installed tpcc-1000 on a 3 node n1-standard-16 cluster, then ran a full backup, then ran load for 1 hour, noting the time T. With load running, I ran 6 incremental backups from T alternating between tbi and no-tbi:

tbi incremental backup runtimes: 3m45s, 3m56s, 4m6s
no-tbi incremental backup runtimes: 6m45s, 6m27s, 6m48s

Any impact on normal traffic (latencies etc.) seemed to be in the noise.

Closes #35671.

Release note (enterprise change): In exchange for increased correctness confidence, `BACKUP`s using `INCREMENTAL FROM` now take slightly longer.

Co-authored-by: Daniel Harrison <[email protected]>
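On the plumbing side, a caller would opt in roughly like this; the field name is a guess patterned on the commit message (a bool on the ExportRequest proto whose zero value disables the optimization):

```go
// Only the poller-based changefeed still sets this; BACKUP now leaves it
// false, as does any old-version node that doesn't know about the field.
req := &roachpb.ExportRequest{
	RequestHeader: roachpb.RequestHeader{Key: span.Key, EndKey: span.EndKey},
	StartTime:     startTime,
	EnableTimeBoundIteratorOptimization: true,
}
```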
SHA: https://github.com/cockroachdb/cockroach/commits/965525a5deb3c4fab5ab232e069846ecc08f632d
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1133737&tab=buildLog