kv: effectively eliminate "intent missing and record aborted" ambiguous errors #52566

nvanbenschoten · 2020-08-10T14:35:31Z

Address the following TODO:

cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go

Lines 1027 to 1034 in fdf54eb

    
           // TODO(nvanbenschoten): QueryTxn will materialize an ABORTED transaction 
        
           // record if one does not already exist. If we are certain that no actor 
        
           // will ever persist an ABORTED transaction record after a COMMIT record is 
        
           // GCed and we returned whether the record was synthesized in the QueryTxn 
        
           // response then we could use the existence of an ABORTED transaction record 
        
           // to further isolates the ambiguity caused by the loss of information 
        
           // during intent resolution. If this error becomes a problem, we can explore 
        
           // this option.

While here, we'll also want to make sure that only a transaction's own coordinator will ever GC its record below the TxnCleanupThreshold (1 hour). This seems to already be the case everywhere except for in the rangefeedTxnPusher, which both pushes transactions and GCs their record. We should stop doing the second part.

The text was updated successfully, but these errors were encountered:

Before this patch, the rangefeed txn pushed was GC'ing the records of the transactions that it found to be aborted or committed when it PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do if the txn's coordinator might still be running, and so this patch terminates the practice. If the coordinator is still running and it had STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED or COMMITTED. If such a transition is followed by the GC of the txn record, this causes an ambigous commit error for the coordinator, which can't tell whether the transaction committed (implicitly) or was rolled back. As the code stands, this patch doesn't actually help eliminate the ambiguity in the ABORTED case (it does in the COMMIT case though), because the coordinator doesn't distinguish between an ABORTED record and a missing record; it treats both the same, as ambiguous. But we're going to improve that, and the coordinator is gonna consider an existing record as non-ambiguous. The GC'ed case is going to remain the ambiguous one, and the idea is to avoid that as much as possible by only doing GC from the gc-queue (after an hour) and from the coordinator. Touches cockroachdb#52566 Release note (bug fix): Some rare AmbiguousCommitErrors happening when CDC was used were eliminated.

This patch eliminates an AmbiguousResultError case. When a committer finds an intent to be missing after having STAGED the txn, it tries to figure out if the transaction has been committed or aborted behind its back. It could be either: another actor might have performed recovery and found the txn to be implicitly committed, or the recovery might succeed in preventing an intent from being written and thus the transaction will be aborted. The commiter thus reads the txn record and tries to figure out what the deal is. The case in which a COMMITed txn record is found is unambiguous. The case when the txn record is not found is ambiguous: it could have been GCed, and the status is lost. This case stays ambiguous. The case where the transaction record is found to be ABORTED used to be treated as ambiguous, but this patch makes it unambiguous. It used to be ambiguous for a bad reason: the QueryTxn request used to read the txn record conflated an ABORTED record with a missing record that cannot be created anymore because of the tscache protections. This patch adds more output to QueryTxn, specifying if the record has been found or if it has been synthesized. This allows the DistSender to eliminate the ambiguity when the record is found. This patch relies on the fact that nobody writes an ABORTED record (or, any record) after the record is GC'ed. Fixes cockroachdb#52566 Fixes cockroachdb#52186 One of the issues above is a TPCC failure with an ambiguous error. I'm not sure that this was the ambiguity in question, but it seems likely. Release note (general change): Some cases where AmbiguousResultErrors were produced under high load now return non-ambiguous, retriable errors.

Before this patch, the rangefeed txn pushed was GC'ing the records of the transactions that it found to be aborted or committed when it PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do if the txn's coordinator might still be running, and so this patch terminates the practice. If the coordinator is still running and it had STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED or COMMITTED. If such a transition is followed by the GC of the txn record, this causes an ambigous commit error for the coordinator, which can't tell whether the transaction committed (implicitly) or was rolled back. As the code stands, this patch doesn't actually help eliminate the ambiguity in the ABORTED case (it does in the COMMIT case though), because the coordinator doesn't distinguish between an ABORTED record and a missing record; it treats both the same, as ambiguous. But we're going to improve that, and the coordinator is gonna consider an existing record as non-ambiguous. The GC'ed case is going to remain the ambiguous one, and the idea is to avoid that as much as possible by only doing GC from the gc-queue (after an hour) and from the coordinator. Touches cockroachdb#52566 Release note (bug fix): Some rare AmbiguousCommitErrors happening when CDC was used were eliminated.

52938: sql: implement idle_in_transaction_session_timeout r=RichardJCai a=RichardJCai Release note (sql change): Implement IdleInTransactionSessionTimeout variable to allow terminating sessions that are idle in a transaction past the provides threshold. Set the variable by using SET idle_in_transaction_session_timeout = 'time'. Sessions that are idle in OPEN, ABORTED and DONE(COMMITWAIT) transaction states will be terminated if the user idles longer than the threshold time. Fixes #5924 53041: cli/debug: add SQL-decoding to printed keys r=dt a=dt This adds a `--decode-as-table` flag to `debug keys --values`. When passed a base64 encoded Descriptor, which can be obtained via `select encode(descriptor, 'base64') from system.descriptor where id = x`, it will attempt to decode KVs using a SQL row-fetcher backed by that descriptor and print them as col1=val1, col2=val2,... Release note: none. 53060: backupccl: refactor backup destination resolution r=dt a=pbardea During backup planning we need to resolve the backup destination. This means that we need to: 1. Determine if we're going to resolve to a backup collection or not 2. Determine if we should auto-append 3. Read any encryption options we may need if we determine that we're donig an incremental backup 4. Find the previous backups in the chain if we're doing an incremental backup. Release note: None 53066: opt: add rule to replace ST_Distance with ST_DWithin r=DrewKimball a=DrewKimball `ST_DWithin` is equivalent to the expression `ST_Distance <= x`. Similar reformulations can be made for different comparison operators (<, >, >=). This PR consists of two commits. The first commit adds an internal builtin function `ST_DWithinExclusive`, which is equivalent to `ST_Distance < x` but is able to exit early. The second commit adds two rules that replace expressions of the form `ST_Distance <= x` with either `ST_DWithin` or `ST_DWithinExclusive` depending on the comparison operator used. This replacement is desirable because the `ST_DWithin` function can exit early, while `ST_Distance` cannot do the same. `ST_DWithin` can also be used to generate an inverted index scan. Fixes #52028 Release note: None 53085: colexec: fix dynamic batch behavior in hash joiner with low mem limit r=yuzefovich a=yuzefovich The disk-spilling logic in the hash joiner relies on the assumption that a memory limit error can only occur during the building of the hash table. However, this assumption was recently broken in the case of the low `workmem` limit setting by the dynamic batch size work - it now became possible for an error to occur during output population. As a result, we might have already consumed some batches from the left input and have emitted partial results before we fall back to the external hash joiner, and this would result in some tuples not being emitted at all (if we have a pending batch from the left). This problem is now fixed by restoring the invariant with the introduction of an unlimited allocator into the in-memory hash joiner. The unlimited allocator is used only to populate the output (meaning that we have already successfully built the hash table from the right side) at the phase of the hash join algorithm when we consume the left side one batch at a time and populate the output also one batch at a time, so such choice is acceptable. This additionally allows to properly account for the memory used by the output batch. Note that the only other disk-spilling operator (different sorters that fallback to the external sort) weren't impacted by this because they are able to correctly export the buffered tuples regardless of when the memory limit error occurs. Fixes: #52859. Release note: None 53146: rangefeed: don't delete txn records from the rngfeed pusher r=andreimatei a=andreimatei Before this patch, the rangefeed txn pushed was GC'ing the records of the transactions that it found to be aborted or committed when it PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do if the txn's coordinator might still be running, and so this patch terminates the practice. If the coordinator is still running and it had STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED or COMMITTED. If such a transition is followed by the GC of the txn record, this causes an ambigous commit error for the coordinator, which can't tell whether the transaction committed (implicitly) or was rolled back. As the code stands, this patch doesn't actually help eliminate the ambiguity in the ABORTED case (it does in the COMMIT case though), because the coordinator doesn't distinguish between an ABORTED record and a missing record; it treats both the same, as ambiguous. But we're going to improve that, and the coordinator is gonna consider an existing record as non-ambiguous. The GC'ed case is going to remain the ambiguous one, and the idea is to avoid that as much as possible by only doing GC from the gc-queue (after an hour) and from the coordinator. Touches #52566 Release note (bug fix): Some rare AmbiguousCommitErrors happening when CDC was used were eliminated. Co-authored-by: richardjcai <[email protected]> Co-authored-by: David Taylor <[email protected]> Co-authored-by: Paul Bardea <[email protected]> Co-authored-by: Drew Kimball <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Andrei Matei <[email protected]>

52704: sql: add support for transaction level stats collection r=solongordon a=arulajmani This patch adds the framework to start collecting transaction level statistics and return them on the `statements` end point. Previously, the response on the end point was being faked. This patch replaces the fake response by an actual response with some fields correctly populated. Other fields will be populated in upcoming patches, and currently contain garbage data. The correctly populated fields on `CollectedTransactionStatistics` are: 1. StatementIDs 2. Count 3. ServiceLat 4. App 5. RetryLat 6. MaxRetries 7. CommitLat Release note: None 53156: kvserver,kvclient: disambiguate aborted txn case r=andreimatei a=andreimatei This patch eliminates an AmbiguousResultError case. When a committer finds an intent to be missing after having STAGED the txn, it tries to figure out if the transaction has been committed or aborted behind its back. It could be either: another actor might have performed recovery and found the txn to be implicitly committed, or the recovery might succeed in preventing an intent from being written and thus the transaction will be aborted. The commiter thus reads the txn record and tries to figure out what the deal is. The case in which a COMMITed txn record is found is unambiguous. The case when the txn record is not found is ambiguous: it could have been GCed, and the status is lost. This case stays ambiguous. The case where the transaction record is found to be ABORTED used to be treated as ambiguous, but this patch makes it unambiguous. It used to be ambiguous for a bad reason: the QueryTxn request used to read the txn record conflated an ABORTED record with a missing record that cannot be created anymore because of the tscache protections. This patch adds more output to QueryTxn, specifying if the record has been found or if it has been synthesized. This allows the DistSender to eliminate the ambiguity when the record is found. This patch relies on the fact that nobody writes an ABORTED record (or, any record) after the record is GC'ed. Fixes #52566 Fixes #52186 One of the issues above is a TPCC failure with an ambiguous error. I'm not sure that this was the ambiguity in question, but it seems likely. Release note (general change): Some cases where AmbiguousResultErrors were produced under high load now return non-ambiguous, retriable errors. 53157: build: add libgeos libs to docker image for publish release r=pbardea a=jlinder Before: The script was written to be backportable (without adding the libgeos libs to the docker image). Why: Geospatial arrived in master (20.2) and the libgeos libs need to be included with the docker image. Now: The libs are included in the docker image generated in the teamcity-publish-release.sh script. Release note: None Co-authored-by: Solon Gordon <[email protected]> Co-authored-by: arulajmani <[email protected]> Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: James H. Linder <[email protected]>

Fixes cockroachdb#53189. Fixes cockroachdb#53282. Fixes cockroachdb#53285. Fixes cockroachdb#53469. 3dcb6f1 improved the detection of missing intents during parallel commit attempts to distinguish between certain classes of ambiguous errors and transaction aborted errors. This was a nice improvement, as it dramatically reduced the number of situations where we returned ambiguous errors during normal operation (see cockroachdb#52566). However, in introducing a new location where transaction retry errors could be generated, it accidentally violated the invariant that all transaction retry errors have transaction protos attached to them. This was causing panics in TPC-C roachtests. This commit fixes this issue by properly attaching transaction protos to these new errors, along with any others returned from `detectIntentMissingDueToIntentResolution`. Release justification: bug fix

53589: jobs: improve job adoption r=ajwerner a=ajwerner #### jobs: don't hold mutex during adoption, launch in parallel #### jobs: break up new stages of job lifecycle movement In the PR which adopted the sqlliveness sessions, we shoved all of the stages of adopting jobs into the same stage and we invoked that stage on each adoption interval and on each sent to the adoption channel. These stages are: * Cancel jobs * Serve pause and cancel requests * Delete claims due to dead sessions * Claim jobs * Process claimed jobs This is problematic for tests which send on the adoption channel at a high rate. One important thing to note is that all jobs which are sent on the adoption channel are already claimed. After this PR we move the first three steps above into the cancellation loop we were already running. We also increase the default interval for that loop as it was exceedingly frequent at 1s for no obvious reason. Much of the testing changes are due to this cancelation loop duration change. The tests in this package now run 3x faster (10s vs 30s). Then, upon sends on the adoption channel, we just process claimed jobs. When the adoption interval rolls around, then we attempt to both claim and process jobs. Release justification: bug fixes and low-risk updates to new functionality Release note: None 53697: kv: attach txn to error from detectIntentMissingDueToIntentResolution r=nvanbenschoten a=nvanbenschoten Fixes #53189. Fixes #53282. Fixes #53285. Fixes #53469. 3dcb6f1 improved the detection of missing intents during parallel commit attempts to distinguish between certain classes of ambiguous errors and transaction aborted errors. This was a nice improvement, as it dramatically reduced the number of situations where we returned ambiguous errors during normal operation (see #52566). However, in introducing a new location where transaction retry errors could be generated, it accidentally violated the invariant that all transaction retry errors have transaction protos attached to them. This was causing panics in TPC-C roachtests. This commit fixes this issue by properly attaching transaction protos to these new errors, along with any others returned from `detectIntentMissingDueToIntentResolution`. Release justification: bug fix 53708: builtins: implement ST_MemCollect and ST_MemUnion r=otan a=erikgrinaker These are implemented as aliases for ST_Collect and ST_Union. In PostGIS these are memory-optimized versions, but for now it should be sufficient to simply make them functional. Release justification: low risk, high benefit changes to existing functionality Release note (sql change): Implement the geometry aggregate builtins `ST_MemCollect` and `ST_MemUnion`. Closes #48984. Closes #48986. Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Erik Grinaker <[email protected]>

nvanbenschoten assigned andreimatei Aug 10, 2020

This was referenced Aug 10, 2020

workload: should tolerate 40003 ambiguous error without --tolerate-errors #49865

Closed

workload: tolerate ambiguous commit errors with --tolerate-ambiguous-errors #52555

Closed

andreimatei mentioned this issue Aug 20, 2020

rangefeed: don't delete txn records from the rngfeed pusher #53146

Merged

andreimatei mentioned this issue Aug 20, 2020

kvserver,kvclient: disambiguate aborted txn case #53156

Merged

craig bot closed this as completed in 3dcb6f1 Aug 21, 2020

nvanbenschoten mentioned this issue Aug 31, 2020

roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #53469

Closed

nvanbenschoten mentioned this issue Aug 31, 2020

kv: attach txn to error from detectIntentMissingDueToIntentResolution #53697

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: effectively eliminate "intent missing and record aborted" ambiguous errors #52566

kv: effectively eliminate "intent missing and record aborted" ambiguous errors #52566

nvanbenschoten commented Aug 10, 2020

kv: effectively eliminate "intent missing and record aborted" ambiguous errors #52566

kv: effectively eliminate "intent missing and record aborted" ambiguous errors #52566

Comments

nvanbenschoten commented Aug 10, 2020