-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: effectively eliminate "intent missing and record aborted" ambiguous errors #52566
Labels
A-kv-transactions
Relating to MVCC and the transactional model.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
S-3-ux-surprise
Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
Comments
nvanbenschoten
added
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
S-3-ux-surprise
Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
A-kv-transactions
Relating to MVCC and the transactional model.
labels
Aug 10, 2020
This was referenced Aug 10, 2020
andreimatei
added a commit
to andreimatei/cockroach
that referenced
this issue
Aug 20, 2020
Before this patch, the rangefeed txn pushed was GC'ing the records of the transactions that it found to be aborted or committed when it PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do if the txn's coordinator might still be running, and so this patch terminates the practice. If the coordinator is still running and it had STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED or COMMITTED. If such a transition is followed by the GC of the txn record, this causes an ambigous commit error for the coordinator, which can't tell whether the transaction committed (implicitly) or was rolled back. As the code stands, this patch doesn't actually help eliminate the ambiguity in the ABORTED case (it does in the COMMIT case though), because the coordinator doesn't distinguish between an ABORTED record and a missing record; it treats both the same, as ambiguous. But we're going to improve that, and the coordinator is gonna consider an existing record as non-ambiguous. The GC'ed case is going to remain the ambiguous one, and the idea is to avoid that as much as possible by only doing GC from the gc-queue (after an hour) and from the coordinator. Touches cockroachdb#52566 Release note (bug fix): Some rare AmbiguousCommitErrors happening when CDC was used were eliminated.
andreimatei
added a commit
to andreimatei/cockroach
that referenced
this issue
Aug 20, 2020
Before this patch, the rangefeed txn pushed was GC'ing the records of the transactions that it found to be aborted or committed when it PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do if the txn's coordinator might still be running, and so this patch terminates the practice. If the coordinator is still running and it had STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED or COMMITTED. If such a transition is followed by the GC of the txn record, this causes an ambigous commit error for the coordinator, which can't tell whether the transaction committed (implicitly) or was rolled back. As the code stands, this patch doesn't actually help eliminate the ambiguity in the ABORTED case (it does in the COMMIT case though), because the coordinator doesn't distinguish between an ABORTED record and a missing record; it treats both the same, as ambiguous. But we're going to improve that, and the coordinator is gonna consider an existing record as non-ambiguous. The GC'ed case is going to remain the ambiguous one, and the idea is to avoid that as much as possible by only doing GC from the gc-queue (after an hour) and from the coordinator. Touches cockroachdb#52566 Release note (bug fix): Some rare AmbiguousCommitErrors happening when CDC was used were eliminated.
andreimatei
added a commit
to andreimatei/cockroach
that referenced
this issue
Aug 20, 2020
This patch eliminates an AmbiguousResultError case. When a committer finds an intent to be missing after having STAGED the txn, it tries to figure out if the transaction has been committed or aborted behind its back. It could be either: another actor might have performed recovery and found the txn to be implicitly committed, or the recovery might succeed in preventing an intent from being written and thus the transaction will be aborted. The commiter thus reads the txn record and tries to figure out what the deal is. The case in which a COMMITed txn record is found is unambiguous. The case when the txn record is not found is ambiguous: it could have been GCed, and the status is lost. This case stays ambiguous. The case where the transaction record is found to be ABORTED used to be treated as ambiguous, but this patch makes it unambiguous. It used to be ambiguous for a bad reason: the QueryTxn request used to read the txn record conflated an ABORTED record with a missing record that cannot be created anymore because of the tscache protections. This patch adds more output to QueryTxn, specifying if the record has been found or if it has been synthesized. This allows the DistSender to eliminate the ambiguity when the record is found. This patch relies on the fact that nobody writes an ABORTED record (or, any record) after the record is GC'ed. Fixes cockroachdb#52566 Fixes cockroachdb#52186 One of the issues above is a TPCC failure with an ambiguous error. I'm not sure that this was the ambiguity in question, but it seems likely. Release note (general change): Some cases where AmbiguousResultErrors were produced under high load now return non-ambiguous, retriable errors.
andreimatei
added a commit
to andreimatei/cockroach
that referenced
this issue
Aug 20, 2020
Before this patch, the rangefeed txn pushed was GC'ing the records of the transactions that it found to be aborted or committed when it PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do if the txn's coordinator might still be running, and so this patch terminates the practice. If the coordinator is still running and it had STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED or COMMITTED. If such a transition is followed by the GC of the txn record, this causes an ambigous commit error for the coordinator, which can't tell whether the transaction committed (implicitly) or was rolled back. As the code stands, this patch doesn't actually help eliminate the ambiguity in the ABORTED case (it does in the COMMIT case though), because the coordinator doesn't distinguish between an ABORTED record and a missing record; it treats both the same, as ambiguous. But we're going to improve that, and the coordinator is gonna consider an existing record as non-ambiguous. The GC'ed case is going to remain the ambiguous one, and the idea is to avoid that as much as possible by only doing GC from the gc-queue (after an hour) and from the coordinator. Touches cockroachdb#52566 Release note (bug fix): Some rare AmbiguousCommitErrors happening when CDC was used were eliminated.
craig bot
pushed a commit
that referenced
this issue
Aug 20, 2020
52938: sql: implement idle_in_transaction_session_timeout r=RichardJCai a=RichardJCai Release note (sql change): Implement IdleInTransactionSessionTimeout variable to allow terminating sessions that are idle in a transaction past the provides threshold. Set the variable by using SET idle_in_transaction_session_timeout = 'time'. Sessions that are idle in OPEN, ABORTED and DONE(COMMITWAIT) transaction states will be terminated if the user idles longer than the threshold time. Fixes #5924 53041: cli/debug: add SQL-decoding to printed keys r=dt a=dt This adds a `--decode-as-table` flag to `debug keys --values`. When passed a base64 encoded Descriptor, which can be obtained via `select encode(descriptor, 'base64') from system.descriptor where id = x`, it will attempt to decode KVs using a SQL row-fetcher backed by that descriptor and print them as col1=val1, col2=val2,... Release note: none. 53060: backupccl: refactor backup destination resolution r=dt a=pbardea During backup planning we need to resolve the backup destination. This means that we need to: 1. Determine if we're going to resolve to a backup collection or not 2. Determine if we should auto-append 3. Read any encryption options we may need if we determine that we're donig an incremental backup 4. Find the previous backups in the chain if we're doing an incremental backup. Release note: None 53066: opt: add rule to replace ST_Distance with ST_DWithin r=DrewKimball a=DrewKimball `ST_DWithin` is equivalent to the expression `ST_Distance <= x`. Similar reformulations can be made for different comparison operators (<, >, >=). This PR consists of two commits. The first commit adds an internal builtin function `ST_DWithinExclusive`, which is equivalent to `ST_Distance < x` but is able to exit early. The second commit adds two rules that replace expressions of the form `ST_Distance <= x` with either `ST_DWithin` or `ST_DWithinExclusive` depending on the comparison operator used. This replacement is desirable because the `ST_DWithin` function can exit early, while `ST_Distance` cannot do the same. `ST_DWithin` can also be used to generate an inverted index scan. Fixes #52028 Release note: None 53085: colexec: fix dynamic batch behavior in hash joiner with low mem limit r=yuzefovich a=yuzefovich The disk-spilling logic in the hash joiner relies on the assumption that a memory limit error can only occur during the building of the hash table. However, this assumption was recently broken in the case of the low `workmem` limit setting by the dynamic batch size work - it now became possible for an error to occur during output population. As a result, we might have already consumed some batches from the left input and have emitted partial results before we fall back to the external hash joiner, and this would result in some tuples not being emitted at all (if we have a pending batch from the left). This problem is now fixed by restoring the invariant with the introduction of an unlimited allocator into the in-memory hash joiner. The unlimited allocator is used only to populate the output (meaning that we have already successfully built the hash table from the right side) at the phase of the hash join algorithm when we consume the left side one batch at a time and populate the output also one batch at a time, so such choice is acceptable. This additionally allows to properly account for the memory used by the output batch. Note that the only other disk-spilling operator (different sorters that fallback to the external sort) weren't impacted by this because they are able to correctly export the buffered tuples regardless of when the memory limit error occurs. Fixes: #52859. Release note: None 53146: rangefeed: don't delete txn records from the rngfeed pusher r=andreimatei a=andreimatei Before this patch, the rangefeed txn pushed was GC'ing the records of the transactions that it found to be aborted or committed when it PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do if the txn's coordinator might still be running, and so this patch terminates the practice. If the coordinator is still running and it had STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED or COMMITTED. If such a transition is followed by the GC of the txn record, this causes an ambigous commit error for the coordinator, which can't tell whether the transaction committed (implicitly) or was rolled back. As the code stands, this patch doesn't actually help eliminate the ambiguity in the ABORTED case (it does in the COMMIT case though), because the coordinator doesn't distinguish between an ABORTED record and a missing record; it treats both the same, as ambiguous. But we're going to improve that, and the coordinator is gonna consider an existing record as non-ambiguous. The GC'ed case is going to remain the ambiguous one, and the idea is to avoid that as much as possible by only doing GC from the gc-queue (after an hour) and from the coordinator. Touches #52566 Release note (bug fix): Some rare AmbiguousCommitErrors happening when CDC was used were eliminated. Co-authored-by: richardjcai <[email protected]> Co-authored-by: David Taylor <[email protected]> Co-authored-by: Paul Bardea <[email protected]> Co-authored-by: Drew Kimball <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Andrei Matei <[email protected]>
craig bot
pushed a commit
that referenced
this issue
Aug 21, 2020
52704: sql: add support for transaction level stats collection r=solongordon a=arulajmani This patch adds the framework to start collecting transaction level statistics and return them on the `statements` end point. Previously, the response on the end point was being faked. This patch replaces the fake response by an actual response with some fields correctly populated. Other fields will be populated in upcoming patches, and currently contain garbage data. The correctly populated fields on `CollectedTransactionStatistics` are: 1. StatementIDs 2. Count 3. ServiceLat 4. App 5. RetryLat 6. MaxRetries 7. CommitLat Release note: None 53156: kvserver,kvclient: disambiguate aborted txn case r=andreimatei a=andreimatei This patch eliminates an AmbiguousResultError case. When a committer finds an intent to be missing after having STAGED the txn, it tries to figure out if the transaction has been committed or aborted behind its back. It could be either: another actor might have performed recovery and found the txn to be implicitly committed, or the recovery might succeed in preventing an intent from being written and thus the transaction will be aborted. The commiter thus reads the txn record and tries to figure out what the deal is. The case in which a COMMITed txn record is found is unambiguous. The case when the txn record is not found is ambiguous: it could have been GCed, and the status is lost. This case stays ambiguous. The case where the transaction record is found to be ABORTED used to be treated as ambiguous, but this patch makes it unambiguous. It used to be ambiguous for a bad reason: the QueryTxn request used to read the txn record conflated an ABORTED record with a missing record that cannot be created anymore because of the tscache protections. This patch adds more output to QueryTxn, specifying if the record has been found or if it has been synthesized. This allows the DistSender to eliminate the ambiguity when the record is found. This patch relies on the fact that nobody writes an ABORTED record (or, any record) after the record is GC'ed. Fixes #52566 Fixes #52186 One of the issues above is a TPCC failure with an ambiguous error. I'm not sure that this was the ambiguity in question, but it seems likely. Release note (general change): Some cases where AmbiguousResultErrors were produced under high load now return non-ambiguous, retriable errors. 53157: build: add libgeos libs to docker image for publish release r=pbardea a=jlinder Before: The script was written to be backportable (without adding the libgeos libs to the docker image). Why: Geospatial arrived in master (20.2) and the libgeos libs need to be included with the docker image. Now: The libs are included in the docker image generated in the teamcity-publish-release.sh script. Release note: None Co-authored-by: Solon Gordon <[email protected]> Co-authored-by: arulajmani <[email protected]> Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: James H. Linder <[email protected]>
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Aug 31, 2020
Fixes cockroachdb#53189. Fixes cockroachdb#53282. Fixes cockroachdb#53285. Fixes cockroachdb#53469. 3dcb6f1 improved the detection of missing intents during parallel commit attempts to distinguish between certain classes of ambiguous errors and transaction aborted errors. This was a nice improvement, as it dramatically reduced the number of situations where we returned ambiguous errors during normal operation (see cockroachdb#52566). However, in introducing a new location where transaction retry errors could be generated, it accidentally violated the invariant that all transaction retry errors have transaction protos attached to them. This was causing panics in TPC-C roachtests. This commit fixes this issue by properly attaching transaction protos to these new errors, along with any others returned from `detectIntentMissingDueToIntentResolution`. Release justification: bug fix
craig bot
pushed a commit
that referenced
this issue
Sep 1, 2020
53589: jobs: improve job adoption r=ajwerner a=ajwerner #### jobs: don't hold mutex during adoption, launch in parallel #### jobs: break up new stages of job lifecycle movement In the PR which adopted the sqlliveness sessions, we shoved all of the stages of adopting jobs into the same stage and we invoked that stage on each adoption interval and on each sent to the adoption channel. These stages are: * Cancel jobs * Serve pause and cancel requests * Delete claims due to dead sessions * Claim jobs * Process claimed jobs This is problematic for tests which send on the adoption channel at a high rate. One important thing to note is that all jobs which are sent on the adoption channel are already claimed. After this PR we move the first three steps above into the cancellation loop we were already running. We also increase the default interval for that loop as it was exceedingly frequent at 1s for no obvious reason. Much of the testing changes are due to this cancelation loop duration change. The tests in this package now run 3x faster (10s vs 30s). Then, upon sends on the adoption channel, we just process claimed jobs. When the adoption interval rolls around, then we attempt to both claim and process jobs. Release justification: bug fixes and low-risk updates to new functionality Release note: None 53697: kv: attach txn to error from detectIntentMissingDueToIntentResolution r=nvanbenschoten a=nvanbenschoten Fixes #53189. Fixes #53282. Fixes #53285. Fixes #53469. 3dcb6f1 improved the detection of missing intents during parallel commit attempts to distinguish between certain classes of ambiguous errors and transaction aborted errors. This was a nice improvement, as it dramatically reduced the number of situations where we returned ambiguous errors during normal operation (see #52566). However, in introducing a new location where transaction retry errors could be generated, it accidentally violated the invariant that all transaction retry errors have transaction protos attached to them. This was causing panics in TPC-C roachtests. This commit fixes this issue by properly attaching transaction protos to these new errors, along with any others returned from `detectIntentMissingDueToIntentResolution`. Release justification: bug fix 53708: builtins: implement ST_MemCollect and ST_MemUnion r=otan a=erikgrinaker These are implemented as aliases for ST_Collect and ST_Union. In PostGIS these are memory-optimized versions, but for now it should be sufficient to simply make them functional. Release justification: low risk, high benefit changes to existing functionality Release note (sql change): Implement the geometry aggregate builtins `ST_MemCollect` and `ST_MemUnion`. Closes #48984. Closes #48986. Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Erik Grinaker <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv-transactions
Relating to MVCC and the transactional model.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
S-3-ux-surprise
Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
Address the following TODO:
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go
Lines 1027 to 1034 in fdf54eb
While here, we'll also want to make sure that only a transaction's own coordinator will ever GC its record below the
TxnCleanupThreshold
(1 hour). This seems to already be the case everywhere except for in therangefeedTxnPusher
, which both pushes transactions and GCs their record. We should stop doing the second part.The text was updated successfully, but these errors were encountered: