roachtest: kv/contention/nodes=4 failed #53518
Comments
@nvanbenschoten can you take a look at this one on Monday as well? Seems less dramatic than the conc mgr panic but it's also on the 20.2-beta list.
FWIW the exact error is
I was looking into this last week. There's definitely something real here. About 20% of tests seem to see 10 minute transaction stalls repeatedly, which is what we would expect if transactions end up waiting for someone else to expire. This could be an indication of an abandoned transaction that is not cleaned up eagerly. Or something else. I also stumbled upon #53677 while looking into this, which is moderately concerning.
TL;DR: not a release blocker.

I think I see what's going on here. In #45568, we added the

What we were seeing in this test is that this situation would occasionally cause issues if an in-flight

So we could fix this by ensuring that we delete any existing transaction record on a 1PC transaction, but this actually seems misguided. Doing so is an added cost that we only need with

I'll also make a note that heartbeating a transaction that is attempting a 1PC batch is not only unnecessary, but is also unsafe. I don't think it could actually cause correctness problems because a 1PC txn leaves no intents, but it opens the door to all kinds of weird issues like this.
I take this back. As of #53132, we actually begin heartbeating before the 1PC batch in this test. This is because we acquire unreplicated locks during the initial

For the same reason we need to inform the concurrency manager of an updated txn on 1PC transactions, we'll also need to handle the case where a 1PC transaction already has a PENDING record:

cockroach/pkg/kv/kvserver/replica_write.go Lines 485 to 498 in e8fe416

This is an important optimization that allows us to acquire unreplicated locks and then remove them during a 1PC batch. It's the only reason we perform decently well on YCSB these days. But as mentioned above, this isn't really safe. So what we could do is perform a read for a transaction record in

So to avoid a new disk read for 1PC transactions that have not previously acquired unreplicated locks, my plan is:
Luckily, this is all an optimization to avoid waiting out a transaction's expiration, so we can introduce it and backport it without a migration. cc. @andreimatei in case you have any thoughts here.
Not attempting 1PC evaluation if we've created the txn record sounds right to me. What's the alternative? Some bastardized 1PC evaluation that also deletes the txn record? Seems weird to me.
SGTM as well.
Fixes cockroachdb#53403.
Fixes cockroachdb#53518.
Fixes cockroachdb#53772.
Fixes cockroachdb#54094.

This commit disables one-phase commit evaluation for transactions with existing PENDING transaction records. As we saw in cockroachdb#53518 (comment), failing to do so can lead to a transaction that commits on the one-phase commit fast-path but still has a PENDING transaction record.

It doesn't seem like this can actually cause serious correctness issues today other than in the presence of replays because a 1PC transaction does not have remote intents (by definition). However, it was creating the appearance of abandoned transaction records in `kv/contention/nodes=4` and causing that test to fail.

This commit needs a backport to release-20.2 and release-20.1. This was not an issue before release-20.1 because before then, we would never begin a transaction's heartbeat loop for 1PC transactions. This changed in v20.1 because of unreplicated locks. We allow transactions that acquire unreplicated locks to still hit the one-phase commit fast-path, but we also need to start heartbeating once a transaction has acquired any locks so that it doesn't get aborted by conflicting transactions.

In the vast majority of these cases, the heartbeat loop will never actually fire (for any txn that takes less than 1s), so with this change, we'll still be able to perform a 1PC evaluation. However, this is adding in a disk read for those cases, which is a little disappointing but doesn't seem easy to avoid without disabling the heartbeat loop before issuing the 1PC batch (another alternative, happy to discuss). The upside of this is that we now have enough information on the server to avoid a bit of work for 1PC txns that have not previously acquired locks (see TODO in evaluate1PC).
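The decision described in this commit message can be sketched roughly as follows. This is an illustrative model only: `commitRequest`, `recordReader`, and `canUse1PC` are hypothetical names, and the assumption that the record read can be skipped when the client never started heartbeating is drawn from the discussion above, not from the actual code in `replica_write.go`.

```go
package main

import "fmt"

// recordState is a hypothetical summary of what a transaction record
// lookup found on disk.
type recordState int

const (
	recordAbsent recordState = iota
	recordPending
)

// commitRequest is a hypothetical, simplified view of a committing batch.
type commitRequest struct {
	allWritesInBatch bool // the batch carries every write plus the commit
	heartbeating     bool // the client started its heartbeat loop (assumed flag)
}

// recordReader stands in for the disk read of the transaction record.
type recordReader func() recordState

// canUse1PC reports whether the one-phase commit fast-path may be used.
// A transaction that was never heartbeating cannot have written a record,
// so the read is skipped. Otherwise a PENDING record forces the slow path,
// which finalizes the record instead of leaving it behind.
func canUse1PC(req commitRequest, read recordReader) bool {
	if !req.allWritesInBatch {
		return false
	}
	if !req.heartbeating {
		return true
	}
	return read() != recordPending
}

func main() {
	pendingOnDisk := func() recordState { return recordPending }
	nothingOnDisk := func() recordState { return recordAbsent }

	req := commitRequest{allWritesInBatch: true, heartbeating: true}
	fmt.Println("heartbeat fired, record present:", canUse1PC(req, pendingOnDisk))
	fmt.Println("heartbeat never fired:", canUse1PC(req, nothingOnDisk))
}
```

The second case is the common one mentioned above: the heartbeat loop was started but never fired, so the read finds nothing and the fast-path is still taken.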
54230: kv: disallow 1PC evaluation when heartbeating and txn record present r=nvanbenschoten a=nvanbenschoten

Fixes #53403.
Fixes #53518.
Fixes #53772.
Fixes #54094.

This commit disables one-phase commit evaluation for transactions with existing PENDING transaction records. As we saw in #53518 (comment), failing to do so can lead to a transaction that commits on the one-phase commit fast-path but still has a PENDING transaction record.

It doesn't seem like this can actually cause serious correctness issues today other than in the presence of replays because a 1PC transaction does not have remote intents (by definition). However, it was creating the appearance of abandoned transaction records in `kv/contention/nodes=4` and causing that test to fail.

This commit needs a backport to release-20.2 and release-20.1. This was not an issue before release-20.1 because before then, we would never begin a transaction's heartbeat loop for 1PC transactions. This changed in v20.1 because of unreplicated locks. We allow transactions that acquire unreplicated locks to still hit the one-phase commit fast-path, but we also need to start heartbeating once a transaction has acquired any locks so that it doesn't get aborted by conflicting transactions.

In the vast majority of these cases, the heartbeat loop will never actually fire (for any txn that takes less than 1s), so with this change, we'll still be able to perform a 1PC evaluation. However, this is adding in a disk read for those cases, which is a little disappointing but doesn't seem easy to avoid without disabling the heartbeat loop before issuing the 1PC batch (another alternative, happy to discuss). The upside of this is that we now have enough information on the server to avoid a bit of work for 1PC txns that have not previously acquired locks (see TODO in evaluate1PC).

54325: opt: unwrap explain.Node in ConstructScanBuffer r=RaduBerinde a=RaduBerinde

When the new explain infrastructure is in use, the plan is built against an explain.Factory but the "inner" recursive CTE plan is built against a normal factory. This leads to an internal error. To avoid this, we unwrap the node in `ConstructScanBuffer`.

Note that the new explain infrastructure is used automatically for the first instance of a query fingerprint, in order to populate the plan in the UI.

Fixes #54324.

Release note (bug fix): fixed an internal error in some cases when recursive CTEs are used.

54350: roachtest: disable load-based splitting in copy/bank/rows=10000000,nodes=9,txn=false r=nvanbenschoten a=nvanbenschoten

Fixes #54301.

Speculative fix for that roachtest.

54351: workload/schemachange: add enum support r=otan a=ajwerner

This commit adds support for creating enums, adding enum columns, and using enums to insert data into rows. This found a bug so I'm pleased.

Release note: None

54352: kv: increment bytesSent, not batchSize in kvBatchSnapshotStrategy.sendBatch r=nvanbenschoten a=nvanbenschoten

Partly responsible for #54311.

This commit fixes a bug introduced in #48579 where the snapshot batch size was increased when each batch was sent instead of the bytesSent metric. This had two effects:
1. it undermined the memory footprint limit (256 KB) placed on snapshot senders by doubling the batch size on each subsequent batch.
2. it failed to track the snapshot data rate properly, so the log message introduced in #48579 always contained "0 B/s".

This needs to be backported to release-20.1 and release-20.2.

Co-authored-by: Nathan VanBenschoten
Co-authored-by: Radu Berinde
Co-authored-by: Andrew Werner
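The two effects listed for 54352 can be reproduced with a small model. This is a sketch under assumptions: `snapshotSender`, `sendBatchBuggy`, `sendBatchFixed`, and `run` are hypothetical names, and each batch is assumed to be filled up to the current size limit before being sent. It is not the code of `kvBatchSnapshotStrategy.sendBatch`.

```go
package main

import "fmt"

// limit is the intended per-batch memory footprint (256 KB, per the PR text).
const limit int64 = 256 << 10

// snapshotSender is a hypothetical, simplified snapshot sender.
type snapshotSender struct {
	batchSize int64 // maximum bytes buffered before a batch is sent
	bytesSent int64 // running total used to compute the data rate
}

// sendBatchBuggy models the bug: the limit grows instead of the counter.
func (s *snapshotSender) sendBatchBuggy(n int64) { s.batchSize += n }

// sendBatchFixed models the fix: only the bytesSent counter grows.
func (s *snapshotSender) sendBatchFixed(n int64) { s.bytesSent += n }

// run sends the given number of batches, each filled to the current limit.
func run(send func(*snapshotSender, int64), batches int) snapshotSender {
	s := snapshotSender{batchSize: limit}
	for i := 0; i < batches; i++ {
		send(&s, s.batchSize)
	}
	return s
}

func main() {
	buggy := run((*snapshotSender).sendBatchBuggy, 3)
	fixed := run((*snapshotSender).sendBatchFixed, 3)
	fmt.Println("buggy limit / sent:", buggy.batchSize, buggy.bytesSent)
	fmt.Println("fixed limit / sent:", fixed.batchSize, fixed.bytesSent)
}
```

In the buggy variant the limit doubles on every batch while the sent counter stays at zero, which matches both the undermined footprint limit and the "0 B/s" log message described above.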
(roachtest).kv/contention/nodes=4 failed on provisional_202008261913_v20.2.0-beta.1@eaa939ce6548a54a23970814ff00f30ad87680ac:
Artifacts: /kv/contention/nodes=4
Related:
#53403 roachtest: kv/contention/nodes=4 failed [C-test-failure, O-roachtest, O-robot, branch-master, release-blocker]
#52878 roachtest: kv/contention/nodes=4 failed [C-test-failure, O-roachtest, O-robot, branch-provisional_202008151325_v19.2.10, release-blocker]
#45698 roachtest: kv/contention/nodes=4 failed [C-test-failure, O-roachtest, O-robot, branch-release-19.2, release-blocker]
See this test on roachdash
powered by pkg/cmd/internal/issues