
sql: deleting rows from table A hangs after failed insert transaction to related table B #48790

Closed
andy-kimball opened this issue May 13, 2020 · 19 comments · Fixed by #49218
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-investigation Further steps needed to qualify. C-label will change.


@andy-kimball
Contributor

andy-kimball commented May 13, 2020

Repro:

root@:26257/defaultdb> CREATE TABLE wallet (
    id bigserial primary key,
    name text not null,
    gender int,
    email text,
    first_name text,
    last_name text,
    creation_date timestamp not null,
    situation int,
    balance decimal not null,
    is_blocked bool,
    INDEX (name),
    INDEX (situation),
    INDEX (is_blocked),
    INDEX (balance)
);
CREATE TABLE

root@:26257/defaultdb> insert into wallet (name, creation_date, balance) select 'foo', now(), 1.0 from generate_series(1,100000);
INSERT 100000

root@:26257/defaultdb> CREATE TABLE transaction (
    id bigserial primary key,
    sender_id bigint REFERENCES wallet(id) ON DELETE CASCADE,
    receiver_id bigint REFERENCES wallet(id) ON DELETE CASCADE,
    amount decimal not null,
    creation_date timestamp not null,
    last_update timestamp,
    schedule_date timestamp,
    status int,
    comment text,
    linked_trans_id bigint REFERENCES transaction(id) ON DELETE CASCADE,
    c1 text,
    c2 text,
    c3 text,
    INDEX (sender_id),
    INDEX (receiver_id),
    INDEX (linked_trans_id)
);
CREATE TABLE

root@:26257/defaultdb> insert into transaction (sender_id, receiver_id, amount, creation_date, linked_trans_id) select i%100000+1,(i+1)%100000+1,1.0,now(),NULL from generate_series(1,1000000) v(i);
ERROR: insert on table "transaction" violates foreign key constraint "fk_sender_id_ref_wallet"
SQLSTATE: 23503
DETAIL: Key (sender_id)=(2) is not present in table "wallet".

root@:26257/defaultdb> delete from wallet where 1=1;

The last statement hangs indefinitely (or at least longer than the 5 minutes I waited). By contrast, if I don't run the insert into transaction statement, the delete from wallet where 1=1 completes in ~30s.

@blathers-crl

blathers-crl bot commented May 13, 2020

Hi @andy-kimball, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@andy-kimball
Contributor Author

@nvanbenschoten, does this look like a KV bug or a SQL bug to you? Maybe intents not getting cleaned up after abort? Or maybe some kind of degenerate N**2 algorithm when #rows deleted is so large?

@nvanbenschoten
Member

Which version of Cockroach are you running? Is there anything else going on in this cluster? What happens if you wait a few seconds after the failed insert?

This might be an instance of #36876 - which at one point I had tried to address with #31664. In that issue, we see that intent resolution for the cleaned-up transaction occurs one at a time. This was probably even made a little worse by the lockTable, which added a 10ms delay before pushing. This is discussed in:

// This is set to a short duration to ensure that we quickly detect failed
// transaction coordinators that have abandoned one or many locks. We don't
// want to wait out a long timeout on each of these locks to detect that
// they are abandoned. However, we also don't want to push immediately in
// cases where the lock is going to be resolved shortly.
//
// We could increase this default to somewhere on the order of the
// transaction heartbeat timeout (5s) if we had a better way to avoid paying
// the cost on each of a transaction's abandoned locks and instead only pay
// it once per abandoned transaction per range or per node. This could come
// in a few different forms, including:
// - a per-node cache of recently detected abandoned transaction IDs
// - a per-range reverse index from transaction ID to locked keys
//
// TODO(nvanbenschoten): increase this default value.

One thing you can try is running SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '0'; and seeing if that changes anything.

nvanbenschoten self-assigned this May 13, 2020
nvanbenschoten added the A-kv-transactions and C-investigation labels May 13, 2020
@nvanbenschoten
Member

I'm able to reproduce. The DELETE took 5m33s to complete.

I think the new lockTableWaiter structure can help us out a lot here, actually! Part of the reason why #31664 was so tricky was that we didn't have a central place where conflict resolution was performed and, therefore, needed to pass a "known finalized transaction" cache around on the transaction proto itself. The lockTable and lockTableWaiter improve upon this substantially. We now have a single function (lockTableWaiter.WaitOn) which handles conflict resolution for all conflicting intents on a range for a request, all at once. So we should be able to maintain such a cache as a temporary stack value, allowing us to skip this delay and the transaction record push for all but the first abandoned intent that a pushing request finds. Since we store the cache on the stack, we avoid all questions of cache sizing, lifetimes, evictions, etc. We also avoid any cross-request synchronization. I'll prototype this and see if it helps.
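To make that concrete, here's a rough, self-contained Go sketch of the control flow I have in mind; `waiter`, `lockInfo`, `pushTxnRecord`, and `resolveIntent` are hypothetical stand-ins, not the real lockTableWaiter API.

```go
package main

import (
	"fmt"
	"time"
)

type txnID string

type lockInfo struct {
	key   string
	owner txnID
}

// waiter models the request-scoped state that lockTableWaiter.WaitOn could
// keep on the stack while it iterates over conflicting locks on a range.
type waiter struct {
	finalized map[txnID]bool // txns already proven aborted/committed
}

const livenessPushDelay = 10 * time.Millisecond

// pushTxnRecord stands in for the RPC that checks the pushee's txn record.
// In this toy it always reports the txn as finalized (i.e. abandoned).
func pushTxnRecord(txnID) bool {
	time.Sleep(time.Millisecond) // stand-in for RPC latency
	return true
}

// resolveIntent stands in for a ResolveIntent request.
func resolveIntent(lockInfo) {}

// waitOn handles one conflicting lock. The liveness delay and the txn record
// push are skipped when the lock's owner is already known to be finalized.
func (w *waiter) waitOn(l lockInfo) {
	if !w.finalized[l.owner] {
		time.Sleep(livenessPushDelay)
		if pushTxnRecord(l.owner) {
			w.finalized[l.owner] = true
		}
	}
	resolveIntent(l)
}

func main() {
	w := &waiter{finalized: make(map[txnID]bool)}
	start := time.Now()
	for i := 0; i < 100; i++ {
		w.waitOn(lockInfo{key: fmt.Sprintf("k%d", i), owner: "abandoned-txn"})
	}
	// Only the first lock pays the 10ms delay and the push; the other 99 go
	// straight to intent resolution.
	fmt.Println("handled 100 locks in", time.Since(start))
}
```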

@andy-kimball
Contributor Author

Ha, so I stopped the transaction just before it would have completed. I also tried SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '0', and it does indeed "fix" the problem. But if that setting is only 10ms by default, how does that add up to minutes of extra time with only 1M abandoned rows?

@nvanbenschoten
Member

What happens if you wait a few seconds after the failed insert?

I tried this out. I waited 5 minutes before running the DELETE and sure enough, the DELETE only took 58s. I'm sure if I waited a bit longer, all of the intents would have been cleaned up and it would have been instantaneous.

how does that add up to minutes of extra time with only 1M abandoned rows?

Because we wait out that 10ms delay on each abandoned intent before pushing that intent's transaction record. Only then do we finally resolve each intent. The known-finalized cache would avoid the first two steps for all but the first intent from a given transaction that a request finds on a range. So it would cut the time down from:

1M * (10ms + latency(push_txn_record) + latency(resolve_intent))

to

10ms + latency(push_txn_record) + 1M * latency(resolve_intent)
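For illustration, here's that arithmetic as a small Go snippet, with made-up 1ms latencies for both the push and the resolve RPCs (the real numbers depend on the cluster):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		intents        = 1_000_000
		pushDelay      = 10 * time.Millisecond
		pushLatency    = 1 * time.Millisecond // assumed latency(push_txn_record)
		resolveLatency = 1 * time.Millisecond // assumed latency(resolve_intent)
	)

	// Today: every abandoned intent pays the delay, the push, and the resolve.
	before := time.Duration(intents) * (pushDelay + pushLatency + resolveLatency)
	// With the known-finalized cache: only the first intent pays the delay and push.
	after := pushDelay + pushLatency + time.Duration(intents)*resolveLatency

	fmt.Println("without cache:", before) // ~3h20m with these assumed latencies
	fmt.Println("with cache:   ", after)  // ~16m40s with these assumed latencies
}
```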

@andy-kimball
Contributor Author

How come it's proportional to #rows rather than #ranges? Is there some reason for not being able to batch?

@nvanbenschoten
Member

Because we eagerly push/resolve intents as we discover them (that part's not new) instead of aggregating a list of them to resolve all at once. I think the structure we're proposing here would allow that though, with some tricky interactions with the lockTable itself. We could hang a list of intent keys off of the known finalized cache we're discussing here so that we batch the resolution of subsequent intents once we've discovered that a transaction has abandoned intents on a range.
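Something like the following hypothetical sketch (the `finalizedTxnCache` type and its methods are illustrative only, not the real lockTable code): intents discovered for a transaction we already know is finalized get accumulated per transaction and then resolved with a single batched request.

```go
package main

import "fmt"

type txnID string

// finalizedTxnCache is a request-scoped cache of transactions known to be
// finalized, plus the intent keys discovered for each that still need
// resolution.
type finalizedTxnCache struct {
	pending map[txnID][]string
}

func (c *finalizedTxnCache) add(id txnID, key string) {
	if c.pending == nil {
		c.pending = make(map[txnID][]string)
	}
	c.pending[id] = append(c.pending[id], key)
}

// flush issues one batched ResolveIntent-style request per finalized txn,
// instead of one RPC per intent.
func (c *finalizedTxnCache) flush() {
	for id, keys := range c.pending {
		fmt.Printf("resolving %d intents of %s in a single batch\n", len(keys), id)
	}
	c.pending = nil
}

func main() {
	var c finalizedTxnCache
	for i := 0; i < 10000; i++ {
		c.add("abandoned-txn", fmt.Sprintf("/Table/53/1/%d", i))
	}
	c.flush() // resolving 10000 intents of abandoned-txn in a single batch
}
```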

@andy-kimball
Contributor Author

andy-kimball commented May 13, 2020

BTW, I'm still not seeing my DELETE transaction complete even after waiting 15 minutes. So not sure why it completed in only 5 minutes for you. I'll let it run for an hour or so to see if it ever completes.

Based on your formula above, and assuming 0 latency for push/resolve, shouldn't the time be 1M * 10ms = 2.8 hours?

@nvanbenschoten
Member

Based on your formula above, and assuming 0 latency for push/resolve, shouldn't the time be 1M * 10ms = 2.8 hours?

If the transaction coordinator actually died, yes. But it's still around cleaning up some of these intents, right?

@andy-kimball
Contributor Author

It's just 1-node CRDB on my laptop. Nothing died that I know of. :)

@nvanbenschoten
Member

Also, it might not have inserted all 1M rows, right? Actually, it looks like it inserts 100k rows before it fails.

It's worth mentioning that this second transaction takes 51s before it hits the FK violation, while it's inserting as fast as it can, so we're not going to be able to do all that much better on the cleanup path.

@nvanbenschoten
Member

nvanbenschoten commented May 13, 2020

This is actually doing something different than I thought it was. insert into wallet (name, creation_date, balance) select 'foo', now(), 1.0 from generate_series(1,100000); doesn't insert sequential IDs, because the example is using the unique_rowid() form of SERIAL (https://www.cockroachlabs.com/docs/stable/serial.html#modes-of-operation), so it must be hitting the FK violation in the first batch of 20,000 rows. We may be abandoning fewer intents than we think we are.

I have a simplified version of this issue and some results from prototyping:

  1. a known transaction cache, as discussed in #48790 (comment)
  2. a known transaction cache + intent resolution batching, as discussed in #48790 (comment)

I'll post them shortly.

@nvanbenschoten
Member

They just finished. This looks pretty good:

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 2m48.71064089s


SET CLUSTER SETTING kv.lock_table.track_known_txns = true;
CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys2;

  k
-----
(0 rows)

Time: 1m5.023374379s


SET CLUSTER SETTING kv.lock_table.batch_known_txns = true;
CREATE TABLE keys3 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys3 SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys3;

  k
-----
(0 rows)

Time: 5.19412057s

@nvanbenschoten
Member

In the test, we abandon 10,000 intents by rolling back a transaction with this patch:

diff --git a/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go b/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go
index 0d2ad741cb..2c9bdde560 100644
--- a/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go
+++ b/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go
@@ -278,6 +278,9 @@ func (tp *txnPipeliner) attachLocksToEndTxn(
        if len(et.InFlightWrites) > 0 {
                return ba, roachpb.NewErrorf("client must not pass in-flight writes to EndTxn")
        }
+       if !et.Commit {
+               return ba, nil
+       }

        // Populate et.LockSpans and et.InFlightWrites.
        if !tp.lockFootprint.empty() {

We then scan the table and measure how long it takes us to clean up those 10,000 intents. In the first test, it takes 2m48s. Once we remember transactions that we've already found to be finalized, this recovery time drops to 1m5s. Once we also batch the resolution of intents from transactions that we already know are finalized, it drops to 5s.

The last part has a bit of a delicate interaction with the lockTable, because we need to virtually resolve those intents (remove them from the lockTable before they are removed from MVCC) to allow the lockTableWaiter iteration to proceed while we build up the batch. The existing interface lets us do this without much hassle, though, because we already support the lockTable maintaining only a partial view of the MVCC state.
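Here's a toy Go illustration of that ordering, with made-up lockTable/mvccStore types: the in-memory lock entries are dropped immediately so waiters can proceed, and the MVCC intents are only removed later when the deferred batch is flushed.

```go
package main

import "fmt"

// Made-up in-memory structures for illustration; not the real CockroachDB types.
type lockTable struct {
	locks map[string]string // key -> txn currently holding the lock
}

type mvccStore struct {
	intents map[string]string // key -> txn that wrote the intent
}

func main() {
	lt := &lockTable{locks: map[string]string{"a": "txn1", "b": "txn1"}}
	mvcc := &mvccStore{intents: map[string]string{"a": "txn1", "b": "txn1"}}

	// "Virtually resolve": drop the lock table entries right away so that
	// waiting requests can proceed, but defer the real MVCC resolution.
	var deferred []string
	for key := range lt.locks {
		deferred = append(deferred, key)
	}
	for _, key := range deferred {
		delete(lt.locks, key)
	}
	fmt.Println("locks left in lock table:", len(lt.locks))  // 0
	fmt.Println("intents still in MVCC:", len(mvcc.intents)) // 2, until the batch flushes

	// Later, one batched ResolveIntent-style request cleans up MVCC.
	for _, key := range deferred {
		delete(mvcc.intents, key)
	}
	fmt.Println("intents after batch flush:", len(mvcc.intents)) // 0
}
```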

@nvanbenschoten
Member

I think this might actually be a regression in v20.1. We used to perform this kind of batching in certain cases before the lockTable rework. v19.2 (also hacked to abandon intents) shows this taking about 5.7s.

@andy-kimball
Contributor Author

Interesting results. How disruptive are the fixes? Are they something we could backport to 20.1?

@nvanbenschoten
Member

Yes, I think we can backport the fix. It's surprisingly small.

@andy-kimball
Contributor Author

Nice, I can definitely see customers hitting this "in the wild".

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 19, 2020
Fixes cockroachdb#48790.
Informs cockroachdb#36876.
Closes cockroachdb#31664.

This commit adds a per-Range LRU cache of transactions that are known to
be aborted or committed. We use this cache in the lockTableWaiter for
two purposes:
1. when we see a lock held by a known-finalized txn, we neither wait out
   the kv.lock_table.coordinator_liveness_push_delay (10 ms) nor push the
   transaction's record (RPC to leaseholder of pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that
   it may have abandoned multiple intents, perhaps due to a failure of the
   transaction coordinator node, so we begin deferring intent resolution to
   enable batching.

Together, these two changes make us much more effective at cleaning up
after failed transactions that have abandoned a large number of intents.
The following example demonstrates this:
```sql
--- BEFORE

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 2m50.801304266s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 3m26.874571045s

--- AFTER

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 5.138220753s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 48.763541138s
```

Notice that we are still not as fast at cleaning up intents on the
insertion path as we are at doing so on the retrieval path. This is
because we only batch the resolution of intents observed by a single
request at a time. For the scanning case, a single ScanRequest notices
all 10,000 intents and cleans them all up together. For the insertion
case, each of the 10,000 PutRequests notices a single intent, and each
intent is cleaned up individually. So this case is only benefited by
the first part of this change (no liveness delay or txn record push)
and not the second part of this change (intent resolution batching).

For this reason, we still haven't solved all of cockroachdb#36876. To completely
address that, we'll need to defer propagation of WriteIntentError during
batch evaluation, like we do for WriteTooOldErrors. Or we can wait out
the future LockTable changes - once we remove all cases where an intent
is not "discovered", the changes here will effectively address cockroachdb#36876.

This was a partial regression in v20.1, so we'll want to backport this
to that release branch. This change is on the larger side, but I feel ok
about it because the mechanics aren't too tricky. I'll wait a week before
backporting just to see if anything falls out.

Release note (bug fix): Abandoned intents due to failed transaction
coordinators are now cleaned up much faster. This resolves a regression
in v20.1.0 compared to prior releases.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 23, 2020
craig bot pushed a commit that referenced this issue May 26, 2020
49218: kv/concurrency: avoid redundant txn pushes and batch intent resolution r=nvanbenschoten a=nvanbenschoten

Fixes #48790.
Informs #36876.
Closes #31664.


@irfansharif I'm adding you as a reviewer because there's not really anyone else on KV that knows this code, so we should change that.

49557: kvserver: remove migration to remove preemptive snapshots r=nvanbenschoten a=ajwerner

This migration ran in 20.1 to remove pre-emptive snapshots which may have
existed from before 19.2 was finalized. This migration is no longer relevant.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
craig bot closed this as completed in 67c6bdb May 26, 2020
jbowens pushed a commit to jbowens/cockroach that referenced this issue Jun 1, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 3, 2020