storage: replayed large transactions must serially resolve each intent #36876

Closed
nvanbenschoten opened this issue Apr 16, 2019 · 2 comments
Labels
A-kv-transactions (Relating to MVCC and the transactional model.) · C-performance (Perf of queries or internals. Solution not expected to change functional behavior.) · no-issue-activity · T-kv (KV Team) · X-stale

Comments

@nvanbenschoten
Member

nvanbenschoten commented Apr 16, 2019

Extracted from #18684 (comment).

The workload is running workload init tpcc, so it's bulk ingesting a large amount of data. The specific queries being run are INSERT INTO ... VALUES ... statements with 1000 rows each. What we see happen in the trace of this specific statement is that it divides its InitPut requests up across ranges and sends them serially, because we blow through the DistSender concurrency limit while bulk loading. We then see the first InitPut hit a conflicting intent. It checks the intent's txn, finds that it is aborted, resolves the intent, then writes. Harmless enough, although we do see that the whole process takes around 200ms. Then the next InitPut hits a conflicting intent. Same story. Then all 998 other writes in the statement hit intents from the exact same aborted transaction. We must be retrying the same statement after an aborted transaction that laid down intents for all 1000 keys but failed to clean them up!
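
A back-of-envelope sketch shows why this hurts (this is not CockroachDB code; perIntentCost, serialResolve, and batchedResolve are made-up names, and the 200ms figure simply echoes the per-intent cost observed in the trace):

```go
// Rough cost model of resolving abandoned intents serially vs. batched.
package main

import (
	"fmt"
	"time"
)

// perIntentCost approximates the ~200ms observed above for a single
// InitPut to check the conflicting txn's record and resolve its intent.
const perIntentCost = 200 * time.Millisecond

// serialResolve: each of the statement's writes discovers and resolves
// one intent on its own, paying the full cost every time.
func serialResolve(numIntents int) time.Duration {
	return time.Duration(numIntents) * perIntentCost
}

// batchedResolve: the same intents are deferred and resolved together,
// paying roughly one round of cost per batch.
func batchedResolve(numIntents, batchSize int) time.Duration {
	batches := (numIntents + batchSize - 1) / batchSize
	return time.Duration(batches) * perIntentCost
}

func main() {
	fmt.Println("serial: ", serialResolve(1000))       // 3m20s
	fmt.Println("batched:", batchedResolve(1000, 100)) // 2s
}
```

Under this model, the 1000-row statement spends minutes on cleanup when each write resolves its own intent, versus seconds if resolution were batched.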

Jira issue: CRDB-4478

nvanbenschoten added the C-performance and A-kv-transactions labels Apr 16, 2019
@nvanbenschoten
Member Author

An easy way to reproduce these symptoms is to disable async intent resolution on rollback (simulating a txn coordinator failure) with a patch like the following:

```diff
diff --git a/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go b/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go
index 0d2ad741cb..2c9bdde560 100644
--- a/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go
+++ b/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go
@@ -278,6 +278,9 @@ func (tp *txnPipeliner) attachLocksToEndTxn(
        if len(et.InFlightWrites) > 0 {
                return ba, roachpb.NewErrorf("client must not pass in-flight writes to EndTxn")
        }
+       if !et.Commit {
+               return ba, nil
+       }

        // Populate et.LockSpans and et.InFlightWrites.
        if !tp.lockFootprint.empty() {
```

Then run the following:

```sql
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys SELECT generate_series(1, 10000);

INSERT 10000

Time: 55.672161601s
```

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 19, 2020
Fixes cockroachdb#48790.
Informs cockroachdb#36876.
Closes cockroachdb#31664.

This commit adds a per-Range LRU cache of transactions that are known to
be aborted or committed. We use this cache in the lockTableWaiter for
two purposes:
1. when we see a lock held by a known-finalized txn, we neither wait out
   the kv.lock_table.coordinator_liveness_push_delay (10 ms) nor push the
   transaction's record (an RPC to the leaseholder of the pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that
   it may have abandoned multiple intents, perhaps due to a failure of the
   transaction coordinator node, so we begin deferring intent resolution to
   enable batching (see the sketch after this list).

Together, these two changes make us much more effective at cleaning up
after failed transactions that have abandoned a large number of intents.
The following example demonstrates this:
```sql
--- BEFORE

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 2m50.801304266s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 3m26.874571045s

--- AFTER

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 5.138220753s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 48.763541138s
```

Notice that we are still not as fast at cleaning up intents on the
insertion path as we are at doing so on the retrieval path. This is
because we only batch the resolution of intents observed by a single
request at a time. For the scanning case, a single ScanRequest notices
all 10,000 intents and cleans them all up together. For the insertion
case, each of the 10,000 PutRequests notices a single intent, and each
intent is cleaned up individually. So this case benefits only from
the first part of this change (no liveness delay or txn record push)
and not from the second part (intent resolution batching).

For this reason, we still haven't solved all of cockroachdb#36876. To completely
address that, we'll need to defer propagation of WriteIntentError during
batch evaluation, like we do for WriteTooOldErrors. Or we can wait for
the future LockTable changes: once we remove all cases where an intent
is not "discovered", the changes here will effectively address cockroachdb#36876.
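
To make the deferral idea concrete, here is a conceptual sketch; Request, Result, Intent, and evaluateOne are hypothetical stand-ins for the real evaluation types, not CockroachDB's actual API:

```go
// Conceptual sketch of deferring intent conflicts during batch
// evaluation, mirroring how WriteTooOldErrors are deferred. The types
// below are hypothetical stand-ins, not CockroachDB's actual ones.
package sketch

type Request struct{ Key string }
type Result struct{ Key string }
type Intent struct{ Key string }

// evaluateOne is a placeholder for single-request evaluation; it returns
// a non-nil Intent when the request conflicts with an abandoned intent.
func evaluateOne(req Request) (Result, *Intent) {
	// ... a real implementation would consult the MVCC layer here ...
	return Result{Key: req.Key}, nil
}

// evaluateBatch keeps evaluating past conflicts instead of propagating
// an error for the first one, so all conflicting intents can be
// resolved together in a single batch afterwards.
func evaluateBatch(reqs []Request) ([]Result, []Intent) {
	var deferred []Intent
	results := make([]Result, 0, len(reqs))
	for _, req := range reqs {
		res, conflict := evaluateOne(req)
		if conflict != nil {
			deferred = append(deferred, *conflict)
			continue
		}
		results = append(results, res)
	}
	return results, deferred
}
```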

This was a partial regression in v20.1, so we'll want to backport this
to that release branch. This change is on the larger side, but I feel ok
about it because the mechanics aren't too tricky. I'll wait a week before
backporting just to see if anything falls out.

Release note (bug fix): Abandoned intents due to failed transaction
coordinators are now cleaned up much faster. This resolves a regression
in v20.1.0 compared to prior releases.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 23, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 26, 2020
craig bot pushed a commit that referenced this issue May 26, 2020
49218: kv/concurrency: avoid redundant txn pushes and batch intent resolution r=nvanbenschoten a=nvanbenschoten

@irfansharif I'm adding you as a reviewer because there's not really anyone else on KV who knows this code, so we should change that.

49557: kvserver: remove migration to remove preemptive snapshots r=nvanbenschoten a=ajwerner

This migration ran in 20.1 to remove preemptive snapshots that may have
existed from before 19.2 was finalized. It is no longer relevant.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
jbowens pushed a commit to jbowens/cockroach that referenced this issue Jun 1, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 3, 2020
jlinder added the T-kv (KV Team) label Jun 16, 2021
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

github-actions bot closed this as not planned Oct 2, 2023
github-project-automation bot moved this to Closed in KV Aug 28, 2024