
kv: intents from transactions that have been successfully PUSH_TIMESTAMP-ed is O(num intents) #103126

Closed
arulajmani opened this issue May 11, 2023 · 1 comment · Fixed by #103265 or #104784
Labels
- A-kv-transactions: Relating to MVCC and the transactional model.
- C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
- O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
- T-kv: KV Team.

Comments

arulajmani (Collaborator) commented May 11, 2023

Describe the problem

Higher-priority readers in CRDB are able to push lower-priority writers above their timestamp, thus allowing readers to proceed without conflicting with the writer. However, if the writer has written intents at the lower timestamp, the reader needs to resolve them (i.e., move them to the higher timestamp) before it can proceed with its scan.

Today, this process is O(num intents) -- the reader pushes the writer every time it discovers a conflicting intent and resolves that intent before proceeding to the next one. This happens here:

```go
return w.ir.ResolveIntent(ctx, resolve, opts)
```
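
To make the per-intent cost concrete, here is a minimal sketch of that loop (all names and types are hypothetical and heavily simplified; this is not the actual lockTableWaiter/IntentResolver code): every conflicting intent the reader discovers costs one push and one single-key resolution round trip.

```go
package main

import "fmt"

// Hypothetical, simplified types -- not the actual cockroachdb APIs.
type intent struct{ key string }

// pushWriter stands in for a PushTxn RPC that pushes the writer's timestamp
// above the reader's.
func pushWriter(key string) { fmt.Println("PushTxn RPC for intent at", key) }

// resolveIntent stands in for a single-key ResolveIntent request.
func resolveIntent(in intent) { fmt.Println("ResolveIntent RPC for", in.key) }

// scan models today's behavior: every conflicting intent discovered by the
// reader costs one push and one single-key resolution before the scan can
// continue, i.e. O(num intents) round trips.
func scan(intents []intent) {
	for _, in := range intents {
		pushWriter(in.key)
		resolveIntent(in)
	}
}

func main() {
	scan([]intent{{"k1"}, {"k2"}, {"k3"}})
}
```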

To Reproduce

```sql
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 1000);

-- session 2
BEGIN PRIORITY HIGH; SELECT count(*) FROM keys;

-- takes ~7ms per intent
```

Proposed solution

Prior to #49218, this problem existed for finalized (committed, but more notably, aborted) transactions as well. That patch introduced the `finalizedTxnCache`, to which finalized transactions are added here:

```go
// If the transaction is finalized, add it to the finalizedTxnCache. This
// avoids needing to push it again if we find another one of its locks and
// allows for batching of intent resolution.
if pusheeTxn.Status.IsFinalized() {
	w.lt.TransactionIsFinalized(pusheeTxn)
}
```

This ensures that a subsequent re-scan of the lock table (post intent resolution) does not have to push the same transaction again to recognize that it is finalized. Instead, it can just collect intents from the finalized transaction and batch resolve them in one go. This means that intent resolution is O(num ranges) for finalized transactions instead of O(num intents)[*].
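
A rough sketch of that pattern, assuming a single non-locking reader and no async intent resolution (names and types here are illustrative, not the actual lock table code): once a transaction is known to be finalized, later conflicts with its intents skip the push and are queued for one batched resolution.

```go
package main

import "fmt"

// Hypothetical, simplified model of the deferred-resolution pattern from #49218;
// not the actual lock table code.
type txnID string

type finalizedTxnCache map[txnID]bool

type intent struct {
	txn txnID
	key string
}

// rescan collects intents written by known-finalized transactions instead of
// pushing their writer again, then resolves them in a single batch.
func rescan(cache finalizedTxnCache, intents []intent) {
	var deferred []intent
	for _, in := range intents {
		if cache[in.txn] {
			deferred = append(deferred, in) // cache hit: no push needed
			continue
		}
		fmt.Println("PushTxn RPC for", in.txn) // unknown txn: push it once...
		cache[in.txn] = true                   // ...and cache the finalized status
	}
	fmt.Printf("batched ResolveIntent for %d intents\n", len(deferred))
}

func main() {
	rescan(finalizedTxnCache{}, []intent{{"a", "k1"}, {"a", "k2"}, {"a", "k3"}})
}
```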

We should extend this concept to transactions that are known to have been pushed to a higher timestamp as well. This would allow high-priority readers to collect and batch resolve intents in a similar fashion.

[*] Assuming no async intent resolution and that the reader's read set includes all intents written by the writer.
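
Sketched out, the proposed extension might look something like the following; everything here is illustrative (e.g. `pushedTxnCache` is a made-up name, not the eventual implementation). The key difference from the finalized case is that the cache stores the timestamp a still-PENDING transaction was pushed to, and a reader only defers resolution when that timestamp exceeds its own read timestamp.

```go
package main

import "fmt"

// Hypothetical sketch of extending the cache to pushed, still-pending
// transactions; names and types are made up, not the eventual implementation.
type timestamp int64

// pushedTxnCache remembers the timestamp each pending writer is known to have
// been pushed to.
type pushedTxnCache map[string]timestamp

// canDeferResolution reports whether a non-locking reader at readTS can skip
// pushing this writer again and instead queue the conflicting intent for a
// batched resolution: the writer is already pushed above the reader.
func canDeferResolution(cache pushedTxnCache, txn string, readTS timestamp) bool {
	pushedTo, ok := cache[txn]
	return ok && pushedTo > readTS
}

func main() {
	cache := pushedTxnCache{"writer-1": 105}
	fmt.Println(canDeferResolution(cache, "writer-1", 100)) // true: collect and batch
	fmt.Println(canDeferResolution(cache, "writer-2", 100)) // false: must push first
}
```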

Additional context

Notably, this impacts backups, which eventually run high-priority ExportRequests. Backups run at high priority so that they aren't starved by concurrent writers; however, if a writer is laying down a large enough number of intents, the backup can indeed be starved.

cc @nvanbenschoten @adityamaru

Jira issue: CRDB-27844

arulajmani added the C-bug, A-kv-transactions, and T-kv labels May 11, 2023
irfansharif added the O-support label May 11, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 14, 2023
kv: resolve conflicting intents immediately for latency-sensitive requests

Fixes cockroachdb#50390.
Related to cockroachdb#60585.
Related to cockroachdb#103126.

Closes cockroachdb#64500, which was an earlier attempt to solve this issue using a similar
approach. This commit addresses the comments on that PR, which focused on the
pagination of intent resolution when bypassing the request batcher. We still
don't try to solve this issue, and instead limit the cases where intent
resolution bypasses the request batcher to those where pagination is not
necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver`
`ResolveOptions`, which instructs the `IntentResolver` to send the provided
intent resolution requests immediately, instead of adding them to a batch and
waiting up to 10ms (defaultIntentResolutionBatchWait) for that batch to fill up
with other intent resolution requests. This can be used to avoid any
batching-induced latency and should be used only by foreground traffic that is
willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated
by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are
performed by background tasks that are more than happy to trade increased
latency for increased throughput, the `lockTableWaiter` calls into the
`IntentResolver` on behalf of a foreground operation. It wants intent resolution
to complete as soon as possible, so it is willing to trade reduced throughput
for reduced latency.
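
As a rough illustration of this trade-off (simplified, hypothetical shapes; the real `ResolveOptions` and `IntentResolver` signatures differ), the flag only changes whether a resolution request is handed to the batcher or sent right away:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified sketch of the trade-off described above. The real ResolveOptions
// and IntentResolver in cockroachdb have different, richer signatures; only
// the batching vs. send-immediately distinction is illustrated here.
type ResolveOptions struct {
	sendImmediately bool // bypass the request batcher entirely
}

func resolveIntent(key string, opts ResolveOptions) {
	if opts.sendImmediately {
		fmt.Println("issue ResolveIntent for", key, "right away")
		return
	}
	// Background callers are happy to wait for a fuller batch.
	fmt.Println("enqueue", key, "and wait up to", 10*time.Millisecond, "for more work")
}

func main() {
	resolveIntent("k1", ResolveOptions{sendImmediately: true})  // foreground: lockTableWaiter
	resolveIntent("k2", ResolveOptions{sendImmediately: false}) // background: async cleanup
}
```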

I tested this out by writing 10,000 different intents in a normal-priority transaction
and then scanning over the table using a high-priority transaction. The test was
performed on a 3-node roachprod cluster to demonstrate the effect with realistic
network and disk latencies.
```sql
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH; SELECT count(*) FROM keys;
```

Without this change, the scan took 70.078s. With this change, the scan took
15.958s. This 78% speed up checks out. Each encountered intent is resolved
serially (see cockroachdb#103126), so the per-intent latency drops from 7ms to 1.6ms. This
improvement by about 5ms agrees with the `defaultIntentResolutionBatchIdle`,
which delays each resolution request that passes through a RequestBatcher. With
this change, these resolve intent requests are issued immediately and this delay
is not experienced.

While this is a significant improvement and will be important for Read Committed
transactions after cockroachdb#102014, this is still not quite enough to resolve cockroachdb#103126.
For that, we need to batch the resolve intent requests themselves using a form
of deferred intent resolution like we added in cockroachdb#49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents
from many different transactions. This scenario bypasses the deferred intent
resolution path added in cockroachdb#49218, because the intents are each written by
different transactions. The speedup for this scenario was presented in cockroachdb#64500.

Release note (performance improvement): SQL statements that must clean up
intents from many different previously abandoned transactions now do so
moderately more efficiently.
craig bot pushed a commit that referenced this issue May 18, 2023
103265: kv: resolve conflicting intents immediately for latency-sensitive requests r=nvanbenschoten a=nvanbenschoten

Fixes #50390.
Related to #60585.
Related to #103126.

Closes #64500, which was an earlier attempt to solve this issue using a similar approach. This commit addresses the comments on that PR, which focused on the pagination of intent resolution when bypassing the request batcher. We still don't try to solve this issue, and instead limit the cases where intent resolution bypasses the request batcher to those where pagination is not necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver` `ResolveOptions`, which instructs the `IntentResolver` to send the provided intent resolution requests immediately, instead of adding them to a batch and waiting up to 10ms (`defaultIntentResolutionBatchWait`) for that batch to fill up with other intent resolution requests. This can be used to avoid any batching-induced latency and should be used only by foreground traffic that is willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are performed by background tasks that are more than happy to trade increased latency for increased throughput, the `lockTableWaiter` calls into the `IntentResolver` on behalf of a foreground operation. It wants intent resolution to complete as soon as possible, so it is willing to trade reduced throughput for reduced latency.

I tested this out by writing 10,000 different intents in a normal-priority transaction and then scanning over the table using a high-priority transaction. The test was performed on a 3-node roachprod cluster to demonstrate the effect with realistic network and disk latencies.
```sql
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH; SELECT count(*) FROM keys;
```

Without this change, the scan took **70.078s**. With this change, the scan took **15.958s**. This **78%** speed-up checks out. Each encountered intent is resolved serially (see #103126), so the **per-intent latency** drops from **7ms** to **1.6ms.** This improvement by about 5ms agrees with the `defaultIntentResolutionBatchIdle`, which delays each resolution request that passes through a RequestBatcher. With this change, these resolve intent requests are issued immediately and this delay is not experienced.

While this is a significant improvement and will be important for Read Committed transactions after #102014, this is still not quite enough to resolve #103126. For that, we need to batch the resolve intent requests themselves using a form of deferred intent resolution like we added in #49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents from many different transactions. This scenario bypasses the deferred intent resolution path added in #49218, because the intents are each written by different transactions. The speedup for this scenario was presented in #64500.

Release note (performance improvement): SQL statements that must clean up intents from many different previously abandoned transactions now do so moderately more efficiently.


103362: sql: validate primary / secondary region localities at end of txn r=fqazi a=fqazi

Previously, if a database was restored with skip_localities, there was no way to modify this database to set the primary region, since validation would kick in too early during the statement. This meant fixing the regions in a restored database was impossible if the primary region was no longer valid. To address this, this patch delays locality validation until the end of the transaction.

Fixes: #103290

Release note (bug fix): SET PRIMARY REGION and SET SECONDARY REGION did not validate transactionally, which could prevent cleaning up removed regions after a restore.

103373: concurrency: small refactors in preparation for reservation removal r=nvanbenschoten a=arulajmani

See individual commits for details. 

Informs: #103361

103538: go.mod: bump Pebble to 6f2788660198, rework shared storage wrapper r=RaduBerinde a=RaduBerinde

6f278866 shared: improve interface for more efficient reading
9eb2c407 db: log events to testing.T in unit tests
f32e7dc6 db: add reserved Pebblev4 sstable format
5a6b91b8 objstorage: improve test and add read ahead test
2bc4319e objstorage: remove genericFileReadable
8143ffb9 objstorage: fix readaheadState initialization
06d08888 db: add Compact.Duration metric
e7213de0 db: add Uptime metric
e9005aed db: don't delete files during ingest application
222b43ec internal/arenaskl: fix Skiplist doc typo

Release note: None
Epic: none

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@craig craig bot closed this as completed in 3fc29ad May 18, 2023
nvanbenschoten (Member) commented:

> this is still not quite enough to resolve #103126

Thanks GitHub.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 13, 2023
Fixes cockroachdb#103126.

This commit extends the infrastructure introduced in cockroachdb#49218 for transaction
timestamp pushes. It avoids redundant txn pushes of PENDING transactions and
batches the resolution of PENDING intents. This breaks the O(num_intents) work
performed by high-priority scans (e.g. backups) over intent-heavy keyspaces into
something closer to O(num_ranges) work.

The commit accomplishes its goals by adding a second per-Range LRU cache of
transactions that are PENDING and are known to have been pushed to higher
timestamps. We use this cache for two purposes:

1. when we are a non-locking read and we see a lock at a conflicting timestamp
   that is held by a pushed txn above our read timestamp, we neither wait out the
   kv.lock_table.coordinator_liveness_push_delay (50 ms) nor push the
   transaction's record (RPC to leaseholder of pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that
   it may have written multiple intents, so we begin deferring intent resolution
   to enable batching.

Together, these two changes make us much more effective at pushing transactions
with a large number of intents. The following example (from cockroachdb#103126) demonstrates
this:
```sql
-- SETUP: run in a 3-node GCP roachprod cluster

--- session 1 - write 100k intents
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 100000);

--- session 2 - push intents with high-priority txn without uncertainty interval
BEGIN PRIORITY HIGH AS OF SYSTEM TIME '-1ms';
SELECT count(*) FROM keys;

--- BEFORE this PR and before cockroachdb#103265 (i.e. v23.1.2): takes ~7.1ms per intent
Time: 714.441s total

--- BEFORE this PR: takes ~1.5ms per intent
Time: 151.880s total

--- AFTER this PR: takes ~24μs per intent
Time: 2.405s
```

The change does have an unfortunate limitation. Deferred intent resolution
is only currently enabled for non-locking readers without uncertainty
intervals. Readers with uncertainty intervals must contend with the
possibility of pushing a conflicting intent up into their uncertainty
interval and causing more work for themselves, which is avoided with care
by the lockTableWaiter but difficult to coordinate through the
txnStatusCache. This limitation is acceptable because the most important
case here is optimizing the Export requests issued by backup.

This limitation also hints at the long-term plan for this interaction,
which is that non-locking readers can ignore known pending intents without
the need to even resolve those intents (see cockroachdb#94730). This will require a
request-scoped cache of pending, pushed transactions, which does not have
the same problems with uncertainty intervals.

Release note (performance improvement): Backups no longer perform work
proportional to the number of pending intents that they encounter, so they are
over 100x faster when encountering long-running, bulk writing transactions.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 19, 2023
This commit adds a new `kv.lock_table.batch_pushed_lock_resolution.enabled` cluster
setting which controls whether the lock table should allow non-locking readers
to defer and batch the resolution of conflicting locks whose holder is known to
be pending and have been pushed above the reader's timestamp.

This is a safeguard against bugs or behavior changes as we quickly backport a
fix for cockroachdb#103126.

Epic: None
Release note: None
craig bot pushed a commit that referenced this issue Jun 20, 2023
104784: kv/concurrency: batch intent resolution of pushed intents from same txn r=arulajmani a=nvanbenschoten

Fixes #103126.

This commit extends the infrastructure introduced in #49218 for transaction timestamp pushes. It avoids redundant txn pushes of PENDING transactions and batches the resolution of PENDING intents. This breaks the O(num_intents) work performed by high-priority scans (e.g. backups) over intent-heavy keyspaces into something closer to O(num_ranges) work.

The commit accomplishes its goals by adding a second per-Range LRU cache of transactions that are PENDING and are known to have been pushed to higher timestamps. We use this cache for two purposes:

1. when we are a non-locking read and we see a lock at a conflicting timestamp that is held by a pushed txn above our read timestamp, we neither wait out the kv.lock_table.coordinator_liveness_push_delay (50 ms) nor push the transaction's record (RPC to leaseholder of pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that it may have written multiple intents, so we begin deferring intent resolution to enable batching.

Together, these two changes make us much more effective at pushing transactions with a large number of intents. The following example (from #103126) demonstrates this:
```sql
-- SETUP: run in a 3-node GCP roachprod cluster

--- session 1 - write 100k intents
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 100000);

--- session 2 - push intents with high-priority txn without uncertainty interval
BEGIN PRIORITY HIGH AS OF SYSTEM TIME '-1ms';
SELECT count(*) FROM keys;

--- BEFORE this PR and before #103265 (i.e. v23.1.2): takes ~7.1ms per intent
Time: 714.441s total

--- BEFORE this PR: takes ~1.5ms per intent
Time: 151.880s total

--- AFTER this PR: takes ~24μs per intent
Time: 2.405s
```

The change does have an unfortunate limitation. Deferred intent resolution is only currently enabled for non-locking readers without uncertainty intervals. Readers with uncertainty intervals must contend with the possibility of pushing a conflicting intent up into their uncertainty interval and causing more work for themselves, which is avoided with care by the lockTableWaiter but difficult to coordinate through the txnStatusCache. This limitation is acceptable because the most important case here is optimizing the Export requests issued by backup.

This limitation also hints at the long-term plan for this interaction, which is that non-locking readers can ignore known pending intents without the need to even resolve those intents (see #94730). This will require a request-scoped cache of pending, pushed transactions, which does not have the same problems with uncertainty intervals.

Release note (performance improvement): Backups no longer perform work proportional to the number of pending intents that they encounter, so they are over 100x faster when encountering long-running, bulk writing transactions.

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in 23ee0a9 Jun 20, 2023