Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: batch evaluation should operate on a consistent storage snapshot #55461

Closed
nvanbenschoten opened this issue Oct 12, 2020 · 1 comment · Fixed by #76312
Closed

kv: batch evaluation should operate on a consistent storage snapshot #55461

nvanbenschoten opened this issue Oct 12, 2020 · 1 comment · Fixed by #76312
Assignees
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Oct 12, 2020

Neither Engine.NewReadOnly nor Engine.NewBatch actually grabs a stable storage snapshot. Instead, they wait until iterator creation to do so, and then cache the iterator across multiple uses. This leads to two issues.

First, it is unclear when the storage snapshot is captured, making it difficult to coordinate with other state changes without more aggressive serialization using latches or the readOnlyCmdMu. Ideally, users would be able to grab a read-only/batch, check a few conditions, and then decide whether evaluation should proceed using the storage snapshot. This comes up in the context of replica destruction and in the context of MVCC GC.

Second, both of these objects cache multiple types of iterators, including a prefix iterator and a non-prefix iterator. This means that neither reader actually provides a stable storage snapshot at any point. Users have to be ready for inconsistencies to arise between different iterators pulled from the same read-only/batch. This leads to bugs like #47219, where state read from one iterator may not agree with state read from another.

Related Slack thread: https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1602087652111000.

Jira issue: CRDB-3659

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-transactions Relating to MVCC and the transactional model. labels Oct 12, 2020
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jan 6, 2021
For pebbleBatch and pebbleReadOnly, all iterators without timestamp
hints see the same underlying engine state, with no interface
change for the caller. This is done by cloning the first created
pebble.Iterator.

intentInterleavingIter explicitly requests a clone, when it creates
two iterators, which ensures the consistency guarantee applies to
all Reader implementations.

Informs cockroachdb#55461
Informs cockroachdb#41720

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jan 7, 2021
For pebbleBatch and pebbleReadOnly, all iterators without timestamp
hints see the same underlying engine state, with no interface
change for the caller. This is done by cloning the first created
pebble.Iterator.

intentInterleavingIter explicitly requests a clone, when it creates
two iterators, which ensures the consistency guarantee applies to
all Reader implementations.

Informs cockroachdb#55461
Informs cockroachdb#41720

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jan 8, 2021
For pebbleBatch and pebbleReadOnly, all iterators without timestamp
hints see the same underlying engine state, with no interface
change for the caller. This is done by cloning the first created
pebble.Iterator.

intentInterleavingIter explicitly requests a clone, when it creates
two iterators, which ensures the consistency guarantee applies to
all Reader implementations.

Informs cockroachdb#55461
Informs cockroachdb#41720

Release note: None
craig bot pushed a commit that referenced this issue Jan 10, 2021
58515: storage: use consistent iterators when needed, and when possible r=sumeerbhola a=sumeerbhola

For pebbleBatch and pebbleReadOnly, all iterators without timestamp
hints see the same underlying engine state, with no interface
change for the caller. This is done by cloning the first created
pebble.Iterator.

intentInterleavingIter explicitly requests a clone, when it creates
two iterators, which ensures the consistency guarantee applies to
all Reader implementations.

Informs #55461
Informs #41720

Release note: None

58623: sql: implement SQL syntax for REGIONAL BY ROW ON col r=ajstorm a=otan

Implement the syntax and aliases for `REGIONAL BY ROW ON col`, which
return an unimplemented error as they are not yet implemented.

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
@andreimatei
Copy link
Contributor

Note to self: evaluation now does use a snapshot, but the snapshot is captured lazily. This issue is asking for the snapshot to be captured eagerly so, for example, if we were to drop latches earlier, we'd be guaranteed to have captured a snapshot before that point.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 24, 2021
This is relevant for read evaluation cases where we want to release
latches before evaluation. The new Reader.PinEngineStateForIterators
method would be called before releasing latches.

This pinning does not apply to iterators with timestamp hints,
similar to how ConsistentIterators does not apply to them. So
this does not help ExportRequest evaluation for incremental backups.
To address this we would want a guarantee from the caller that the
timestamp hints will not change for the lifetime of the Reader that
is making a promise of ConsistentIterators.

Informs cockroachdb#55461,cockroachdb#66485

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 25, 2021
This is relevant for read evaluation cases where we want to release
latches before evaluation. The new Reader.PinEngineStateForIterators
method would be called before releasing latches.

This pinning does not apply to iterators with timestamp hints,
similar to how ConsistentIterators does not apply to them. So
this does not help ExportRequest evaluation for incremental backups.
To address this we would want a guarantee from the caller that the
timestamp hints will not change for the lifetime of the Reader that
is making a promise of ConsistentIterators.

Informs cockroachdb#55461,cockroachdb#66485

Release note: None
craig bot pushed a commit that referenced this issue Jun 28, 2021
66845: storage: add Reader method to pin iterators at current engine state r=sumeerbhola a=sumeerbhola

This is relevant for read evaluation cases where we want to release
latches before evaluation. The new Reader.PinEngineStateForIterators
method would be called before releasing latches.

This pinning does not apply to iterators with timestamp hints,
similar to how ConsistentIterators does not apply to them. So
this does not help ExportRequest evaluation for incremental backups.
To address this we would want a guarantee from the caller that the
timestamp hints will not change for the lifetime of the Reader that
is making a promise of ConsistentIterators.

Informs #55461,#66485

Release note: None

66885: sql: add ReType to resolveAndRequireType to fix vector engine panic r=cucaroach a=cucaroach

Fixes #66708

The vector engine needs exact type coercion, specifically when piping
computed column values into downstream operators.  Without this fix
the computed column was left as an int64 instead of cast back to the
required int16 type.

Exposed by sqlsmith, kudos to @michae2 for the reduce help

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 30, 2021
This was originally introduced to work around limitations in the
`storage.Reader` interface, where only a `RocksDB` instance could
be passed to `engine.ExportToSst`. Since then, a lot has changed,
and this is no longer needed.

Removing this is important, as it appears to undermine cockroachdb#55461 and
make cockroachdb#66485 difficult.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 1, 2021
This change removes the Engine method from EvalContext. Removing this is
important, as its existence appears to undermine cockroachdb#55461 and make cockroachdb#66485
difficult.

The first place where this was used was in EndTxn's evaluation function.
I don't see any reason for this. In fact, we had a TODO to fix this, which
we could have addressed years ago.

The second place where this was used was in RecomputeStats's evaluation
function. There, it was used for two reasons. First, it was used because
`storage.Batch` used to not provide a consistent view of data. They now
do. It was also used to evade spanset assertions, which this commit
addresses in a better way.
craig bot pushed a commit that referenced this issue Jul 1, 2021
66917: kvcoord: assert sanity when tracking in-flight write r=andreimatei a=andreimatei

Release note: None

67093: kv: remove spanset.GetDBEngine r=nvanbenschoten a=nvanbenschoten

This was originally introduced to work around limitations in the
`storage.Reader` interface, where only a `RocksDB` instance could
be passed to `engine.ExportToSst`. Since then, a lot has changed,
and this is no longer needed.

Removing this is important, as it appears to undermine #55461 and
make #66485 difficult.

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot pushed a commit that referenced this issue Jul 6, 2021
67121: kv: remove EvalContext.Engine r=nvanbenschoten a=nvanbenschoten

This change removes the Engine method from EvalContext. Removing this is
important, as its existence appears to undermine #55461 and make #66485
difficult.

The first place where this was used was in EndTxn's evaluation function.
I don't see any reason for this. In fact, we had a TODO to fix this, which
we could have addressed years ago.

The second place where this was used was in RecomputeStats's evaluation
function. There, it was used for two reasons. First, it was used because
`storage.Batch` used to not provide a consistent view of data. They now
do. It was also used to evade spanset assertions, which this commit
addresses in a better way.

67146: c-deps: remove protobuf r=dt a=dt

Release note: none.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: David Taylor <[email protected]>
@blathers-crl blathers-crl bot added the T-kv KV Team label Dec 1, 2021
@aayushshah15 aayushshah15 self-assigned this Jan 18, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Feb 9, 2022
…tion

This commit makes it such that we eagerly pin the engine state of the `Reader`
created during the evaluation of read-only requests.

Generally, reads will hold latches throughout the course of their evaluation
(particularly, while they do their `MVCCScan`). Mainly, this commit paves the
way for us to move to a world where we avoid holding latches during the
MVCCScan. Additionally it also lets us make MVCC garbage collection latchless
as described in cockroachdb#55293.

There are a few notable changes in this patch:

1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the
closed timestamp of the `Replica` is now captured at the time the `EvalContext`
is created, and not during the command evaluation of
`QueryResolvedTimestampRequest`.

2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The
`ImmutableEvalContext` is supposed to encapsulate state that must not change
after the `EvalContext` is created. The closed timestamp of the replica is part
of the `ImmutableEvalContext`.

3. `Replica` no longer fully implements the `EvalContext` interface. Instead,
it implements everything but `GetClosedTimestamp()` (which is implemented by
`ImmutableEvalContext` instead).

Relates to cockroachdb#55293
Resolves cockroachdb#55461
Resolves cockroachdb#70974

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Feb 14, 2022
…tion

This commit makes it such that we eagerly pin the engine state of the `Reader`
created during the evaluation of read-only requests.

Generally, reads will hold latches throughout the course of their evaluation
(particularly, while they do their `MVCCScan`). Mainly, this commit paves the
way for us to move to a world where we avoid holding latches during the
MVCCScan. Additionally it also lets us make MVCC garbage collection latchless
as described in cockroachdb#55293.

There are a few notable changes in this patch:

1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the
closed timestamp of the `Replica` is now captured at the time the `EvalContext`
is created, and not during the command evaluation of
`QueryResolvedTimestampRequest`.

2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The
`ImmutableEvalContext` is supposed to encapsulate state that must not change
after the `EvalContext` is created. The closed timestamp of the replica is part
of the `ImmutableEvalContext`.

3. `Replica` no longer fully implements the `EvalContext` interface. Instead,
it implements everything but `GetClosedTimestamp()` (which is implemented by
`ImmutableEvalContext` instead).

Relates to cockroachdb#55293
Resolves cockroachdb#55461
Resolves cockroachdb#70974

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Mar 23, 2022
…tion

This commit makes it such that we eagerly pin the engine state of the `Reader`
created during the evaluation of read-only requests.

Today, reads will hold latches throughout the course of their evaluation
(particularly, while they do their `MVCCScan`). This commit paves the
way for us to move to a world where we avoid holding latches during the
MVCCScan. Additionally it also lets us make MVCC garbage collection latchless
as described in cockroachdb#55293.

There are a few notable changes in this patch:

1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the
closed timestamp of the `Replica` is now captured at the time the `EvalContext`
is created, and not during the command evaluation of
`QueryResolvedTimestampRequest`.

2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The
`ImmutableEvalContext` is supposed to encapsulate state that must not change
after the `EvalContext` is created. The closed timestamp of the replica is part
of the `ImmutableEvalContext`.

3. `Replica` no longer fully implements the `EvalContext` interface. Instead,
it implements everything but `GetClosedTimestamp()` (which is implemented by
`ImmutableEvalContext` instead).

Relates to cockroachdb#55293
Resolves cockroachdb#55461
Resolves cockroachdb#70974

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Mar 26, 2022
…tion

This commit makes it such that we eagerly pin the engine state of the `Reader`
created during the evaluation of read-only requests.

Today, reads will hold latches throughout the course of their evaluation
(particularly, while they do their `MVCCScan`). This commit paves the
way for us to move to a world where we avoid holding latches during the
MVCCScan. Additionally it also lets us make MVCC garbage collection latchless
as described in cockroachdb#55293.

There are a few notable changes in this patch:

1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the
closed timestamp of the `Replica` is now captured at the time the `EvalContext`
is created, and not during the command evaluation of
`QueryResolvedTimestampRequest`.

2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The
`ImmutableEvalContext` is supposed to encapsulate state that must not change
after the `EvalContext` is created. The closed timestamp of the replica is part
of the `ImmutableEvalContext`.

3. `Replica` no longer fully implements the `EvalContext` interface. Instead,
it implements everything but `GetClosedTimestamp()` (which is implemented by
`ImmutableEvalContext` instead).

Relates to cockroachdb#55293
Resolves cockroachdb#55461
Resolves cockroachdb#70974

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Mar 29, 2022
…tion

This commit makes it such that we eagerly pin the engine state of the `Reader`
created during the evaluation of read-only requests.

Today, reads will hold latches throughout the course of their evaluation
(particularly, while they do their `MVCCScan`). This commit paves the
way for us to move to a world where we avoid holding latches during the
MVCCScan. Additionally it also lets us make MVCC garbage collection latchless
as described in cockroachdb#55293.

There are a few notable changes in this patch:

1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the
closed timestamp of the `Replica` is now captured at the time the `EvalContext`
is created, and not during the command evaluation of
`QueryResolvedTimestampRequest`.

2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The
`ImmutableEvalContext` is supposed to encapsulate state that must not change
after the `EvalContext` is created. The closed timestamp of the replica is part
of the `ImmutableEvalContext`.

3. `Replica` no longer fully implements the `EvalContext` interface. Instead,
it implements everything but `GetClosedTimestamp()` (which is implemented by
`ImmutableEvalContext` instead).

Relates to cockroachdb#55293
Resolves cockroachdb#55461
Resolves cockroachdb#70974

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Mar 31, 2022
…tion

This commit makes it such that we eagerly pin the engine state of the `Reader`
created during the evaluation of read-only requests.

Today, reads will hold latches throughout the course of their evaluation
(particularly, while they do their `MVCCScan`). This commit paves the
way for us to move to a world where we avoid holding latches during the
MVCCScan. Additionally it also lets us make MVCC garbage collection latchless
as described in cockroachdb#55293.

There are a few notable changes in this patch:

1. Pinning the engine state eagerly runs into cockroachdb#70974. To resolve this, the
closed timestamp of the `Replica` is now captured at the time the `EvalContext`
is created, and not during the command evaluation of
`QueryResolvedTimestampRequest`.

2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The
`ImmutableEvalContext` is supposed to encapsulate state that must not change
after the `EvalContext` is created. The closed timestamp of the replica is part
of the `ImmutableEvalContext`.

3. `Replica` no longer fully implements the `EvalContext` interface. Instead,
it implements everything but `GetClosedTimestamp()` (which is implemented by
`ImmutableEvalContext` instead).

Relates to cockroachdb#55293
Resolves cockroachdb#55461
Resolves cockroachdb#70974

Release note: None
craig bot pushed a commit that referenced this issue Apr 5, 2022
…79409 #79427 #79428 #79433 #79444

76312: kvserver, batcheval: pin Engine state during read-only command evaluation r=aayushshah15 a=aayushshah15

This commit makes it such that we eagerly pin the engine state of the `Reader`
created during the evaluation of read-only requests.

Generally, reads will hold latches throughout the course of their evaluation
(particularly, while they do their `MVCCScan`). Mainly, this commit paves the
way for us to move to a world where we avoid holding latches during the
MVCCScan. Additionally, it also lets us make MVCC garbage collection latchless
as described in #55293.

There are a few notable changes in this patch:

1. Pinning the engine state eagerly runs into #70974. To resolve this, the
closed timestamp of the `Replica` is now captured at the time the `EvalContext`
is created, and not during the command evaluation of
`QueryResolvedTimestampRequest`.

2. `EvalContext` now has a `ImmutableEvalContext` embedded into it. The
`ImmutableEvalContext` is supposed to encapsulate state that must not change
after the `EvalContext` is created. The closed timestamp of the replica is part
of the `ImmutableEvalContext`.

3. `Replica` no longer fully implements the `EvalContext` interface. Instead,
it implements everything but `GetClosedTimestamp()` (which is implemented by
`ImmutableEvalContext` instead).

Relates to #55293
Resolves #55461
Resolves #70974

Release note: None


78652: sql: implement to_reg* builtins r=otan a=e-mbrown

Resolves #77838
This commit implements the `to_regclass`, `to_regnamespace`, `to_regproc`,
`to_regprocedure`, `to_regrole`, and `to_regtype` builtins.

Release note (<category, see below>): The `to_regclass`, `to_regnamespace`, `to_regproc`,
`to_regprocedure`, `to_regrole`, and `to_regtype` builtin functions are now supported,
improving compatibility with PostgreSQL.

79022: server/status: add running non-idle jobs metric r=darinpp a=darinpp

Previously serverless was using the sql jobs running metric to determine
if a tenant process is idle and can be shut down. With the introduction
of continiously running jobs this isn't a good indicator anymore. A
recent addition is a per job metrics that show running or idle. The auto
scaler doesn't care about the individual jobs and only cares about the
total number of jobs that a running but haven't reported as being idle.
The pull rate is also very high so the retriving all the individual
running/idle metrics for each job type isn't optimal. So this PR adds a
single metric that just aggregates and tracks the total count of jobs
running and not idle.

Release justification: Bug fixes and low-risk updates to new functionality
Release note: None

Will be re-based once #79021 is merged

79157: cli: tweak slow decommission message r=knz a=cameronnunez

Release note: None

79313: opt: do not push LIMIT into the scan of a virtual table r=msirek a=msirek

Fixes #78578

Previously, a LIMIT operation could be pushed into the scan of a virtual
table with an ORDER BY clause.              

This was inadequate because in-order scans of virtual indexes aren't
supported. When an index that should provide the order requested by a
query is used, a sort is actually produced under the covers:
```
EXPLAIN(vec)
SELECT oid, typname FROM pg_type ORDER BY OID;
               info
----------------------------------
  │
  └ Node 1
    └ *colexec.sortOp
      └ *sql.planNodeToRowSource

```
Functions `CanLimitFilteredScan` and `GenerateLimitedScans` are modified
to avoid pushing LIMIT operations into ordered scans of virtual indexes. 

Release justification: Low risk fix for incorrect results in queries
involving virtual system tables.

Release note (bug fix): LIMIT queries with an ORDER BY clause which scan
the index of a virtual system tables, such as `pg_type`, could
previously return incorrect results. This is corrected by teaching the
optimizer that LIMIT operations cannot be pushed into ordered scans of
virtual indexes.


79346: ccl/sqlproxyccl: add rebalancer queue for connection rebalancing r=JeffSwenson a=jaylim-crl

#### ccl/sqlproxyccl: add rebalancer queue for rebalance requests 

This commit adds a rebalancer queue implementation to the balancer component.
The queue will be used for rebalance requests for the connection migration
work. This is done to ensure a centralized location that invokes the
TransferConnection method on the connection handles. Doing this also enables
us to limit the number of concurrent transfers within the proxy.

Release note: None

#### ccl/sqlproxyccl: run rebalancer queue processor in the background 

The previous commit added a rebalancer queue. This commit connects the queue to
the balancer, and runs the queue processor in the background. By the default,
we limit up to 100 concurrent transfers at any point in time, and each transfer
will be retried up to 3 times.

Release note: None

Jira issue: CRDB-14727

79362: kv: remove stale comment in processOneChange r=nvanbenschoten a=nvanbenschoten

The comment was added in 2fb56bd and hasn't been accurate since 5178559.

Jira issue: CRDB-14753

79368: ccl/sqlproxyccl: include DRAINING pods in the directory cache r=JeffSwenson a=jaylim-crl

Previously, #67452 removed DRAINING pods from the directory cache. This commit
adds that back. The connector will now need to filter for RUNNING pods manually
before invoking the balancer. This is needed so that we could track DRAINING
pods, and wait until 60 seconds has elapsed before transferring connections
away from them. To support that, we also update the Pod's proto definition to
include a StateTimestamp field to reprevent that timestamp that the state field
was last updated.

The plan is to have a polling mechanism every X seconds to check DRAINING pods,
and use that information to start migrating connections.

Release note: None

Jira issue: CRDB-14759

79386: colexec: remove redundant benchmarks r=yuzefovich a=yuzefovich

This commit finishes the transition of some of the benchmarks in the
colexec package started in 22.1 cycle.

Fixes: #75106.

Release note: None

Jira issue: CRDB-14783

79409: sql: refactor deps tests to use bazel r=yuzefovich a=yuzefovich

This commit refactors most `VerifyNoImports` dependency tests in the sql
folder to use the newly introduced bazel test utilities.

Release note: None

Jira issue: CRDB-14814

79427: backupccl: allow cluster restore from different tenant r=dt a=stevendanna

This removes a prohibition for cluster restores with mismatched tenant
IDs since we believe they are now correct as of #73831

This allows users to take a cluster backup in a tenant and restore it
into another tenant.

The new tenant_settings table needs special care since it may exist in
the source tenant but not the target tenant when the source tenant is
the system tenant.

In this change, we throw an error in the case of a non-empty
tenant_settings table being restored into a non-system tenant. This is
a bit user-unfriendly since we detect this error rather late in the
restore process.

Release note: None

Jira issue: CRDB-14844

79428: backupccl: Refactor encryption utility functions into their own file. r=benbardin a=benbardin

Release note: None

Jira issue: CRDB-14845

79433: sql: use new ALTER TENANT syntax in tests r=stetvendanna a=rafiss

Release note: None

79444: roachtest: warmup follower-reads for fixed duration, not fixed number of ops r=nvanbenschoten a=nvanbenschoten

Fixes #78596.

This change switches the warmup phase of the follower-read roachtest
suite from running a fixed number of operations (100) to running for a
fixed duration (15s). This should ensure that the single-region variant
of the test is given sufficient time to warm up follower reads immediately
after one of its nodes is restarted.

Before this change, the single-region variant was only being given about
500ms after startup to catch up on the closed timestamp, which made the
test flaky.

Release justification: testing only

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: Darin Peshev <[email protected]>
Co-authored-by: Cameron Nunez <[email protected]>
Co-authored-by: Mark Sirek <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@craig craig bot closed this as completed in 0a54b3d Apr 5, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 13, 2022
We allow a key (with multiple versions) to have multiple intents, under the
condition that at most one of the intents is uncommitted.

To aid this behavior we introduce a commitMap, that maintains a logical set
of TxnIDs of transactions that were "simple" in their behavior, and have
committed, where simple is defined as all the following conditions:
- No savepoint rollbacks: len(intent.IgnoredSeqNums)==0
- Single epoch, i.e., TxnMeta.Epoch==0
- Never pushed, i.e., TxnMeta.MinTimestamp==TxnMeta.WriteTimestamp
For such transactions, their provisional values can be considered committed
with the current version and local timestamp, i.e., we need no additional
information about the txn other than the TxnID.

Adding to commitMap:
The earliest a txn can be added to the commitMap is when the transaction is
in STAGING and has verified all the conditions for commit. That is, this
can be done before the transition to COMMITTED is replicated. For the node
that has the leaseholder of the range containing the txn record, this
requires no external communication. For other nodes with intents for the
txn, one could piggyback this information on the RPCs for async intent
resolution, and add to the the commitMap before doing the intent resolution
-- this piggybacking would incur 2 consensus rounds of contention. If we
are willing to send the RPC earlier, it will be contention for 1 consensus
round only. Note that these RPCs should also remove locks from the
non-persistent lock table data-structure so should send information about
the keys (like in a LockUpdate but won't remove the durable lock state).

Removal from commitMap:
The commitMap can be considered a cache of TxnIDs. It is helpful to have a
txn in the cache until its intents have been resolved. Additionally,
latchless intent resolution must pin a txn in the map before it releases
latches and unpin when the intent resolution has been applied on the
leaseholder. This pinning behavior is needed to ensure the correctness of
the in-memory concurrency.lockTable, which must maintain the property that
the replicated locks known to it are a subset of the persistent replicated
locks. We are assuming here that there is no lockTable on followers.

Why "simple-committed":
- We don't want to have to coordinate intent resolution of these multiple
  intents, by mandating that the resolution happen in any particular order.
- We want to guarantee that even if the commitMap is cleared (since it is a
  cache), we can maintain the invariant that a caller iterating over a key
  sees at most one intent. As we will illustrate below, providing this
  guarantee requires us to limit the commitMap to only contain
  simple-committed txns.

Consider a key with timestamps t5, t4, t3, t2, t1 in decreasing order and
intents for t5, t4, t2, with corresponding txns txn5, ... txn1. We consider
the disposition of an intent to be either unknown or simple-committed. In
this example, the disposition of the intent for t4 and t2 is
simple-committed solely based on the fact that there is at least one
version (provisional or otherwise) more recent that the timestamp of the
intent. That is, at most one intent, the one for t5, has a disposition that
needs to rely on knowledge that is not self-contained in the history. For
t5, we must rely on the commitMap to decide whether is unknown or
simple-committed. It is possible that some user of the
intentInterleavingIter saw t5 as simple-committed and a later user sees it
as unknown disposition, if the txn5 got removed from the commitMap -- such
regression is harmless since the latter user will simply have to do intent
resolution. Note that the intent for t5 could get resolved before those for
t4 and t2, and that is also fine since the disposition of t4, t2 stays
simple-committed. If txn5 is aborted and the intent for t5 removed, and
txn4 is no longer in the commitMap, the disposition of t4 could change to
unknown. This is also acceptable, since t5 was only serving as a "local
promise" that t4 was committed, which is simply an optimization. There is
still a less efficient globally available "promise" that t4 is committed,
and intent resolution of t4 is how we will enact that promise.

Maintaining the above guarantees requires that historical versions must not
be garbage collected without resolving intents. This is acceptable since GC
is not latency sensitive.

This PR only introduces the changes for the intentInterleavingIter.
It excludes:
- A sophisticated commitMap data-structure. There are code comments
  sketching out what properties are desirable.
- The iterForKeyVersions implementation used for intent resolution,
  now that we can have multiple intents. This will be straightforward.
- The changes for latchless intent resolution. These should be
  straightforward once we have a commitMap with pinning support and we
  extend cockroachdb#55461 to use a consistent storage snapshot for intent resolution.
- The KV layer changes to send lists of intent keys for simple-committed
  txns to the various ranges, so that they can add to their commitMap and
  remove from the in-memory lockTable.

Informs cockroachdb#66867

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants