kv: divergent ideas about whether a txn is "read-only" between the Txn and the txnIntentCollector #28256
Labels
A-kv-client
Relating to the KV client and the KV interface.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-3-erroneous-edge-case
Database produces or stores erroneous data without visible error/warning, in rare edge cases.
Comments
andreimatei added the C-bug, A-kv-client, and S-3-erroneous-edge-case labels on Aug 3, 2018
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Aug 3, 2018

This patch moves most of the logic from the client.Txn into the kv.TxnCoordSender and reorganizes much of the TxnCoordSender in the process.

The split between the client.Txn and the TxnCoordSender caused a lot of grief historically. The main problem is that both the Txn and the TCS each have their own copy of the roachpb.Transaction proto. They both use their copy for different things. We attempt to keep the two protos in sync, but we can't ensure that as there's no common locking between the two layers.

This patch keeps the client.Txn as a mostly stateless shim, allowing one to mock everything underneath. This is nice, as previously "mocking KV" was a less clear proposition - does one mock all the logic in the Txn or just the TCS? Now the TCS has all the logic and all the locking necessary for serializing accesses to the "transaction state" - notably the proto. The Txn and TCS communicate through a (now expanded) client.TxnSender interface.

Within the TCS, the biggest change is that everything that has to do with the heartbeat loop has been moved to a new interceptor.

Fixes cockroachdb#28256

Release note: none
I've done something in #28185 - the gatekeeper is now in charge of tracking whether we ever sent a BeginTxn. Wasn't that bad.
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Aug 7, 2018

This patch moves most of the logic from the client.Txn into the kv.TxnCoordSender and reorganizes much of the TxnCoordSender in the process.

The split between the client.Txn and the TxnCoordSender caused a lot of grief historically. The main problem is that both the Txn and the TCS each have their own copy of the roachpb.Transaction proto. They both use their copy for different things. We attempt to keep the two protos in sync, but we can't ensure that as there's no common locking between the two layers.

This patch keeps the client.Txn as a mostly stateless shim, allowing one to mock everything underneath. This is nice, as previously "mocking KV" was a less clear proposition - does one mock all the logic in the Txn or just the TCS? Now the TCS has all the logic and all the locking necessary for serializing accesses to the "transaction state" - notably the proto. The Txn and TCS communicate through a (now expanded) client.TxnSender interface.

Within the TCS, the biggest change is that everything that has to do with the heartbeat loop has been moved to a new interceptor. The metrics generation has also been extracted into a new interceptor.

One behavior change introduced by this patch is that heartbeat loops are no longer started for (what the TCS hopes will be) 1PC txns. The motivation was concern over the price of spawning a (shortlived) heartbeat goroutine per txn in the 1PC-heavy "kv" workload. Another one is that the TxnCoordSender doesn't inherit the old Txn logic for swallowing errors on rollbacks. Instead, we're relying on a recent server change to not return errors on rollbacks when the txn record is missing - which was the reason for said swallowing.

Fixes cockroachdb#28256

Release note: none
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Aug 8, 2018
andreimatei added a commit to andreimatei/cockroach that referenced this issue on Aug 9, 2018
craig bot pushed a commit that referenced this issue on Aug 9, 2018

28185: client, kv: move logic out of Txn, rewrite some of the TxnCoordSender r=andreimatei a=andreimatei

This patch moves most of the logic from the client.Txn into the kv.TxnCoordSender and reorganizes much of the TxnCoordSender in the process.

The split between the client.Txn and the TxnCoordSender caused a lot of grief historically. The main problem is that both the Txn and the TCS each have their own copy of the roachpb.Transaction proto. They both use their copy for different things. We attempt to keep the two protos in sync, but we can't ensure that as there's no common locking between the two layers.

This patch keeps the client.Txn as a mostly stateless shim, allowing one to mock everything underneath. This is nice, as previously "mocking KV" was a less clear proposition - does one mock all the logic in the Txn or just the TCS? Now the TCS has all the logic and all the locking necessary for serializing accesses to the "transaction state" - notably the proto. The Txn and TCS communicate through a (now expanded) client.TxnSender interface.

Within the TCS, the biggest change is that everything that has to do with the heartbeat loop has been moved to a new interceptor. The metrics generation has also been extracted into a new interceptor.

One behavior change introduced by this patch is that heartbeat loops are no longer started for (what the TCS hopes will be) 1PC txns. The motivation was concern over the price of spawning a (shortlived) heartbeat goroutine per txn in the 1PC-heavy "kv" workload. Another one is that the TxnCoordSender doesn't inherit the old Txn logic for swallowing errors on rollbacks. Instead, we're relying on a recent server change to not return errors on rollbacks when the txn record is missing - which was the reason for said swallowing.

Fixes #28256

Release note: none

Co-authored-by: Andrei Matei <[email protected]>
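To make the "stateless shim" idea from the commit message concrete, here is a minimal sketch. It is not the actual CockroachDB code: `txnState`, `txnSender`, `coordSender`, `txn`, and `mockSender` are all hypothetical, heavily reduced stand-ins for the roachpb.Transaction proto, the client.TxnSender interface, the TxnCoordSender, and the client.Txn. The point it illustrates is that a single layer owns both the state and the lock, and the layer above keeps no copy of its own.

```go
package main

import "sync"

// txnState is a hypothetical stand-in for the transaction proto.
type txnState struct {
	writing bool
}

// txnSender is a hypothetical, much-reduced analogue of an interface
// through which the shim talks to the coordinator.
type txnSender interface {
	send(isWrite bool)
	state() txnState
}

// coordSender owns the only copy of the state and the only lock,
// so there is nothing to keep in sync with the layer above.
type coordSender struct {
	mu sync.Mutex
	st txnState
}

func (c *coordSender) send(isWrite bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if isWrite {
		c.st.writing = true
	}
}

// state returns a copy of the state taken under the lock.
func (c *coordSender) state() txnState {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.st
}

// txn is the shim: it holds no transaction state, only a sender.
type txn struct{ sender txnSender }

func (t *txn) Get()            { t.sender.send(false) }
func (t *txn) Put()            { t.sender.send(true) }
func (t *txn) IsWriting() bool { return t.sender.state().writing }

// mockSender shows that everything under the shim can be replaced in tests.
type mockSender struct{ sends int }

func (m *mockSender) send(bool)       { m.sends++ }
func (m *mockSender) state() txnState { return txnState{} }
```

With this shape, a test can build `&txn{sender: &mockSender{}}` and observe every call without touching coordinator logic, while `&txn{sender: &coordSender{}}` answers `IsWriting()` from the one locked copy of the state.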
The intent collector interceptor complains if it receives an `EndTransaction` without having collected any intents beforehand. It returns an error saying that it's a "readonly txn" and that the `EndTxn` should have been elided above it. That error is supposed to signal bugs.

cockroach/pkg/kv/txn_interceptor_intent_collector.go
Line 102 in 8044dea

Unfortunately, that assumption is not quite copacetic with what the layers above it do. There are currently two problems:

1. When the `Txn` moves to the `txnError` error state, it forgets whether it ever performed any writes (i.e. whether it came from state `txnReadonly` or `txnWriting`, etc.), and so it will never elide future `EndTransactions` (rollbacks).
2. If the `BeginTxn` batch is rejected below the `Txn` layer, in the `TxnCoordSender`, then the `client.Txn` considers the transaction to be writing (because it sent a `BeginTxn`), but the intent collector considers it read-only (because it never saw said `BeginTxn`). This happens, for example, if the `Stopper` is stopped when the `BeginTransaction` is sent, and the `TCS` fails to start the heartbeat loop and rejects the batch.

The first problem goes away in #28185 because that PR breaks apart the different txn states and correctly tracks whether a `BeginTxn` was ever sent.

The second problem is more fundamental, caused by the separate tracking of the `BeginTxn` done in both the TCS and the intent collector.

It's not very clear to me what to do about it. We could try to share the "did we send a BeginTxn" state. This is a bit complicated by the fact that, in #28185, the BeginTxn tracking is not done by the TCS directly, but by the heartbeat interceptor. So we'd need to create a communication channel between two interceptors. I think what we'd want is for the `txnLockGatekeeper` to keep track of whether a `BeginTxn` is really sent to the server, and for the heartbeat interceptor to use that to dictate whether `EndTransaction` can be elided. Then any interceptor anywhere in the stack can retain the right to reject batches (as they tend to do already) and the intent collector can be left alone with its current assumption: that if it sees an EndTxn it must have accumulated some intents.

Alternatively, the intent collector could get its own logic for eliding the end transaction, duplicating the existing one.

Yet alternatively, the intent collector could stop assuming anything and conservatively forward unnecessary EndTxns.
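A rough sketch of the first option (the gatekeeper tracks whether a BeginTxn really went out, and the heartbeat interceptor uses that to elide the EndTransaction) might look like the following. Everything here is a hypothetical simplification written for illustration: `batch`, `reqKind`, `gatekeeper`, `intentCollector`, `heartbeater`, and `newStack` are made-up names, not the real interceptor types, and the real stack has more layers and real locking.

```go
package main

import "errors"

// NOTE: every type and name here is a hypothetical, simplified stand-in;
// none of this is the actual CockroachDB code.

type reqKind int

const (
	beginTxn reqKind = iota
	put
	endTxn
)

// batch is a toy batch request: just a list of request kinds.
type batch struct{ reqs []reqKind }

func (b batch) has(k reqKind) bool {
	for _, r := range b.reqs {
		if r == k {
			return true
		}
	}
	return false
}

// sender is one layer of the interceptor stack.
type sender interface {
	send(b batch) error
}

var (
	errRejected = errors.New("batch rejected: heartbeat loop could not start")
	errReadOnly = errors.New("readonly txn: EndTxn should have been elided")
)

// gatekeeper is the bottom of the stack. It is the single place that
// records whether a BeginTxn was really sent to the server.
type gatekeeper struct {
	beginTxnSent bool
	sentBatches  int
}

func (g *gatekeeper) send(b batch) error {
	if b.has(beginTxn) {
		g.beginTxnSent = true
	}
	g.sentBatches++
	return nil
}

// intentCollector keeps its current assumption: an EndTxn that reaches it
// without any collected intents indicates a bug above it.
type intentCollector struct {
	wrapped sender
	intents int
}

func (c *intentCollector) send(b batch) error {
	if b.has(put) {
		c.intents++
	}
	if b.has(endTxn) && c.intents == 0 {
		return errReadOnly
	}
	return c.wrapped.send(b)
}

// heartbeater may reject a BeginTxn batch, and it elides a standalone
// EndTxn when the gatekeeper reports that no BeginTxn was ever sent.
type heartbeater struct {
	wrapped     sender
	gk          *gatekeeper
	failToStart bool
}

func (h *heartbeater) send(b batch) error {
	if b.has(beginTxn) && h.failToStart {
		return errRejected
	}
	if b.has(endTxn) && !b.has(beginTxn) && !h.gk.beginTxnSent {
		return nil // nothing reached the server, so there is nothing to end
	}
	return h.wrapped.send(b)
}

// newStack wires heartbeater -> intentCollector -> gatekeeper.
func newStack(failToStart bool) (sender, *gatekeeper) {
	gk := &gatekeeper{}
	collector := &intentCollector{wrapped: gk}
	return &heartbeater{wrapped: collector, gk: gk, failToStart: failToStart}, gk
}
```

In this toy version, building the stack with `newStack(true)`, sending a `{beginTxn, put}` batch (which is rejected) and then a standalone `{endTxn}` rollback never reaches the intent collector, so its "readonly txn" check is not tripped. With `newStack(false)` the same two batches flow all the way down to the gatekeeper.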
@nvanbenschoten for thoughts, if any.
Thanks @jordanlewis for seeing something and saying something.
The error can be demonstrated with