-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv,client,storage: rationalize TxnCoordSender/client.Txn redundant states #25541
Conversation
26987a5
to
033c6e8
Compare
LGTM, but I'm concerned about how this will work in mixed-version clusters. Is it safe to just make the change? (It might be. It's hard to tell how much we were relying on the server-set Writing flag before) Reviewed 19 of 20 files at r1, 3 of 3 files at r2. pkg/internal/client/txn.go, line 658 at r2 (raw file):
Should we just reuse the pkg/sql/sem/tree/stmt.go, line 652 at r2 (raw file):
Removing this from pkg/storage/batcheval/cmd_begin_transaction.go, line 134 at r2 (raw file):
I don't think it's safe to remove this without a cluster version check. Won't 2.0 nodes rely on this flag being set in BeginTransaction? Comments from Reviewable |
033c6e8
to
ba5b044
Compare
I think you're right; I've added back the server-side code setting that flag gated by a new version not being active. Review status: 17 of 26 files reviewed at latest revision, 3 unresolved discussions. pkg/internal/client/txn.go, line 658 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
ok, shuffled things around and this is gone pkg/sql/sem/tree/stmt.go, line 652 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
ok, added a commit removing most of these pkg/storage/batcheval/cmd_begin_transaction.go, line 134 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
put it back behind a version check Comments from Reviewable |
ce6d15b
to
1077464
Compare
LGTM (mod comments), since it now looks like you're handling things correctly with mixed-version clusters. Reviewed 16 of 20 files at r1, 10 of 10 files at r3, 3 of 3 files at r4, 1 of 1 files at r5. pkg/internal/client/sender.go, line 112 at r3 (raw file):
I'd keep this as is and add a second pkg/internal/client/txn.go, line 72 at r3 (raw file):
Did you look into merging this status into the pkg/internal/client/txn.go, line 952 at r3 (raw file):
pkg/internal/client/txn.go, line 953 at r3 (raw file):
nit: flip this around so that it's easier to read and compare to the previous line:
pkg/kv/dist_sender_server_test.go, line 2529 at r3 (raw file):
? pkg/kv/txn_coord_sender.go, line 979 at r3 (raw file):
Why do we suddenly need this? pkg/kv/txn_coord_sender.go, line 1288 at r3 (raw file):
💯 pkg/kv/txn_coord_sender.go, line 1314 at r3 (raw file):
:101: pkg/roachpb/batch.go, line 409 at r3 (raw file):
Will an EndTransactionRequest ever have an pkg/roachpb/data.proto, line 338 at r3 (raw file):
This first sentence is no longer accurate. Comments from Reviewable |
4236eb0
to
c3956be
Compare
Review status: 22 of 26 files reviewed at latest revision, 13 unresolved discussions. pkg/internal/client/sender.go, line 112 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
renamed to pkg/internal/client/txn.go, line 72 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I looked now. Unfortunately this sucker doesn't match with any of the states. It means "has any request ever been sent". It's not a very important field... I'll leave it for another cleanup :) pkg/internal/client/txn.go, line 952 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/internal/client/txn.go, line 953 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/kv/dist_sender_server_test.go, line 2529 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/kv/txn_coord_sender.go, line 979 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think we always needed it. I forgot why I added it here exactly; it could be that I just noticed some error while debugging. pkg/roachpb/batch.go, line 409 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/roachpb/data.proto, line 338 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. Comments from Reviewable |
Review status: 12 of 26 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed. pkg/internal/client/sender.go, line 112 at r3 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Sorry, I also meant that you should leave Comments from Reviewable |
4678aba
to
17341eb
Compare
I've hit a snag here, surfaced by TestClientRunConcurrentTransaction:
Everything here was true before this PR too, up until the move to the One question I have for the group is: is it surprising that the Review status: 8 of 31 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed. Comments from Reviewable |
I think TransactionReplayError is special and should not move the txn to the TxnError state. We make this an error to guard against BeginTransactions being reordered after a committing EndTransaction, but in this case we know that there could not have been a committing ET (with matching epoch) because only this client.Txn could have sent it. Therefore we can safely restart the txn (at a higher timestamp) without going into the error state. |
2ca91a5
to
2af8758
Compare
The problem shown here by the @nvanbenschoten PTAL at the first commit. Review status: 7 of 34 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed. Comments from Reviewable |
2af8758
to
7b04474
Compare
Added one more commit that I was telling you about @nvanbenschoten for binding the TCS to a txn early. PTAL Review status: 0 of 33 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. Comments from Reviewable |
As I was telling Nathan, I've made a change to r11 adding copying of a few fields from a requests's Txn to the TxnCoordSender's copy of the proto, which proved to be necessary. This is to compensate for the effects of the previous cloning of the requests's proto on the first Review status: 0 of 34 files reviewed at latest revision, 6 unresolved discussions, some commit checks pending. Comments from Reviewable |
bors r+ Review status: 0 of 34 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. Comments from Reviewable |
25541: kv,client,storage: rationalize TxnCoordSender/client.Txn redundant states r=andreimatei a=andreimatei client.Txn and TCS try to maintain a bunch of logically-redundant state about whether a transaction is "writing" - essentially whether an EndTransaction needs to be sent to cleanup up the TCS heartbeat loop and the server's txn record. The logic that both parties used for this was complex (e.g. it involved updates in both Txn and TCS both on the outgoing path and on the returning path of a batch) and not in sync - sometimes the TCS would consider the txn as "writing" and the client.Txn wouldn't (e.g. in case the first writing batch got an ambiguous error). This patch simplifies things: the idea is that, if a BeginTxn has been sent, an EndTransaction needs to be sent, period. The client.Txn thus only keeps track of whether a BeginTxn was sent (except for a 1PC batch), and it takes charge of starting the TCS' heartbeat loop (by instructing it explicitly directly to start it before the BeginTxn is sent). The TCS is no longer burdened with maintaining any state about whether there is a txn record or not. As a byproduct, the proto Transaction.Writing flag, which used to have an unclear meaning, becomes straight forward: if set, the server needs to check batches against the abort cache. The client is the only one setting it, the server is the only one checking it. It used to be used for different purposeses by both the client and server. Release note: none Co-authored-by: Andrei Matei <[email protected]>
Build succeeded |
TestTxnCoordSenderRetries is flaky, presumably as a consequence of merging cockroachdb#25541. A subtest is sometimes failing because a CommandFilter that it's using is inadvertently injecting a failure into a lingering request from a previous subtest. Don't know why that PR changed any behavior yet, but it seems more likely that this is a test issue than issue warranting a rollback, so skipping while I figure it out. Touches cockroachdb#26434 Release note: None
26442: kv: skip TestTxnCoordSenderRetries r=andreimatei a=andreimatei TestTxnCoordSenderRetries is flaky, presumably as a consequence of merging #25541. A subtest is sometimes failing because a CommandFilter that it's using is inadvertently injecting a failure into a lingering request from a previous subtest. Don't know why that PR changed any behavior yet, but it seems more likely that this is a test issue than issue warranting a rollback, so skipping while I figure it out. Touches #26434 Release note: None Co-authored-by: Andrei Matei <[email protected]>
A txn's heartbeat loop is generally stopped when, upon a successful request, the response's txn is no longer PENDING. This was insufficient; the loop should always be closed after an EndTransaction(commit=false), regardless of whether it results in success or error. The heartbeat loop happens to be currently closed by the context cancelation that the db.Txn() API performs, but that is going away. This fixes cockroachdb#26434 - TestTxnCoordSenderRetries had become flaky after cockroachdb#25541 because cockroachdb#25441 caused EndTransactions to be sent in some situations where they weren't before. What's going on in that test is that a subtest was leaking a heartbeat loop that was stopped after the subtest finished; the EndTxn sent when the loop finally was being stopped was interfering with a CommandFilter installed by a different subtest. Before cockroachdb#25441, the first subtest was waiting for the heartbeat loop to be done because of its own CommandFilter. However, with cockroachdb#25441, the first subtest's CommandFilter was being satisfied by a different, newly introduced EndTransaction. Release note: None
A txn's heartbeat loop is generally stopped when, upon a successful request, the response's txn is no longer PENDING. This was insufficient; the loop should always be closed after an EndTransaction(commit=false), regardless of whether it results in success or error. The heartbeat loop happens to be currently closed by the context cancelation that the db.Txn() API performs, but that is going away. This fixes cockroachdb#26434 - TestTxnCoordSenderRetries had become flaky after cockroachdb#25541 because cockroachdb#25441 caused EndTransactions to be sent in some situations where they weren't before. What's going on in that test is that a subtest was leaking a heartbeat loop that was stopped after the subtest finished; the EndTxn sent when the loop finally was being stopped was interfering with a CommandFilter installed by a different subtest. Before cockroachdb#25441, the first subtest was waiting for the heartbeat loop to be done because of its own CommandFilter. However, with cockroachdb#25441, the first subtest's CommandFilter was being satisfied by a different, newly introduced EndTransaction. Release note: None
The recent cockroachdb#25541 changed the way "tracking" (the heartbeat loop and intent collection) is initiated for transactions. It aimed to simplify things and put the burden on the client to decide when a txn needs tracking. This introduced a problem - the client.Txn was not initiating tracking when sending 1PC batches. However, tracking is needed for these transactions too: even though usually they'll succeed and so the TxnCoordSender state can be quickly destroyed, when they fail their intents and heartbeat loop need to be kept around just like for any other txn. This patch backtracks on the previous move to make it the client's responsibility to initiate tracking (it didn't stand the test of time): the client.Txn is no longer in charge of calling tcs.StartTracking(). Instead, the TCS does whatever needs to be done when it sees an EndTransaction. I also took the opportunity to spruce up comments on the TxnCoordSender. Release note: None
A txn's heartbeat loop is generally stopped when, upon a successful request, the response's txn is no longer PENDING. This was insufficient; the loop should always be closed after an EndTransaction(commit=false), regardless of whether it results in success or error. The heartbeat loop happens to be currently closed by the context cancelation that the db.Txn() API performs, but that is going away. This fixes cockroachdb#26434 - TestTxnCoordSenderRetries had become flaky after cockroachdb#25541 because cockroachdb#25441 caused EndTransactions to be sent in some situations where they weren't before. What's going on in that test is that a subtest was leaking a heartbeat loop that was stopped after the subtest finished; the EndTxn sent when the loop finally was being stopped was interfering with a CommandFilter installed by a different subtest. Before cockroachdb#25441, the first subtest was waiting for the heartbeat loop to be done because of its own CommandFilter. However, with cockroachdb#25441, the first subtest's CommandFilter was being satisfied by a different, newly introduced EndTransaction. Release note: None
26479: kv: stop the heartbeat loop on rollback errors r=andreimatei a=andreimatei A txn's heartbeat loop is generally stopped when, upon a successful request, the response's txn is no longer PENDING. This was insufficient; the loop should always be closed after an EndTransaction(commit=false), regardless of whether it results in success or error. The heartbeat loop happens to be currently closed by the context cancelation that the db.Txn() API performs, but that is going away. This fixes #26434 - TestTxnCoordSenderRetries had become flaky after #25541 because #25441 caused EndTransactions to be sent in some situations where they weren't before. What's going on in that test is that a subtest was leaking a heartbeat loop that was stopped after the subtest finished; the EndTxn sent when the loop finally was being stopped was interfering with a CommandFilter installed by a different subtest. Before #25441, the first subtest was waiting for the heartbeat loop to be done because of its own CommandFilter. However, with #25441, the first subtest's CommandFilter was being satisfied by a different, newly introduced EndTransaction. Release note: None 26516: sql: disable optimizer if force-lookup-joins flag set r=RaduBerinde a=RaduBerinde The experimental flag to force lookup joins doesn't work with the optimizer (it is disabled on opt-generated plans). To allow the flag to still function for now, we disable the optimizer if the flag is set. Release note: None Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Radu Berinde <[email protected]>
The recent cockroachdb#25541 changed the way "tracking" (the heartbeat loop and intent collection) is initiated for transactions. It aimed to simplify things and put the burden on the client to decide when a txn needs tracking. This introduced a problem - the client.Txn was not initiating tracking when sending 1PC batches. However, tracking is needed for these transactions too: even though usually they'll succeed and so the TxnCoordSender state can be quickly destroyed, when they fail their intents and heartbeat loop need to be kept around just like for any other txn. This patch backtracks on the previous move to make it the client's responsibility to initiate tracking (it didn't stand the test of time): the client.Txn is no longer in charge of calling tcs.StartTracking(). Instead, the TCS does whatever needs to be done when it sees an EndTransaction. I also took the opportunity to spruce up comments on the TxnCoordSender. Release note: None
The recent cockroachdb#25541 changed the way "tracking" (the heartbeat loop and intent collection) is initiated for transactions. It aimed to simplify things and put the burden on the client to decide when a txn needs tracking. This introduced a problem - the client.Txn was not initiating tracking when sending 1PC batches. However, tracking is needed for these transactions too: even though usually they'll succeed and so the TxnCoordSender state can be quickly destroyed, when they fail their intents and heartbeat loop need to be kept around just like for any other txn. This patch backtracks on the previous move to make it the client's responsibility to initiate tracking (it didn't stand the test of time): the client.Txn is no longer in charge of calling tcs.StartTracking(). Instead, the TCS does whatever needs to be done when it sees an EndTransaction. I also took the opportunity to spruce up comments on the TxnCoordSender. Release note: None
26497: kv: fix issues around failed 1PC txns r=andreimatei a=andreimatei The recent #25541 changed the way "tracking" (the heartbeat loop and intent collection) is initiated for transactions. It aimed to simplify things and put the burden on the client to decide when a txn needs tracking. This introduced a problem - the client.Txn was not initiating tracking when sending 1PC batches. However, tracking is needed for these transactions too: even though usually they'll succeed and so the TxnCoordSender state can be quickly destroyed, when they fail their intents and heartbeat loop need to be kept around just like for any other txn. This patch backtracks on the previous move to make it the client's responsibility to initiate tracking (it didn't stand the test of time): the client.Txn is no longer in charge of calling tcs.StartTracking(). Instead, the TCS does whatever needs to be done when it sees an EndTransaction. I also took the opportunity to spruce up comments on the TxnCoordSender. Release note: None Co-authored-by: Andrei Matei <[email protected]>
This store dump version was accidentally broken in cockroachdb#25541. Release note: None
26787: rfc: ALTER COLUMN TYPE TableAction plan r=bobvawter a=bobvawter FYI: I was doing this as a separate document, but since it's motivated by `ALTER_COLUMN_TYPE`, it seems like it belongs in this doc. The TL;DR is that `TableMutations` is doing two jobs and doesn't have enough specificity to effectively handle multi-step schema changes. Update the `ALTER COLUMN TYPE` RFC with a proposed plan for migrating state management from `TableDescriptor.Mutations` into a new `TableActions` collection. Release note: None 26807: roachtest: fix store dump version r=nvanbenschoten a=nvanbenschoten This store dump version was accidentally broken in #25541. Release note: None Co-authored-by: Bob Vawter <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
client.Txn and TCS try to maintain a bunch of logically-redundant state
about whether a transaction is "writing" - essentially whether an
EndTransaction needs to be sent to cleanup up the TCS heartbeat loop and
the server's txn record.
The logic that both parties used for this was complex (e.g. it involved
updates in both Txn and TCS both on the outgoing path and on the
returning path of a batch) and not in sync - sometimes the TCS would
consider the txn as "writing" and the client.Txn wouldn't (e.g. in case
the first writing batch got an ambiguous error).
This patch simplifies things: the idea is that, if a BeginTxn has been
sent, an EndTransaction needs to be sent, period. The client.Txn thus
only keeps track of whether a BeginTxn was sent (except for a 1PC
batch), and it takes charge of starting the TCS' heartbeat loop (by
instructing it explicitly directly to start it before the BeginTxn is
sent). The TCS is no longer burdened with maintaining any state about
whether there is a txn record or not.
As a byproduct, the proto Transaction.Writing flag, which used to have
an unclear meaning, becomes straight forward: if set, the server needs
to check batches against the abort cache. The client is the only one
setting it, the server is the only one checking it. It used to be used
for different purposeses by both the client and server.
Release note: none