kv: we should create txn records eagerly on retries #57042

Closed
andreimatei opened this issue Nov 24, 2020 · 12 comments · Fixed by #67215
Assignees
andreimatei
Labels
A-kv-transactions (Relating to MVCC and the transactional model), C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), N-followup (Needs followup), O-postmortem (Originated from a Postmortem action item), T-kv (KV Team)

Comments

@andreimatei
Contributor

andreimatei commented Nov 24, 2020

In https://github.com/cockroachlabs/support/issues/674 we see some contending queries thrashing and constantly retrying. These queries are delete from system.jobs where id=any(...list of 1000 ids), and we attempt to run them as 1PC (one-phase commit) transactions. The situation would probably be greatly improved if these queries had a txn record, so they wouldn't constantly push/abort each other.

We already have code that splits the EndTxn off from the rest of the batch at the DistSender level when epoch > 0. We should do something similar for the heartbeat interceptor and start heartbeating on the first write when epoch > 0. We need to take care to set the EndTxn.TxnHeartbeating flag.
We probably also want to cover the TransactionAbortedError cases with this policy, so one thing we've discussed is finding a way to start transactions resulting from a TransactionAbortedError at a higher epoch: basically, use the epoch as a retry counter across retry types. A sketch of this policy follows below.
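
For illustration only, here is a minimal sketch of that policy under stated assumptions (txnState, onRetry, and onFirstWrite are hypothetical stand-ins, not the actual kvcoord interceptors): treat the epoch as a retry counter and heartbeat eagerly from the first write whenever epoch > 0.

```go
// Hypothetical sketch of the policy above; none of these types exist in the
// real kvcoord code.
package main

import "fmt"

type txnState struct {
	epoch        int
	heartbeating bool
}

// onRetry bumps the epoch for any kind of retry (including, per the proposal,
// retries caused by a TransactionAbortedError), so a retried attempt is
// recognizable as "not the first try".
func (t *txnState) onRetry() { t.epoch++ }

// onFirstWrite starts the heartbeat loop eagerly on retried attempts.
func (t *txnState) onFirstWrite() {
	if t.epoch > 0 && !t.heartbeating {
		t.heartbeating = true // the real change would also set EndTxn.TxnHeartbeating
		fmt.Println("starting heartbeat loop eagerly at epoch", t.epoch)
	}
}

func main() {
	var t txnState
	t.onFirstWrite() // epoch 0: still aim for 1PC, no heartbeat loop
	t.onRetry()      // retry, e.g. after an abort or a retryable error
	t.onFirstWrite() // epoch 1: heartbeat from the first write
}
```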

cc @nvanbenschoten

Epic: CRDB-8282

gz#9005

andreimatei added the C-bug and A-kv-transactions labels Nov 24, 2020
andreimatei self-assigned this Nov 24, 2020
tbg added the O-postmortem label May 20, 2021
lunevalex added the N-followup label May 28, 2021
jlinder added the T-kv label Jun 16, 2021
@tbg
Member

tbg commented Jun 22, 2021

I wrote this on an internal support issue (https://github.com/cockroachlabs/support/issues/1019#issuecomment-865923066) and thought I'd replicate it here:


In a perfect world, once the STAGING record is written, we'd quickly start the heartbeat loop. But this is not how it works with the current architecture, and it also wouldn't be enough. Say you have the whole txn in a single batch, Put(a), Put(b), Commit(a), and there's a range split between a and b. TxnCoordSender won't start the heartbeat loop, as we discussed, but at the DistSender level we actually send two batches, Put(a)+Staging(a) and Put(b), in parallel, not one.

Now consider two cases separately: when there's a lot of contention on a xor on b.

contention on a: the first batch will basically spend a lot of time in contention handling. So it does not actually "ever" write the staging txn record. In the meantime, the second batch easily gets through. Someone might try to push the intent at b; after a few seconds they'll just go and write an ABORTED txn record for our txn. Boom.

contention on b: this time the staging record shows up right away. Someone might contend on a. They see our staging record, but nobody is heartbeating it, so after a few seconds they abort us by invalidating our planned write on b, which is stuck in contention. All the while, the DistSender has gotten a response from the Put(a)+Staging(a) batch, but it cannot return up the stack to the TxnCoordSender (which owns starting the heartbeat loop) until the other batch has also come back, which won't happen in time.
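
To make these scenarios concrete, here is a minimal standalone sketch (hypothetical types and delays, not the actual DistSender/TxnCoordSender code) showing that the two sub-batches are sent in parallel and that control only returns to the layer that owns the heartbeat loop once both have come back:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// subBatch stands in for one of the partial batches sent by the DistSender;
// delay stands in for time spent in contention handling.
type subBatch struct {
	name  string
	delay time.Duration
}

// sendParallel mimics the DistSender fanning out sub-batches concurrently and
// only returning once all of them have finished.
func sendParallel(batches []subBatch) {
	var wg sync.WaitGroup
	for _, b := range batches {
		wg.Add(1)
		go func(b subBatch) {
			defer wg.Done()
			time.Sleep(b.delay)
			fmt.Printf("%s done after %s\n", b.name, b.delay)
		}(b)
	}
	wg.Wait() // nothing returns to the TxnCoordSender (heartbeat owner) before this
}

func main() {
	sendParallel([]subBatch{
		{name: "Put(a)+Staging(a)", delay: 10 * time.Millisecond},
		{name: "Put(b)", delay: 50 * time.Millisecond}, // e.g. contention on b
	})
	fmt.Println("only now could the heartbeat loop start")
}
```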

@tbg
Member

tbg commented Jun 22, 2021

Now in a "truly ideal" world, we'd know "exactly" (to the degree that this is possible on the kvcoord without talking to kvserver) when to start the hb loop, no? We want to start the heartbeat loop when we split the batch in DistSender (or when we know for other reasons that this won't be a 1PC txn, etc). The problem is really that the hb loop is owned by TxnCoordSender, and it doesn't know what DistSender is doing.

It seems that in the current proposal we are trying, maybe a bit too hard, to bend over backwards to accommodate the way things are currently set up. In a sense, if we squint a bit and say that DistSender is really the bottom of the TxnCoordSender stack, aren't we seeing here that txnHeartbeater should be at the bottom (or at least under DistSender)? That way it could see the individual batches sent by DistSender, and its "omit the heartbeat loop if everything is in one batch" heuristic would be more appropriate; in fact, wouldn't it be optimal? If it omitted the heartbeat loop and the batch wasn't 1PC, the scenarios above where some intents are blocked and the txn becomes vulnerable don't apply (since, if nothing else, we are sure that the batch hits a single range only, so its intents become visible in a single Raft command). We'd start the heartbeat loop only once the batch was written (in one go, on a single range), but that's fine.

@sumeerbhola
Collaborator

contention on a: the first batch will basically spend a lot of time in contention handling. So it does not actually "ever" write the staging txn record. ...

It sounds like we are still vulnerable if the range where the txn record needs to be written also has other writes that are contended, while a different range with writes has no contention. On retry, we should also be splitting off the txn record write into its own batch so that it is written without contention; otherwise it would repeat the same cycle.

@tbg
Member

tbg commented Jun 22, 2021

No: if we start a heartbeat loop, that will create the txn record even if the batch that wants to make it STAGING is held up.

@tbg
Member

tbg commented Jun 22, 2021

(And if we do something like #57042 (comment), then the only case in which we don't start the hb loop is the one where the RPC actually hits a single range, so intents will never be visible before the txn record.)

@erikgrinaker
Contributor

aren't we seeing here that txnHeartbeater should be at the bottom (or at least under DistSender)?

It needs to be above some of the other interceptors to process EndTxn(commit=false) requests that it sends when finding an aborted record. In particular, it needs the txnPipeliner to attach the lock spans.

@erikgrinaker
Contributor

erikgrinaker commented Jul 1, 2021

I've looked into the DistSender/TxnCoordSender coordination here a bit. I think the best approach would be to have DistSender.Send() return a typed error, similar to errNo1PCTxn, when the 1PC batch needs to be split, and move the responsibility for splitting the batch up the stack into the TxnCoordSender (probably the txnCommitter interceptor). This would allow it to kick off the heartbeat loop without coupling the DistSender and TxnCoordSender.

If this doesn't work out for whatever reason, plan B is to use a sync.Map on the DistSender containing channels keyed by txn ID, and close the registered channel (if any) whenever the DistSender splits a batch. We would have some DistSender mechanism (e.g. a function) to register split triggers, and inject this into the TxnCoordSenderFactory. This would avoid tight coupling and Sender interface changes, but it does suggest that we've gotten our abstraction boundaries wrong, so I'd prefer to avoid it if possible.
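
For illustration, a minimal sketch of the plan-B mechanism under simplified assumptions (txn IDs as plain strings and a hypothetical splitNotifier standing in for the hook injected into the TxnCoordSenderFactory; this is not the real DistSender code):

```go
package main

import (
	"fmt"
	"sync"
)

// splitNotifier would live on the DistSender.
type splitNotifier struct {
	m sync.Map // txn ID -> chan struct{}
}

// register is called from the TxnCoordSender side before sending a batch.
func (n *splitNotifier) register(txnID string) <-chan struct{} {
	ch := make(chan struct{})
	n.m.Store(txnID, ch)
	return ch
}

// onSplit is called by the DistSender when it splits a batch for this txn.
func (n *splitNotifier) onSplit(txnID string) {
	if ch, ok := n.m.LoadAndDelete(txnID); ok {
		close(ch.(chan struct{})) // notify the registered waiter exactly once
	}
}

func main() {
	var n splitNotifier
	split := n.register("txn-1")

	go n.onSplit("txn-1") // the DistSender decides it has to split the batch

	<-split // the txnHeartbeater learns about the split
	fmt.Println("batch was split; start the heartbeat loop")
}
```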

@nvanbenschoten
Member

I've looked into the DistSender/TxnCoordSender coordination here a bit. I think the best approach would be to have DistSender.Send() return a typed error, similar to errNo1PCTxn, when the 1PC batch needs to be split, and move the responsibility for splitting the batch up the stack into the TxnCoordSender (probably the txnCommitter interceptor). This would allow it to kick off the heartbeat loop without coupling the DistSender and TxnCoordSender.

This is an interesting idea. A slight variation of it would be to keep the responsibility for splitting a batch in the DistSender, but to inform the DistSender of when a BatchRequest is sent for a txn that is not yet heartbeating - which we actually already do with EndTxn.TxnHeartbeating. We could then check this flag when we see a batch that needs to be split with an EndTxn request. In such cases, we would throw an error. We would then catch this error in the txnHeartbeater, kick off the heartbeat loop, flip the flag to true, and re-issue the batch.
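
A rough sketch of that flow, with a hypothetical sentinel error and simplified types (not the actual kvcoord code), might look like this:

```go
package main

import (
	"errors"
	"fmt"
)

// errMustSplitNonHeartbeating is a hypothetical sentinel, in the spirit of
// errNo1PCTxn, returned when a committing batch would have to be split for a
// txn that is not yet heartbeating.
var errMustSplitNonHeartbeating = errors.New("batch must be split, but txn is not heartbeating")

type batch struct {
	spansMultipleRanges bool // would normally be discovered via range descriptors
	txnHeartbeating     bool // stands in for EndTxn.TxnHeartbeating
}

// distSenderSend refuses to split a committing batch for a non-heartbeating txn.
func distSenderSend(b batch) error {
	if b.spansMultipleRanges && !b.txnHeartbeating {
		return errMustSplitNonHeartbeating
	}
	return nil // sent (and possibly split) successfully
}

// txnHeartbeaterSend catches the error, starts the heartbeat loop, flips the
// flag, and re-issues the batch.
func txnHeartbeaterSend(b batch) error {
	err := distSenderSend(b)
	if errors.Is(err, errMustSplitNonHeartbeating) {
		fmt.Println("starting heartbeat loop")
		b.txnHeartbeating = true
		return distSenderSend(b)
	}
	return err
}

func main() {
	fmt.Println("re-issued send error:", txnHeartbeaterSend(batch{spansMultipleRanges: true}))
}
```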

This may have an impact on performance, so we'd want to check. The most trivial workload that would be impacted by this back-and-forth would be single-row inserts into a table with two indexes: kv --read-percent=0 --secondary-index.


I think it would also be worth doing an experiment on how expensive it would be to always assume a transaction could have sent a heartbeat and always set EndTxn.TxnHeartbeating to true, which mandates a read of the transaction record here for 1PC txns. If this is cheap enough, maybe we should consider kicking off the transaction heartbeat loop unconditionally. I'm sure we could optimize the client side of things to make this cheap in the vastly common case where the batch returns before the first heartbeat is fired. For instance, we could defer the txn heartbeat goroutine creation for DefaultTxnHeartbeatInterval by handing the entire thing to time.AfterFunc and then cancelling the timer on the way back up.
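
A minimal sketch of that deferral using only the standard library (heartbeatInterval, runHeartbeatLoop, and sendBatch are illustrative names, not the real txnHeartbeater code):

```go
package main

import (
	"fmt"
	"time"
)

const heartbeatInterval = 1 * time.Second // stand-in for DefaultTxnHeartbeatInterval

func runHeartbeatLoop() {
	// The periodic HeartbeatTxn requests would be issued from here.
	fmt.Println("heartbeat loop started")
}

// sendBatch schedules the heartbeat loop with time.AfterFunc and cancels the
// timer if the batch returns before the first heartbeat would fire.
func sendBatch(work func()) {
	t := time.AfterFunc(heartbeatInterval, runHeartbeatLoop)
	work()
	if t.Stop() {
		// The common 1PC case pays only the cost of a timer, not a goroutine.
		fmt.Println("batch finished first; heartbeat loop never started")
	}
}

func main() {
	sendBatch(func() { time.Sleep(10 * time.Millisecond) }) // a fast 1PC batch
}
```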

Also, if it turns out not to be cheap enough, we may be able to do something with the timestamp cache to make the batcheval.HasTxnRecord check cheaper. For instance, if we bumped the timestamp cache on txn record creation instead of removal, maybe (*Replica).CanCreateTxnRecord would be all we need to check in (*Replica).canAttempt1PCEvaluation.

@erikgrinaker
Contributor

erikgrinaker commented Jul 4, 2021

This is an interesting idea. A slight variation of it would be to keep the responsibility for splitting a batch in the DistSender, but to inform the DistSender of when a BatchRequest is sent for a txn that is not yet heartbeating - which we actually already do with EndTxn.TxnHeartbeating. We could then check this flag when we see a batch that needs to be split with an EndTxn request. In such cases, we would throw an error. We would then catch this error in the txnHeartbeater, kick off the heartbeat loop, flip the flag to true, and re-issue the batch.

Yeah, that's an interesting variation, thanks! I'll try a few different approaches and see what works.

I think it would also be worth doing an experiment on how expensive it would be to always assume a transaction could have sent a heartbeat and always set EndTxn.TxnHeartbeating to true, which mandates a read of the transaction record here for 1PC txns. If this is cheap enough, maybe we should consider kicking off the transaction heartbeat loop unconditionally.

I really like this suggestion, didn't realize that was an option (since we'd presumably have tried it before and chosen to disable heartbeats instead). Will give it a shot.

@erikgrinaker
Contributor

erikgrinaker commented Jul 5, 2021

I think it would also be worth doing an experiment on how expensive it would be to always assume a transaction could have sent a heartbeat and always set EndTxn.TxnHeartbeating to true, which mandates a read of the transaction record here for 1PC txns. If this is cheap enough, maybe we should consider kicking off the transaction heartbeat loop unconditionally.

I really like this suggestion, didn't realize that was an option (since we'd presumably have tried it before and chosen to disable heartbeats instead). Will give it a shot.

Initial results are promising; on kv95/enc=false/nodes=3/cpu=32 it actually showed a ~5% improvement (i.e. the cost is likely negligible):

name                           old ops/sec  new ops/sec  delta
kv95/enc=false/nodes=3/cpu=32    124k ± 7%    131k ± 2%  +5.56%  (p=0.002 n=10+8)

name                           old p50      new p50      delta
kv95/enc=false/nodes=3/cpu=32    1.00 ± 0%    1.00 ± 0%    ~     (all equal)

name                           old p95      new p95      delta
kv95/enc=false/nodes=3/cpu=32    5.02 ±10%    4.70 ± 0%  -6.37%  (p=0.003 n=10+8)

name                           old p99      new p99      delta
kv95/enc=false/nodes=3/cpu=32    9.79 ± 7%    9.21 ± 3%  -5.90%  (p=0.015 n=10+8)

I'm going to verify this and do some more benchmarks (notably with kv0), but it looks like this may be viable.

@erikgrinaker
Contributor

Doesn't seem to have any significant effect ("new" always starts the heartbeat loop):

name                          old ops/sec  new ops/sec  delta
ycsb/A/nodes=3/cpu=32          27.7k ± 4%   28.1k ± 4%   ~     (p=0.315 n=9+10)
kv0/enc=false/nodes=1/cpu=32   24.5k ± 8%   24.7k ± 3%   ~     (p=0.796 n=9+9)
kv0/enc=false/nodes=3/cpu=32   36.7k ± 4%   36.5k ± 4%   ~     (p=0.529 n=10+10)

name                          old p50      new p50      delta
ycsb/A/nodes=3/cpu=32           2.44 ± 2%    2.45 ± 2%   ~     (p=1.000 n=10+10)
kv0/enc=false/nodes=1/cpu=32    2.50 ± 0%    2.50 ± 0%   ~     (all equal)
kv0/enc=false/nodes=3/cpu=32    4.70 ± 0%    4.70 ± 0%   ~     (all equal)

name                          old p95      new p95      delta
ycsb/A/nodes=3/cpu=32           9.10 ± 8%    9.20 ± 3%   ~     (p=0.720 n=10+10)
kv0/enc=false/nodes=1/cpu=32    4.79 ± 9%    4.70 ± 0%   ~     (p=0.350 n=9+7)
kv0/enc=false/nodes=3/cpu=32    10.7 ± 3%    10.8 ± 7%   ~     (p=1.000 n=10+10)

name                          old p99      new p99      delta
ycsb/A/nodes=3/cpu=32           78.0 ± 9%    75.5 ± 6%   ~     (p=0.305 n=10+10)
kv0/enc=false/nodes=1/cpu=32    6.90 ±10%    6.87 ± 6%   ~     (p=1.000 n=9+9)
kv0/enc=false/nodes=3/cpu=32    14.4 ± 9%    14.4 ± 6%   ~     (p=0.707 n=10+9)

@tbg On the issue, you said "starting a HB loop for an actual 1PC txn has some thorny edge cases"; I'd like to know more about this.

@nvanbenschoten
Member

I really like this suggestion, didn't realize that was an option (since we'd presumably have tried it before and chosen to disable heartbeats instead). Will give it a shot.

I think the progression was the other way around, which may be why we never tested this option thoroughly. For the longest time, it wasn't possible for a transaction record to exist for a 1PC transaction. This changed when we introduced unreplicated locks, like the ones we acquire during the initial row scan of an UPDATE. With their addition, we "revealed" the existence of a transaction that could eventually perform a one-phase commit to other transactions earlier, so they would try to abort it. We then had to start the heartbeat loop for these txns. After that, we realized it was a serious bug for a transaction with a txn record to perform a one-phase commit. But we also didn't want to pessimize existing 1PC txns, and we assumed the transaction record read would be too expensive. So we did our best to perform this HasTxnRecord check only when necessary.

Your experiments indicate that this assumption may have been incorrect.

craig bot closed this as completed in 64c888f on Jul 26, 2021