kv: we should create txn records eagerly on retries #57042

Closed
andreimatei opened this issue Nov 24, 2020 · 12 comments · Fixed by #67215
Assignees
andreimatei
Labels
A-kv-transactions (Relating to MVCC and the transactional model), C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), N-followup (Needs followup), O-postmortem (Originated from a Postmortem action item), T-kv (KV Team)

Comments

@andreimatei
Contributor

andreimatei commented Nov 24, 2020

In https://github.com/cockroachlabs/support/issues/674 we see some contending queries thrashing and constantly retrying. These queries are delete from system.jobs where id=any(...list of 1000 ids), and we attempt to run them as 1PC (one-phase commit) transactions. The situation would probably be greatly improved if these queries had a txn record, so they wouldn't constantly push/abort each other.

We already have code that splits the EndTxn off from the rest of the batch at the DistSender level when epoch > 0. We should do something similar for the heartbeat interceptor and start heartbeating on the first write when epoch > 0. We need to take care to set the EndTxn.TxnHeartbeating flag.
We probably also want to cover the TransactionAbortedError cases with this policy, so one thing we've discussed is finding a way to start transactions resulting from a TransactionAbortedError at a higher epoch: basically, use the epoch as a retry counter across retry types. A sketch of this policy follows below.
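
For illustration only, here is a minimal sketch of that policy under stated assumptions (txnState, onRetry, and onFirstWrite are hypothetical stand-ins, not the actual kvcoord interceptors): treat the epoch as a retry counter and heartbeat eagerly from the first write whenever epoch > 0.

```go
// Hypothetical sketch of the policy above; none of these types exist in the
// real kvcoord code.
package main

import "fmt"

type txnState struct {
	epoch        int
	heartbeating bool
}

// onRetry bumps the epoch for any kind of retry (including, per the proposal,
// retries caused by a TransactionAbortedError), so a retried attempt is
// recognizable as "not the first try".
func (t *txnState) onRetry() { t.epoch++ }

// onFirstWrite starts the heartbeat loop eagerly on retried attempts.
func (t *txnState) onFirstWrite() {
	if t.epoch > 0 && !t.heartbeating {
		t.heartbeating = true // the real change would also set EndTxn.TxnHeartbeating
		fmt.Println("starting heartbeat loop eagerly at epoch", t.epoch)
	}
}

func main() {
	var t txnState
	t.onFirstWrite() // epoch 0: still aim for 1PC, no heartbeat loop
	t.onRetry()      // retry, e.g. after an abort or a retryable error
	t.onFirstWrite() // epoch 1: heartbeat from the first write
}
```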

cc @nvanbenschoten

Epic: CRDB-8282

gz#9005

andreimatei added the C-bug and A-kv-transactions labels Nov 24, 2020
andreimatei self-assigned this Nov 24, 2020
tbg added the O-postmortem label May 20, 2021
lunevalex added the N-followup label May 28, 2021
jlinder added the T-kv label Jun 16, 2021
@tbg
Member

tbg commented Jun 22, 2021

I wrote this on an internal support issue (https://github.com/cockroachlabs/support/issues/1019#issuecomment-865923066) and thought I'd replicate it here:


In a perfect world, once the STAGING record is written, we'd quickly start the heartbeat loop. But this is not how it works with the current architecture, and it also wouldn't be enough. Say you have the whole txn in a single batch, Put(a), Put(b), Commit(a), and there's a range split between a and b. TxnCoordSender won't start the heartbeat loop, as we discussed, but at the DistSender level we actually send two batches, Put(a)+Staging(a) and Put(b), in parallel, not one.

Now consider two cases separately: when there's a lot of contention on a xor on b.

contention on a: the first batch will basically spend a lot of time in contention handling. So it does not actually "ever" write the staging txn record. In the meantime, the second batch easily gets through. Someone might try to push the intent at b; after a few seconds they'll just go and write an ABORTED txn record for our txn. Boom.

contention on b: this time the staging record shows up right away. Someone might contend on a. They see our staging record, but nobody is heartbeating it, so after a few seconds they abort us by invalidating our planned write on b, which is stuck in contention. All the while, the DistSender has gotten a response from the Put(a)+Staging(a) batch, but it cannot return up the stack to the TxnCoordSender (which owns starting the heartbeat loop) until the other batch has also come back, which won't happen in time.
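
To make these scenarios concrete, here is a minimal standalone sketch (hypothetical types and delays, not the actual DistSender/TxnCoordSender code) showing that the two sub-batches are sent in parallel and that control only returns to the layer that owns the heartbeat loop once both have come back:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// subBatch stands in for one of the partial batches sent by the DistSender;
// delay stands in for time spent in contention handling.
type subBatch struct {
	name  string
	delay time.Duration
}

// sendParallel mimics the DistSender fanning out sub-batches concurrently and
// only returning once all of them have finished.
func sendParallel(batches []subBatch) {
	var wg sync.WaitGroup
	for _, b := range batches {
		wg.Add(1)
		go func(b subBatch) {
			defer wg.Done()
			time.Sleep(b.delay)
			fmt.Printf("%s done after %s\n", b.name, b.delay)
		}(b)
	}
	wg.Wait() // nothing returns to the TxnCoordSender (heartbeat owner) before this
}

func main() {
	sendParallel([]subBatch{
		{name: "Put(a)+Staging(a)", delay: 10 * time.Millisecond},
		{name: "Put(b)", delay: 50 * time.Millisecond}, // e.g. contention on b
	})
	fmt.Println("only now could the heartbeat loop start")
}
```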

@tbg
Member

tbg commented Jun 22, 2021

Now in a "truly ideal" world, we'd know "exactly" (to the degree that this is possible on the kvcoord without talking to kvserver) when to start the hb loop, no? We want to start the heartbeat loop when we split the batch in DistSender (or when we know for other reasons that this won't be a 1PC txn, etc). The problem is really that the hb loop is owned by TxnCoordSender, and it doesn't know what DistSender is doing.

It seems that in the current proposal we are trying, maybe a bit too hard, to bend over backwards to accommodate the way things are currently set up. In a sense, if we squint a bit and say that DistSender is really the bottom of the TxnCoordSender stack, aren't we seeing here that txnHeartbeater should be at the bottom (or at least under DistSender)? That way it could see the individual batches sent by DistSender, and its "omit the heartbeat loop if everything is in one batch" heuristic would be more appropriate; in fact, wouldn't it be optimal? If it omitted the heartbeat loop and the batch wasn't 1PC, the scenarios above where some intents are blocked and the txn becomes vulnerable don't apply (since, if nothing else, we are sure that the batch hits a single range only, so its intents become visible in a single Raft command). We'd start the heartbeat loop only once the batch was written (in one go, on a single range), but that's fine.

@sumeerbhola
Collaborator

contention on a: the first batch will basically spend a lot of time in contention handling. So it does not actually "ever" write the staging txn record. ...

It sounds like we are still vulnerable if the range where the txn record needs to be written also has other writes that are contended, while a different range with writes has no contention. On retry, we should also be splitting off the txn record write into its own batch so that it is written without contention; otherwise it would repeat the same cycle.

@tbg
Member

tbg commented Jun 22, 2021

No: if we start a heartbeat loop, that will create the txn record even if the batch that wants to make it STAGING is held up.

@tbg
Member

tbg commented Jun 22, 2021

(And if we do something like #57042 (comment), then the only case in which we don't start the hb loop is the one where the RPC actually hits a single range, so intents will never be visible before the txn record.)

@erikgrinaker
Contributor

aren't we seeing here that txnHeartbeater should be at the bottom (or at least under DistSender)?

It needs to be above some of the other interceptors to process EndTxn(commit=false) requests that it sends when finding an aborted record. In particular, it needs the txnPipeliner to attach the lock spans.

@erikgrinaker
Contributor

erikgrinaker commented Jul 1, 2021

I've looked into the DistSender/TxnCoordSender coordination here a bit. I think the best approach would be to have DistSender.Send() return a typed error, similar to errNo1PCTxn, when the 1PC batch needs to be split, and move the responsibility for splitting the batch up the stack into the TxnCoordSender (probably the txnCommitter interceptor). This would allow it to kick off the heartbeat loop without coupling the DistSender and TxnCoordSender.

If this doesn't work out for whatever reason, plan B is to use a sync.Map on the DistSender containing channels keyed by txn ID, and close the registered channel (if any) whenever the DistSender splits a batch. We would have some DistSender mechanism (e.g. a function) to register split triggers, and inject this into the TxnCoordSenderFactory. This would avoid tight coupling and Sender interface changes, but it does suggest that we've gotten our abstraction boundaries wrong, so I'd prefer to avoid it if possible.
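
For illustration, a minimal sketch of the plan-B mechanism under simplified assumptions (txn IDs as plain strings and a hypothetical splitNotifier standing in for the hook injected into the TxnCoordSenderFactory; this is not the real DistSender code):

```go
package main

import (
	"fmt"
	"sync"
)

// splitNotifier would live on the DistSender.
type splitNotifier struct {
	m sync.Map // txn ID -> chan struct{}
}

// register is called from the TxnCoordSender side before sending a batch.
func (n *splitNotifier) register(txnID string) <-chan struct{} {
	ch := make(chan struct{})
	n.m.Store(txnID, ch)
	return ch
}

// onSplit is called by the DistSender when it splits a batch for this txn.
func (n *splitNotifier) onSplit(txnID string) {
	if ch, ok := n.m.LoadAndDelete(txnID); ok {
		close(ch.(chan struct{})) // notify the registered waiter exactly once
	}
}

func main() {
	var n splitNotifier
	split := n.register("txn-1")

	go n.onSplit("txn-1") // the DistSender decides it has to split the batch

	<-split // the txnHeartbeater learns about the split
	fmt.Println("batch was split; start the heartbeat loop")
}
```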

@nvanbenschoten
Member

I've looked into the DistSender/TxnCoordSender coordination here a bit. I think the best approach would be to have DistSender.Send() return a typed error, similar to errNo1PCTxn, when the 1PC batch needs to be split, and move the responsibility for splitting the batch up the stack into the TxnCoordSender (probably the txnCommitter interceptor). This would allow it to kick off the heartbeat loop without coupling the DistSender and TxnCoordSender.

This is an interesting idea. A slight variation of it would be to keep the responsibility for splitting a batch in the DistSender, but to inform the DistSender of when a BatchRequest is sent for a txn that is not yet heartbeating - which we actually already do with EndTxn.TxnHeartbeating. We could then check this flag when we see a batch that needs to be split with an EndTxn request. In such cases, we would throw an error. We would then catch this error in the txnHeartbeater, kick off the heartbeat loop, flip the flag to true, and re-issue the batch.
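
A rough sketch of that flow, with a hypothetical sentinel error and simplified types (not the actual kvcoord code), might look like this:

```go
package main

import (
	"errors"
	"fmt"
)

// errMustSplitNonHeartbeating is a hypothetical sentinel, in the spirit of
// errNo1PCTxn, returned when a committing batch would have to be split for a
// txn that is not yet heartbeating.
var errMustSplitNonHeartbeating = errors.New("batch must be split, but txn is not heartbeating")

type batch struct {
	spansMultipleRanges bool // would normally be discovered via range descriptors
	txnHeartbeating     bool // stands in for EndTxn.TxnHeartbeating
}

// distSenderSend refuses to split a committing batch for a non-heartbeating txn.
func distSenderSend(b batch) error {
	if b.spansMultipleRanges && !b.txnHeartbeating {
		return errMustSplitNonHeartbeating
	}
	return nil // sent (and possibly split) successfully
}

// txnHeartbeaterSend catches the error, starts the heartbeat loop, flips the
// flag, and re-issues the batch.
func txnHeartbeaterSend(b batch) error {
	err := distSenderSend(b)
	if errors.Is(err, errMustSplitNonHeartbeating) {
		fmt.Println("starting heartbeat loop")
		b.txnHeartbeating = true
		return distSenderSend(b)
	}
	return err
}

func main() {
	fmt.Println("re-issued send error:", txnHeartbeaterSend(batch{spansMultipleRanges: true}))
}
```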

This may have an impact on performance, so we'd want to check. The most trivial workload that would be impacted by this back-and-forth would be single-row inserts into a table with two indexes: kv --read-percent=0 --secondary-index.


I think it would also be worth doing an experiment on how expensive it would be to always assume a transaction could have sent a heartbeat and always set EndTxn.TxnHeartbeating to true, which mandates a read of the transaction record here for 1PC txns. If this is cheap enough, maybe we should consider kicking off the transaction heartbeat loop unconditionally. I'm sure we could optimize the client side of things to make this cheap in the vastly common case where the batch returns before the first heartbeat is fired. For instance, we could defer the txn heartbeat goroutine creation for DefaultTxnHeartbeatInterval by handing the entire thing to time.AfterFunc and then cancelling the timer on the way back up.
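
A minimal sketch of that deferral using only the standard library (heartbeatInterval, runHeartbeatLoop, and sendBatch are illustrative names, not the real txnHeartbeater code):

```go
package main

import (
	"fmt"
	"time"
)

const heartbeatInterval = 1 * time.Second // stand-in for DefaultTxnHeartbeatInterval

func runHeartbeatLoop() {
	// The periodic HeartbeatTxn requests would be issued from here.
	fmt.Println("heartbeat loop started")
}

// sendBatch schedules the heartbeat loop with time.AfterFunc and cancels the
// timer if the batch returns before the first heartbeat would fire.
func sendBatch(work func()) {
	t := time.AfterFunc(heartbeatInterval, runHeartbeatLoop)
	work()
	if t.Stop() {
		// The common 1PC case pays only the cost of a timer, not a goroutine.
		fmt.Println("batch finished first; heartbeat loop never started")
	}
}

func main() {
	sendBatch(func() { time.Sleep(10 * time.Millisecond) }) // a fast 1PC batch
}
```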

Also, if it turns out not to be cheap enough, we may be able to do something with the timestamp cache to make the batcheval.HasTxnRecord check cheaper. For instance, if we bumped the timestamp cache on txn record creation instead of removal, maybe (*Replica).CanCreateTxnRecord would be all we need to check in (*Replica).canAttempt1PCEvaluation.

@erikgrinaker
Contributor

erikgrinaker commented Jul 4, 2021

This is an interesting idea. A slight variation of it would be to keep the responsibility for splitting a batch in the DistSender, but to inform the DistSender of when a BatchRequest is sent for a txn that is not yet heartbeating - which we actually already do with EndTxn.TxnHeartbeating. We could then check this flag when we see a batch that needs to be split with an EndTxn request. In such cases, we would throw an error. We would then catch this error in the txnHeartbeater, kick off the heartbeat loop, flip the flag to true, and re-issue the batch.

Yeah, that's an interesting variation, thanks! I'll try a few different approaches and see what works.

I think it would also be worth doing an experiment on how expensive it would be to always assume a transaction could have sent a heartbeat and always set EndTxn.TxnHeartbeating to true, which mandates a read of the transaction record here for 1PC txns. If this is cheap enough, maybe we should consider kicking off the transaction heartbeat loop unconditionally.

I really like this suggestion, didn't realize that was an option (since we'd presumably have tried it before and chosen to disable heartbeats instead). Will give it a shot.

@erikgrinaker
Contributor

erikgrinaker commented Jul 5, 2021

I think it would also be worth doing an experiment on how expensive it would be to always assume a transaction could have sent a heartbeat and always set EndTxn.TxnHeartbeating to true, which mandates a read of the transaction record here for 1PC txns. If this is cheap enough, maybe we should consider kicking off the transaction heartbeat loop unconditionally.

I really like this suggestion, didn't realize that was an option (since we'd presumably have tried it before and chosen to disable heartbeats instead). Will give it a shot.

Initial results are promising; on kv95/enc=false/nodes=3/cpu=32 it actually showed a ~5% improvement (i.e. the cost is likely negligible):

name                           old ops/sec  new ops/sec  delta
kv95/enc=false/nodes=3/cpu=32    124k ± 7%    131k ± 2%  +5.56%  (p=0.002 n=10+8)

name                           old p50      new p50      delta
kv95/enc=false/nodes=3/cpu=32    1.00 ± 0%    1.00 ± 0%    ~     (all equal)

name                           old p95      new p95      delta
kv95/enc=false/nodes=3/cpu=32    5.02 ±10%    4.70 ± 0%  -6.37%  (p=0.003 n=10+8)

name                           old p99      new p99      delta
kv95/enc=false/nodes=3/cpu=32    9.79 ± 7%    9.21 ± 3%  -5.90%  (p=0.015 n=10+8)

I'm going to verify this and do some more benchmarks (notably with kv0), but it looks like this may be viable.

@erikgrinaker
Contributor

Doesn't seem to have any significant effect ("new" always starts the heartbeat loop):

name                          old ops/sec  new ops/sec  delta
ycsb/A/nodes=3/cpu=32          27.7k ± 4%   28.1k ± 4%   ~     (p=0.315 n=9+10)
kv0/enc=false/nodes=1/cpu=32   24.5k ± 8%   24.7k ± 3%   ~     (p=0.796 n=9+9)
kv0/enc=false/nodes=3/cpu=32   36.7k ± 4%   36.5k ± 4%   ~     (p=0.529 n=10+10)

name                          old p50      new p50      delta
ycsb/A/nodes=3/cpu=32           2.44 ± 2%    2.45 ± 2%   ~     (p=1.000 n=10+10)
kv0/enc=false/nodes=1/cpu=32    2.50 ± 0%    2.50 ± 0%   ~     (all equal)
kv0/enc=false/nodes=3/cpu=32    4.70 ± 0%    4.70 ± 0%   ~     (all equal)

name                          old p95      new p95      delta
ycsb/A/nodes=3/cpu=32           9.10 ± 8%    9.20 ± 3%   ~     (p=0.720 n=10+10)
kv0/enc=false/nodes=1/cpu=32    4.79 ± 9%    4.70 ± 0%   ~     (p=0.350 n=9+7)
kv0/enc=false/nodes=3/cpu=32    10.7 ± 3%    10.8 ± 7%   ~     (p=1.000 n=10+10)

name                          old p99      new p99      delta
ycsb/A/nodes=3/cpu=32           78.0 ± 9%    75.5 ± 6%   ~     (p=0.305 n=10+10)
kv0/enc=false/nodes=1/cpu=32    6.90 ±10%    6.87 ± 6%   ~     (p=1.000 n=9+9)
kv0/enc=false/nodes=3/cpu=32    14.4 ± 9%    14.4 ± 6%   ~     (p=0.707 n=10+9)

@tbg On the issue, you said "starting a HB loop for an actual 1PC txn has some thorny edge cases"; I'd like to know more about this.

@nvanbenschoten
Member

I really like this suggestion, didn't realize that was an option (since we'd presumably have tried it before and chosen to disable heartbeats instead). Will give it a shot.

I think the progression was the other way around, which may be why we never tested this option thoroughly. For the longest time, it wasn't possible for a transaction record to exist for a 1PC transaction. This changed when we introduced unreplicated locks, like the ones we acquire during the initial row scan of an UPDATE. With their addition, we "revealed" the existence of a transaction that could eventually perform a one-phase commit to other transactions earlier, so they would try to abort it. We then had to start the heartbeat loop for these txns. After that, we realized it was a serious bug for a transaction with a txn record to perform a one-phase commit. But we also didn't want to pessimize existing 1PC txns, and we assumed the transaction record read would be too expensive. So we did our best to perform this HasTxnRecord check only when necessary.

Your experiments indicate that this assumption may have been incorrect.

craig bot closed this as completed in 64c888f on Jul 26, 2021