-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: Don't heartbeat transactions that are lacking an anchor key #26765
Conversation
When running TPC-C 10k on a 30 node cluster without partitioning, range 1 was receiving thousands of qps while all other ranges were receiving no more than low hundreds of qps (more details in cockroachdb#26608. Part of it was context cancellations causing range descriptors to be evicted from the range cache (cockroachdb#26764), but an even bigger part of it was HeartbeatTxns being sent for transactions with no anchor key, accounting for thousands of QPS even after cockroachdb#26764 was fixed. This causes the same outcome as the old code without the load, because without this change we'd just send the request and get back a REASON_TXN_NOT_FOUND error, which would cause the function to return true. It's possible that we should instead avoid the heartbeat loop at all for transactions without a key, or that we should put in more effort to prevent such requests from even counting as transactions (a la cockroachdb#26741, which perhaps makes this change unnecessary?). Advice would be great. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last time I looked at this code the heartbeat loop wouldn’t get started unless the txn had a key assigned. One of the recent refactoring a must have changed something...? But I just looked at the complete source and the txn heartbeat loop doesn’t start without a begin transaction request. That request can’t be sent without a key obviously. Might make sense to dig a little deeper and figure out how the loop is getting started without a key. Also, are we sure this is the only place that a heartbeat RPC is invoked?
It's pretty surprising to me that a heartbeat loop is being started without the txn having a |
@a-robinson I'm sure this is a great find, but you could you detail a little more what you're seeing? As Spencer says, we should only be starting a hb loop after a Review status: complete! 0 of 0 LGTMs obtained Comments from Reviewable |
Unfortunately I don't have information on how we were getting here, and I shut my large cluster off for the weekend so getting more information will require some setup work. It's very possible that running the tpcc workload will reproduce it on much smaller clusters, though. I'll follow up with @andreimatei offline. I don't have much to add other than repro steps, though.
Yes, I did confirm that. And I confirmed that this stopped the thousands of QPS of heartbeats to r1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what the problem is. We start the heartbeat loop when we see a batch which has a BeginTransaction
. However, the TxnCoordSender
still has an old copy of the Transaction
object without a key, which is what's used to do the heartbeat. The TxnCoordSender
's copy of the Transaction
isn't updated (and thus given the correct Key
) until after the batch with the BeginTransaction
completes. In that time, we could start the heartbeat.
Like @andreimatei said, this is another instance of the sort of problems you run into when you've got two copies of the Transaction
to maintain. I actually think this PR is spot on; just ignore the heartbeat if Key == nil
. In other words, we may have started the heartbeat loop, but we don't actually heartbeat until after the batch with the BeginTransaction
has returned success.
@@ -1086,6 +1086,10 @@ func (tc *TxnCoordSender) heartbeat(ctx context.Context) bool { | |||
return false | |||
} | |||
|
|||
if txn.Key == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want to add a comment here explaining why this might be nil:
// txn.Key may be nil here if the batch containing the `BeginTransaction` has not
// yet returned successfully. Just skip the heartbeat in this case and wait for the
// next heartbeat loop tick.
Mmmm I'd rather do the opposite - make it so that the key is set on every copy of that damn proto. Review status: complete! 0 of 0 LGTMs obtained Comments from Reviewable |
I've sent out #26811, which is my preferred thing.
…On Mon, Jun 18, 2018 at 11:52 AM, Andrei Matei ***@***.***> wrote:
Mmmm I'd rather do the opposite - make it so that the key is set on every
copy of that damn proto.
I think starting the heartbeat loop early, before we're sure that there's
a txn record written, is generally a good thing. For example, if the
BeginTxn gets a retriable error but intents might have been laid down, we
want the heartbeating. Also we've talked about making the BeginTxn best
effort and allowing the heartbeat loop to create the txn record.
So let me try to fix it by updating the TCS's proto on a request's way out.
------------------------------
Review status: [image: ] complete! 0 of 0 LGTMs obtained
------------------------------
*Comments from Reviewable
<https://reviewable.io/reviews/cockroachdb/cockroach/26765#-:-LFIYVcVCEplRTG_FB0Q:b-6w44g9>*
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#26765 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAXBcadzZR5t1gj-32NU-ymLFl-Jrh6Bks5t98zIgaJpZM4UqWs1>
.
|
I think we should do both this and #26811: make this change first, and backport it to 2.0. Then we can merge #26811 and change the no-op here to an assertion. Review status: complete! 0 of 0 LGTMs obtained (and 1 stale) Comments from Reviewable |
I think this issue was introduced in #25541, so we shouldn't need to backport. @andreimatei please correct me if I'm wrong. |
Yes, the troubles were new. Nothing needs to be back-ported. I'd drop this PR. |
Dropping it works for me given that #26811 has an assertion against heartbeating an empty key. Thanks for the proper fix, @andreimatei. |
When running TPC-C 10k on a 30 node cluster without partitioning, range
1 was receiving thousands of qps while all other ranges were receiving
no more than low hundreds of qps (more details in #26608. Part of it was
context cancellations causing range descriptors to be evicted from the
range cache (#26764), but an even bigger part of it was HeartbeatTxns
being sent for transactions with no anchor key, accounting for thousands
of QPS even after #26764 was fixed.
This causes the same outcome as the old code without the load, because
without this change we'd just send the request and get back a
REASON_TXN_NOT_FOUND error, which would cause the function to return
true.
It's possible that we should instead avoid the heartbeat loop at all for
transactions without a key, or that we should put in more effort to
prevent such requests from even counting as transactions (a la #26741,
which perhaps makes this change unnecessary?). Advice would be great.
Release note: None