Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: Don't heartbeat transactions that are lacking an anchor key #26765

Closed

Conversation

a-robinson
Copy link
Contributor

When running TPC-C 10k on a 30 node cluster without partitioning, range
1 was receiving thousands of qps while all other ranges were receiving
no more than low hundreds of qps (more details in #26608. Part of it was
context cancellations causing range descriptors to be evicted from the
range cache (#26764), but an even bigger part of it was HeartbeatTxns
being sent for transactions with no anchor key, accounting for thousands
of QPS even after #26764 was fixed.

This causes the same outcome as the old code without the load, because
without this change we'd just send the request and get back a
REASON_TXN_NOT_FOUND error, which would cause the function to return
true.

It's possible that we should instead avoid the heartbeat loop at all for
transactions without a key, or that we should put in more effort to
prevent such requests from even counting as transactions (a la #26741,
which perhaps makes this change unnecessary?). Advice would be great.

Release note: None

When running TPC-C 10k on a 30 node cluster without partitioning, range
1 was receiving thousands of qps while all other ranges were receiving
no more than low hundreds of qps (more details in cockroachdb#26608. Part of it was
context cancellations causing range descriptors to be evicted from the
range cache (cockroachdb#26764), but an even bigger part of it was HeartbeatTxns
being sent for transactions with no anchor key, accounting for thousands
of QPS even after cockroachdb#26764 was fixed.

This causes the same outcome as the old code without the load, because
without this change we'd just send the request and get back a
REASON_TXN_NOT_FOUND error, which would cause the function to return
true.

It's possible that we should instead avoid the heartbeat loop at all for
transactions without a key, or that we should put in more effort to
prevent such requests from even counting as transactions (a la cockroachdb#26741,
which perhaps makes this change unnecessary?). Advice would be great.

Release note: None
@a-robinson a-robinson requested review from andreimatei, nvanbenschoten and a team June 16, 2018 04:54
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@spencerkimball spencerkimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last time I looked at this code the heartbeat loop wouldn’t get started unless the txn had a key assigned. One of the recent refactoring a must have changed something...? But I just looked at the complete source and the txn heartbeat loop doesn’t start without a begin transaction request. That request can’t be sent without a key obviously. Might make sense to dig a little deeper and figure out how the loop is getting started without a key. Also, are we sure this is the only place that a heartbeat RPC is invoked?

@nvanbenschoten
Copy link
Member

It's pretty surprising to me that a heartbeat loop is being started without the txn having a Key set. It's possible that we're seeing a similar broken interaction to the one that prompted this logic, which we really should remove. This area of the code has been pretty in-flux for the past few weeks, so I'll defer to @andreimatei.

@andreimatei
Copy link
Contributor

@a-robinson I'm sure this is a great find, but you could you detail a little more what you're seeing? As Spencer says, we should only be starting a hb loop after a BeginTxn request, and at that point there should definitely be a key set. Now, as Nathan says, it could be that the key "is set", but set on the wrong copy of the proto (the fact that "the client" has multiple copies is terrible and one of the main reasons I'm currently munging this code).
If you could put some assertions around how that heartbeat is starting, maybe that'll help us track it down. I can also take it over if you prefer, if you robot me up and tell me what's what.


Review status: :shipit: complete! 0 of 0 LGTMs obtained


Comments from Reviewable

@a-robinson
Copy link
Contributor Author

Unfortunately I don't have information on how we were getting here, and I shut my large cluster off for the weekend so getting more information will require some setup work. It's very possible that running the tpcc workload will reproduce it on much smaller clusters, though. I'll follow up with @andreimatei offline. I don't have much to add other than repro steps, though.

Also, are we sure this is the only place that a heartbeat RPC is invoked?

Yes, I did confirm that. And I confirmed that this stopped the thousands of QPS of heartbeats to r1.

Copy link
Member

@spencerkimball spencerkimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what the problem is. We start the heartbeat loop when we see a batch which has a BeginTransaction. However, the TxnCoordSender still has an old copy of the Transaction object without a key, which is what's used to do the heartbeat. The TxnCoordSender's copy of the Transaction isn't updated (and thus given the correct Key) until after the batch with the BeginTransaction completes. In that time, we could start the heartbeat.

Like @andreimatei said, this is another instance of the sort of problems you run into when you've got two copies of the Transaction to maintain. I actually think this PR is spot on; just ignore the heartbeat if Key == nil. In other words, we may have started the heartbeat loop, but we don't actually heartbeat until after the batch with the BeginTransaction has returned success.

@@ -1086,6 +1086,10 @@ func (tc *TxnCoordSender) heartbeat(ctx context.Context) bool {
return false
}

if txn.Key == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to add a comment here explaining why this might be nil:

  // txn.Key may be nil here if the batch containing the `BeginTransaction` has not
  // yet returned successfully. Just skip the heartbeat in this case and wait for the
  // next heartbeat loop tick.

@andreimatei
Copy link
Contributor

Mmmm I'd rather do the opposite - make it so that the key is set on every copy of that damn proto.
I think starting the heartbeat loop early, before we're sure that there's a txn record written, is generally a good thing. For example, if the BeginTxn gets a retriable error but intents might have been laid down, we want the heartbeating. Also we've talked about making the BeginTxn best effort and allowing the heartbeat loop to create the txn record.
So let me try to fix it by updating the TCS's proto on a request's way out.


Review status: :shipit: complete! 0 of 0 LGTMs obtained


Comments from Reviewable

@andreimatei
Copy link
Contributor

andreimatei commented Jun 18, 2018 via email

@bdarnell
Copy link
Contributor

:lgtm: I think we should do both this and #26811: make this change first, and backport it to 2.0. Then we can merge #26811 and change the no-op here to an assertion.


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


Comments from Reviewable

@nvanbenschoten
Copy link
Member

I think this issue was introduced in #25541, so we shouldn't need to backport. @andreimatei please correct me if I'm wrong.

@andreimatei
Copy link
Contributor

Yes, the troubles were new. Nothing needs to be back-ported. I'd drop this PR.

@a-robinson
Copy link
Contributor Author

Dropping it works for me given that #26811 has an assertion against heartbeating an empty key. Thanks for the proper fix, @andreimatei.

@a-robinson a-robinson closed this Jun 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants