-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: pipeline transactional writes #26599
kv: pipeline transactional writes #26599
Conversation
d4d02c7
to
ef17628
Compare
This change simplifies the process of releasing quota back into the quota pool on failed proposals. It does so by pushing all quota management into Raft instead of having the request's goroutine try to determine whether it needs to release quota based on the error in the proposal result. Quota management used to only be handled by Raft for successful proposals. In doing so, the change removes a case where quota could be double freed: proposals that apply and then return an error. This isn't particularly important though, because this pattern is never used anymore. Besides being a simplification, this is important for cockroachdb#26599 because transaction pipelining depends on requests pushing proposals into Raft and then returning immediately. In that model (asyncConsensus) the request's goroutine is not around to release quota resources on failed proposals. Release note: None
e89c2d1
to
f6ebac7
Compare
f6ebac7
to
2e6b47b
Compare
2e6b47b
to
b77ce75
Compare
This is now rebased on top of #27066. All commits other than the first are part of this PR and it is ready to be reviewed! I'm starting to perform a more in-depth performance comparison with this change. It's pretty clear that this is beneficial for any geo-distributed cluster where consensus latency dominates all other costs. However, until now it hasn't been clear how this will fare on more standard deployments. I went through each workload in our workload generator and performed a series of benchmarks where I turned pipelining on and off using the new Note: all tests were run with n1-standard-16 machines in us-east-1b kvNon-transactional, not affected. bankNon-transactional, not affected tpcc1 node with fks3 nodes with fksledger3 nodesNote: the periodic dips (every 70 seconds) corresponded exactly with RocksDB compactions and were present with and without pipelining As the various benchmark runs show, even without a distributed cluster with high consensus round-trip latency, transactional pipelining still significantly improves throughput. For tpcc, this improvement in throughput was around 35%. For ledger, it was around 30%. This is almost certainly because it allows all syncs to disk in a transaction to be pipelined. The next thing I want to test is the impact transactional pipelining has when run in a pathological deployment where a SQL gateway is very far away from the leaseholder but the leaseholder is able to establish a quick quorum. For instance, a SQL gateway could be in us-east but all replicas for a given table could be in us-west. In that case, transaction pipelining might be a pessimization because it requires an extra round trip to the leaseholder for each transaction when performing all QueryIntent requests immediately before committing. This will be the worst for transactions that only perform a single write without hitting the 1PC fast-path. It's unclear whether this will be noticeable and whether the extra round-trip will dominate the savings in latency elsewhere. |
b77ce75
to
b30f70f
Compare
These results look great. |
@benesch brought up a good point that in the case of a single write that doesn't hit the 1PC path the |
1349d93
to
990d5bc
Compare
This change simplifies the process of releasing quota back into the quota pool on failed proposals. It does so by pushing all quota management into Raft instead of having the request's goroutine try to determine whether it needs to release quota based on the error in the proposal result. Quota management used to only be handled by Raft for successful proposals. In doing so, the change removes a case where quota could be double freed: proposals that apply and then return an error. This isn't particularly important though, because this pattern is never used anymore. Besides being a simplification, this is important for cockroachdb#26599 because transaction pipelining depends on requests pushing proposals into Raft and then returning immediately. In that model (asyncConsensus) the request's goroutine is not around to release quota resources on failed proposals. Release note: None
990d5bc
to
2e80b45
Compare
efd99fa
to
b29fe50
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained
Gopkg.lock, line 13 at r16 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
I think there is additional churn in whatever commit r16 is. It disappears again in r18. Amended the wrong commit maybe?
I'm only seeing this file 5de54ff7d4c9cff805c0b029c11765d3feb63ae3. Perhaps reviewable is showing the result of a rebase where I fixed some churn in here due to an old dep
version?
pkg/kv/txn_interceptor_pipeliner.go, line 33 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
nit: settings descriptions (typically, we're not consistent) start with a lowercase letter
Done.
pkg/kv/txn_interceptor_pipeliner.go, line 102 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
"its"
Done.
pkg/kv/txn_interceptor_pipeliner.go, line 108 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
"its"
Done.
pkg/kv/txn_interceptor_pipeliner.go, line 111 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
"So far, none of these approaches have been integrated."
Done.
pkg/kv/txn_interceptor_pipeliner.go, line 225 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
This seems oddly unidiomatic. Is there some reason not to just do
ba.Requests = append([]RequestUnion(nil), ba.Requests...)
?I guess your way is a lot shorter. shrug
Done.
pkg/kv/txn_interceptor_pipeliner.go, line 283 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
"batch's"
Done.
pkg/kv/txn_interceptor_pipeliner.go, line 315 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Can this ever be false? That is, since we always use
QueryIntentRequest_RETURN_ERROR
shouldn't we always see either an error or FoundIntent=true? Asking mostly for my own understanding.
Your understanding is correct, this should never be false. Added an assertion.
pkg/kv/txn_interceptor_pipeliner.go, line 482 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Heh, this is clever but deep in the weeds. Curious if you benchmarked to determine that this was necessary.
I was trying to avoid having this change introduce a new allocation per write. Probably in the weeds,
pkg/kv/txn_interceptor_pipeliner.go, line 489 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Do you need to clear? Can't you just truncate? AIUI it's all going to be invisible to the GC anyway.
The GC doesn't have an understanding of which elements in a backing array aren't referenced by any slices. In other words, it doesn't make a distinction between len
and cap
. As long as any of the array is referenced, the entire array along with all of the objects that its element pointers point to will be kept in memory.
Here's a forum post discussing this. I feel like I've seen a more authoritative one as well, but couldn't easily find it.
EDIT: here it is!
pkg/kv/txn_interceptor_pipeliner.go, line 38 at r20 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Ditto here.
Done.
pkg/roachpb/batch.go, line 202 at r18 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
optional nit: stray
Done.
pkg/roachpb/batch.go, line 385 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
"other commands". or maybe "other requests" for consistency with language in the next sentence?
Pulled this all into #27309.
pkg/roachpb/data.proto, line 393 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Maybe: "the sequence number of the request that created this write"?
Done.
pkg/sql/distsql_running.go, line 397 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
"types"
Fixing in #27066.
pkg/storage/replica.go, line 3369 at r18 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Does this belong in an earlier commit?
Done.
pkg/storage/replica.go, line 3372 at r18 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
I think it's clearer if you don't detach:
if len(proposal.Local.EndTxns) != 0
. Totally up to you.(It made me nervous initially to see you detaching endTxns and throwing them away.)
Tried to make this change but ran into annoyances with proposal.Local.EndTxns
being a pointer. Pulled into the conditional block to make it clearer that we're not accidentally forgetting about the variable.
pkg/storage/replica.go, line 3376 at r18 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
"asynchronous consensus" or "consensus asynchronously"
Done.
pkg/storage/replica.go, line 3385 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Does this belong in an earlier commit?
Done.
pkg/storage/replica_test.go, line 9266 at r18 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
"AsyncConsensus"
Done.
pkg/storage/replica_test.go, line 9283 at r18 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Is this preferred to a
testContext
? Legitimately asking, but seems like atestContext
would save you a lot of trouble in finding the replica you need below.
Yep, thanks!
pkg/storage/replica_test.go, line 9317 at r18 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
"expected"
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 the only major question I have is about the testing of outstandingWriteAlloc. Everything else looks good to go.
Reviewed 32 of 32 files at r21, 2 of 2 files at r22, 2 of 2 files at r23, 9 of 9 files at r24, 18 of 18 files at r25, 5 of 5 files at r26.
Dismissed @bdarnell and @vivekmenezes from 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained
Gopkg.lock, line 13 at r16 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm only seeing this file 5de54ff7d4c9cff805c0b029c11765d3feb63ae3. Perhaps reviewable is showing the result of a rebase where I fixed some churn in here due to an old
dep
version?
Agreed. Seems OK now!
docs/generated/settings/settings.html, line 35 at r26 (raw file):
<tr><td><code>kv.transaction.max_intents_bytes</code></td><td>integer</td><td><code>256000</code></td><td>maximum number of bytes used to track write intents in transactions</td></tr> <tr><td><code>kv.transaction.max_refresh_spans_bytes</code></td><td>integer</td><td><code>256000</code></td><td>maximum number of bytes used to track refresh spans in serializable transactions</td></tr> <tr><td><code>kv.transaction.write_pipelining_enabled</code></td><td>boolean</td><td><code>true</code></td><td>if enabled, transactional writes are pipelined through Raft consensus</td></tr>
This line should be in an earlier commit if you happen to do another rebase.
pkg/kv/txn_interceptor_pipeliner.go, line 482 at r19 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I was trying to avoid having this change introduce a new allocation per write. Probably in the weeds,
Think you forgot to finish this thought. :)
My concern is that this logic seems entirely untested. I'd prefer that we either omit this complexity until proven necessary—though happy to defer to you if your experience suggests that this path is hot enough that every allocation counts, as my intuition here is spotty—or add a test that has a transaction with a massive write set.
pkg/kv/txn_interceptor_pipeliner.go, line 489 at r19 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The GC doesn't have an understanding of which elements in a backing array aren't referenced by any slices. In other words, it doesn't make a distinction between
len
andcap
. As long as any of the array is referenced, the entire array along with all of the objects that its element pointers point to will be kept in memory.Here's a forum post discussing this. I feel like I've seen a more authoritative one as well, but couldn't easily find it.
EDIT: here it is!
Oh, I see, there's a byte slice inside an outstandingWrite that you want to stop referencing. Sorry, thought it was all value types. Transactions are short-lived enough that I can't imagine it'd be particularly problematic to hold onto those byte slices until the end of the transaction in the worst case, but also the cost of zeroing this memory is probably negligible. Carry on.
pkg/kv/txn_interceptor_pipeliner.go, line 225 at r25 (raw file):
// We don't want to modify the batch's request slice directly, // so fork it before modifying it. Truncating the capacity will // force all future appends to re-allocate a new slice.
This comment is stale now. Drop the second sentence?
pkg/kv/txn_interceptor_pipeliner.go, line 33 at r26 (raw file):
var pipelinedWritesEnabled = settings.RegisterBoolSetting( "kv.transaction.write_pipelining_enabled", "if enabled, transactional writes are pipelined through Raft consensus",
This line should also be in an earlier commit if you happen to do another rebase. Seriously don't go out of your way.
pkg/kv/txn_interceptor_pipeliner_test.go, line 121 at r25 (raw file):
}) br, pErr := tp.SendLocked(context.Background(), ba)
Consider extracting this context, here and below, to the top of the test.
pkg/kv/txn_interceptor_pipeliner_test.go, line 175 at r25 (raw file):
br, pErr = tp.SendLocked(context.Background(), ba) require.NotNil(t, br) require.Equal(t, 4, len(br.Responses)) // QueryIntent response stripped
Consider checking the types of the four responses to make sure it's really the QueryIntent that gets stripped.
pkg/kv/txn_interceptor_pipeliner_test.go, line 238 at r25 (raw file):
txn := makeTxnProto() key, key2 := roachpb.Key("a"), roachpb.Key("c")
This code would be slightly easier to read if you named these variables keyA and keyC. No pressure.
pkg/kv/txn_interceptor_pipeliner_test.go, line 368 at r25 (raw file):
// chains onto the first outstanding write. tp.maybeInsertOutstandingWriteLocked(roachpb.Key("b"), 10) tp.maybeInsertOutstandingWriteLocked(roachpb.Key("d"), 11)
Please also test boundary conditions here! I.e., keys "a" (incl) and "c" (excl). That's been such a common source of bugs in my life recently.
pkg/kv/txn_interceptor_pipeliner_test.go, line 409 at r25 (raw file):
tp.epochBumpedLocked() require.Equal(t, 0, tp.outstandingWritesLen())
Theses tests are so readable. How did you sneak this require
package in so quietly? I've wanted something like this for so long. 🙌
pkg/storage/replica.go, line 3372 at r18 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Tried to make this change but ran into annoyances with
proposal.Local.EndTxns
being a pointer. Pulled into the conditional block to make it clearer that we're not accidentally forgetting about the variable.
Ah, nice. Thanks!
pkg/storage/engine/mvcc.go, line 1556 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. #27302.
Since #27302 was merged with @bdarnell's and @vivekmenzes's approvals, I've dismissed them from this discussion. I think that makes Reviewable hide this comment away. I think. Who really knows.
b29fe50
to
3a630e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
docs/generated/settings/settings.html, line 35 at r26 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
This line should be in an earlier commit if you happen to do another rebase.
I think it's in the right commit. With 26 revisions on reviewable it's hard to see but it's certainly in 9908cd8.
pkg/kv/txn_interceptor_pipeliner.go, line 482 at r19 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Think you forgot to finish this thought. :)
My concern is that this logic seems entirely untested. I'd prefer that we either omit this complexity until proven necessary—though happy to defer to you if your experience suggests that this path is hot enough that every allocation counts, as my intuition here is spotty—or add a test that has a transaction with a massive write set.
Yeah, that's a good point. We were never forcing more than one re-alloc in the tests. Added a new TestTxnPipelinerManyWrites
test that adds a lot more writes to stress this.
As to whether this will move the needle on perf in absolute terms, I doubt it. That said, I set a goal of making this class as allocation-friendly as possible because the overhead here is all completely new and I'm trying to avoid any regressions for workloads that won't benefit from txn pipelining. In other words, I don't want there to be any reason to ever turn it off. Along those lines, maintaining an entirely new tree structure per txn seemed problematic enough that I felt it deserved a bit of optimization.
I didn't quite get everything down to zero-alloc per batch, but right now it allocates with respect to batch count, not individual write count (mod some amortized alloc costs). That's a pretty nice win and it makes me more confident that we're not introducing anything that might hurt perf for certain workloads and cause this entire thing to be a pessimization. Since I'm pretty confident that the outstandingWriteAlloc
is doing the right thing, I'd like to keep it.
pkg/kv/txn_interceptor_pipeliner.go, line 225 at r25 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
This comment is stale now. Drop the second sentence?
Done. Good catch.
pkg/kv/txn_interceptor_pipeliner.go, line 33 at r26 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
This line should also be in an earlier commit if you happen to do another rebase. Seriously don't go out of your way.
Same as above, I'm pretty sure it's in the correct commit already. I see it in r25.
pkg/kv/txn_interceptor_pipeliner_test.go, line 121 at r25 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Consider extracting this context, here and below, to the top of the test.
Done.
pkg/kv/txn_interceptor_pipeliner_test.go, line 175 at r25 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Consider checking the types of the four responses to make sure it's really the QueryIntent that gets stripped.
Done.
pkg/kv/txn_interceptor_pipeliner_test.go, line 238 at r25 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
This code would be slightly easier to read if you named these variables keyA and keyC. No pressure.
Done.
pkg/kv/txn_interceptor_pipeliner_test.go, line 368 at r25 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Please also test boundary conditions here! I.e., keys "a" (incl) and "c" (excl). That's been such a common source of bugs in my life recently.
Done.
pkg/kv/txn_interceptor_pipeliner_test.go, line 409 at r25 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Theses tests are so readable. How did you sneak this
require
package in so quietly? I've wanted something like this for so long. 🙌
We've actually had it imported for a while. However, we have traditionally avoided libraries like these because someone on the Go team decided one day that assertions are bad style and that Go is better than them. The more I work with require
, the more ridiculous this becomes to me.
For tests that are self-contained and sufficently "unit testy" like these, having a single line assertion makes it significantly easier to read and write, and empirically it results in much more thoroughly tested code. For instance, this file has 174 assertions at the moment. It would probably have about 25 if I needed to write out a 3 line assertion with an intelligible description of what went wrong for each.
And it's not like we don't just build these specialized assertions ourselves anyway. git grep 'func assert' \*_test.go
shows 18 instances of custom assertions, and git grep 'assert.* :=' \*_test.go
shows another 17. I bet a good chunk of these could be replaced with a single line call to require
and save everyone involved time and effort.
I also call bullshit on the "Time invested writing a good error message now pays off later when the test breaks."
point. How much more useful is "expected a to equal b, found 2, got 3" than assertion failed: 2 (expected) != 3 (actual)
with a line number pointing directly at the failing line? Either way, you need to visit the line of the assertion to figure out what wen't wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mod the problems you're already aware of. Good work, this has come out really nice!
Reviewed 7 of 13 files at r2, 1 of 33 files at r13, 32 of 32 files at r21, 2 of 2 files at r22, 2 of 2 files at r23, 9 of 9 files at r24, 15 of 18 files at r25, 5 of 5 files at r27, 5 of 5 files at r28.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/kv/dist_sender_server_test.go, line 2446 at r27 (raw file):
// send them again after refreshing. We also could avoid the // MixedSuccessError if we knew that the QueryIntent requests don't // write anything, so they're free to send again. Something needs
Do you think this blocks this PR? I agree that we're going to want to introduce the limited idempotency that would mostly do away with MixedSuccessError, but that seems to big a project to do in this PR. Is making QueryIntent exempt from foiling the retry as easy as not setting the boolean here for applicable batches?
pkg/kv/txn_coord_sender.go, line 390 at r27 (raw file):
// TransactionalSender is part of the TxnSenderFactory interface. func (tcf *TxnCoordSenderFactory) TransactionalSender( typ client.TxnType, meta roachpb.TxnCoordMeta,
Nice reduction.
pkg/kv/txn_coord_sender.go, line 428 at r27 (raw file):
wrapped: tcs.wrapped, } tcs.interceptorStack = [...]txnInterceptor{
At some point we should add a comment here that explains the ordering of the stack.
pkg/kv/txn_coord_sender.go, line 441 at r27 (raw file):
} tcs.augmentMetaLocked(meta)
Was this simply missing before?
pkg/kv/txn_interceptor_pipeliner.go, line 34 at r27 (raw file):
"kv.transaction.write_pipelining_enabled", "if enabled, transactional writes are pipelined through Raft consensus", true,
If this merges before next Monday (7/16, at which point I think we're picking the next alpha SHA), this should be false
(and set to true once we have the SHA).
pkg/kv/txn_interceptor_pipeliner.go, line 45 at r27 (raw file):
// before committing. These async writes are referred to as "outstanding writes" // and this process of proving that an outstanding write succeeded is called // "resolving" the write.
This may be the wrong place to make this suggestion, but why don't you call this "proving" the write? "Resolving a write" already has a meaning in the context of intent resolution.
pkg/kv/txn_interceptor_pipeliner.go, line 74 at r27 (raw file):
// evaluate in. // // The interceptor queries all outstanding writes before committing a
either resolves, or, if I had my way, proves.
pkg/kv/txn_interceptor_pipeliner.go, line 117 at r27 (raw file):
// to to be sent by the DistSender in parallel. This would help with this // issue by hiding the cost of the QueryIntent requests behind the cost of the // "staging" EndTransaction request.
It still wouldn't be optimal to defer the proof of all of them until the last minute (you're catching the tail latency that way, though hopefully that's in the noise compared to the EndTransaction consensus).
pkg/kv/txn_interceptor_pipeliner.go, line 118 at r27 (raw file):
// issue by hiding the cost of the QueryIntent requests behind the cost of the // "staging" EndTransaction request. //
Stray line?
pkg/kv/txn_interceptor_pipeliner.go, line 130 at r27 (raw file):
Suggestion:
outstandingWrites represent a commitment to proving (via QueryIntent) that the write succeeded in replicating an intent
pkg/kv/txn_interceptor_pipeliner.go, line 395 at r27 (raw file):
// Clear out the outstandingWrites set and free associated memory. if tp.outstandingWrites != nil { tp.outstandingWrites.Clear(true /* addNodesToFreelist */)
It's not clear to me when we pass true
here.
pkg/kv/txn_interceptor_pipeliner.go, line 449 at r27 (raw file):
w := item.(*outstandingWrite) if seq < w.Sequence { // The sequence might have change, which means that a new write was
changed
pkg/kv/txn_interceptor_pipeliner_test.go, line 409 at r25 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We've actually had it imported for a while. However, we have traditionally avoided libraries like these because someone on the Go team decided one day that assertions are bad style and that Go is better than them. The more I work with
require
, the more ridiculous this becomes to me.For tests that are self-contained and sufficently "unit testy" like these, having a single line assertion makes it significantly easier to read and write, and empirically it results in much more thoroughly tested code. For instance, this file has 174 assertions at the moment. It would probably have about 25 if I needed to write out a 3 line assertion with an intelligible description of what went wrong for each.
And it's not like we don't just build these specialized assertions ourselves anyway.
git grep 'func assert' \*_test.go
shows 18 instances of custom assertions, andgit grep 'assert.* :=' \*_test.go
shows another 17. I bet a good chunk of these could be replaced with a single line call torequire
and save everyone involved time and effort.I also call bullshit on the
"Time invested writing a good error message now pays off later when the test breaks."
point. How much more useful is "expected a to equal b, found 2, got 3" thanassertion failed: 2 (expected) != 3 (actual)
with a line number pointing directly at the failing line? Either way, you need to visit the line of the assertion to figure out what wen't wrong.
Yeah, I drank that Go cool-aid for a while and I've also come around to realizing that a compact sane amount of helpers (as used here) is the right thing to do.
pkg/kv/txn_interceptor_pipeliner_test.go, line 427 at r27 (raw file):
// TestTxnPipelinerManyWrites tests that a txnPipeliner behaves correctly even // when it's outstanding write tree grows to a very large size.
its
pkg/kv/txn_interceptor_pipeliner_test.go, line 522 at r27 (raw file):
}) }
cas
pkg/kv/txn_interceptor_pipeliner_test.go, line 536 at r27 (raw file):
require.Equal(t, 0, tp.outstandingWritesLen()) }
Could you add a test in which the cluster setting is flipped from off to on and back to off while the transaction processes? I don't think anything will go wrong there, but it would be good to cover that case as well.
Similarly, I don't think you're testing that a non-transactional request flushes the whole pipeline (why does it do that, by the way?)
pkg/roachpb/api.proto, line 1414 at r24 (raw file):
// TODO(nvanbenschoten): Handling cases where consensus fails would // be much more straightforward if all transactional requests were // idempotent. We could just re-issue requests.
Don't want to link #26915?
pkg/storage/replica_command.go, line 474 at r27 (raw file):
// commits, we'll write this data to the left-hand range in the merge // trigger. b = txn.NewBatch()
This is pretty subtle, did you see spurious failures here due to the intent missing? We do want to make sure to flush the pipeline before calling GetSnapshotForMergeRequest
but relying on the fact that it's run through the txn is too subtle for my taste, especially since at the interceptor level you rely on the fact that it doesn't have the IsTransactional flag set. This looks like it's just exploiting a bug; you shouldn't even be able to run a "non-transactioal request" through a transaction, right? I don't have an immediate suggestion here but this looks like a bit of a wart to me. It doesn't have to block this PR, though please add a test that verifies that GetSnapshotForMerge
has the right behavior (in case we break it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)
docs/generated/settings/settings.html, line 35 at r26 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think it's in the right commit. With 26 revisions on reviewable it's hard to see but it's certainly in 9908cd8.
Gah, thanks.
pkg/kv/txn_interceptor_pipeliner.go, line 482 at r19 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, that's a good point. We were never forcing more than one re-alloc in the tests. Added a new
TestTxnPipelinerManyWrites
test that adds a lot more writes to stress this.As to whether this will move the needle on perf in absolute terms, I doubt it. That said, I set a goal of making this class as allocation-friendly as possible because the overhead here is all completely new and I'm trying to avoid any regressions for workloads that won't benefit from txn pipelining. In other words, I don't want there to be any reason to ever turn it off. Along those lines, maintaining an entirely new tree structure per txn seemed problematic enough that I felt it deserved a bit of optimization.
I didn't quite get everything down to zero-alloc per batch, but right now it allocates with respect to batch count, not individual write count (mod some amortized alloc costs). That's a pretty nice win and it makes me more confident that we're not introducing anything that might hurt perf for certain workloads and cause this entire thing to be a pessimization. Since I'm pretty confident that the
outstandingWriteAlloc
is doing the right thing, I'd like to keep it.
Ah, that makes sense. SGTM.
pkg/kv/txn_interceptor_pipeliner.go, line 45 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
This may be the wrong place to make this suggestion, but why don't you call this "proving" the write? "Resolving a write" already has a meaning in the context of intent resolution.
Agreed. I like "proving" too.
pkg/storage/replica_command.go, line 474 at r27 (raw file):
I don't have an immediate suggestion here but this looks like a bit of a wart to me.
Perhaps it doesn't make sense in general for a non-transactional request to stall the pipeline. In that case we could manually add the QueryIntents to the GetSnapshotForMerge batch here and switch back to the non-txnal sender.
27443: vendor: update github.com/google/btree r=nvanbenschoten a=nvanbenschoten This picks up a `Clear` method on btree. Pulled out of #26599 because it was annoying during rebases. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
6a86d58
to
8a6e9ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs! The biggest remaining question is whether we want txn pipelining making it into the next alpha fully enabled. I really want to get it in so that it gets an extra month of testing, but can understand reasons to not enable it yet. Perhaps a good compromise would be to get it in with a patch already prepared to disable it at a moment's notice if we see any issues in the next two weeks of prep. I'm on support next week, so I can be extra dilligent about detecting fallout from the change.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/kv/txn_coord_sender.go, line 390 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Nice reduction.
Done.
pkg/kv/txn_coord_sender.go, line 428 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
At some point we should add a comment here that explains the ordering of the stack.
Agreed. I'll add one in #27113.
pkg/kv/txn_coord_sender.go, line 441 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Was this simply missing before?
We used to only set the proto, but now that we're inflating txnCoordSenders with TxnCoordMeta's, we need a bit more. This was added in #27066.
pkg/kv/txn_interceptor_pipeliner.go, line 34 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
If this merges before next Monday (7/16, at which point I think we're picking the next alpha SHA), this should be
false
(and set to true once we have the SHA).
I was actually shooting to get this into the next alpha so that it gets as much immediate testing as possible. You think I should wait a month to see if there's any fallout?
pkg/kv/txn_interceptor_pipeliner.go, line 45 at r27 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Agreed. I like "proving" too.
Done.
pkg/kv/txn_interceptor_pipeliner.go, line 74 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
either resolves, or, if I had my way, proves.
Done.
pkg/kv/txn_interceptor_pipeliner.go, line 117 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
It still wouldn't be optimal to defer the proof of all of them until the last minute (you're catching the tail latency that way, though hopefully that's in the noise compared to the EndTransaction consensus).
Yeah that's definitely true.
pkg/kv/txn_interceptor_pipeliner.go, line 118 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Stray line?
No, I've gotten in a habit of adding an extra line at the end of an extra long comments so they don't "swallow" the type def. I guess it's a matter of taste.
pkg/kv/txn_interceptor_pipeliner.go, line 130 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Suggestion:
outstandingWrites represent a commitment to proving (via QueryIntent) that the write succeeded in replicating an intent
Done.
pkg/kv/txn_interceptor_pipeliner.go, line 395 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
It's not clear to me when we pass
true
here.
Done.
pkg/kv/txn_interceptor_pipeliner.go, line 449 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
changed
Done.
pkg/kv/txn_interceptor_pipeliner_test.go, line 427 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
its
Done.
pkg/kv/txn_interceptor_pipeliner_test.go, line 522 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
cas
?
pkg/kv/txn_interceptor_pipeliner_test.go, line 536 at r27 (raw file):
Could you add a test in which the cluster setting is flipped from off to on and back to off while the transaction processes? I don't think anything will go wrong there, but it would be good to cover that case as well.
Done.
Similarly, I don't think you're testing that a non-transactional request flushes the whole pipeline (why does it do that, by the way?)
Done. Added a comment.
pkg/roachpb/api.proto, line 1414 at r24 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Don't want to link #26915?
Done.
pkg/storage/replica_command.go, line 474 at r27 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
I don't have an immediate suggestion here but this looks like a bit of a wart to me.
Perhaps it doesn't make sense in general for a non-transactional request to stall the pipeline. In that case we could manually add the QueryIntents to the GetSnapshotForMerge batch here and switch back to the non-txnal sender.
Added a test and a comment here. The reason why I think it makes sense for non-transaction requests to stall the txn's pipeline is because their request headers are often insufficient to determine all of the keys that they will interact with.
I agree that this is a little subtle, but also don't want it to stall the entire PR. I'll add a TODO.
pkg/kv/dist_sender_server_test.go, line 2446 at r27 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Do you think this blocks this PR? I agree that we're going to want to introduce the limited idempotency that would mostly do away with MixedSuccessError, but that seems to big a project to do in this PR. Is making QueryIntent exempt from foiling the retry as easy as not setting the boolean here for applicable batches?
Done.
This change introduces an assertion that a successful batch with at least one transactional write request always results in a Raft command. This is used to ensure that no writing request silently no-ops. The assertion would have fired had the InitPut optimization not been removed in the previous commit. Release note: None
This change adds a new AsyncConsensus flag to the BatchRequest Header. If set, the request will return to the client before proposing the request into Raft. All consensus processing will be performed asynchronously. Because consensus may fail, this means that the request cannot be expected to succeed. Instead, its success must be verified. This will be used by the upcoming txnPipeliner to pipeline transactional writes. Release note: None
This change pipelines transactional writes using the approach presented in cockroachdb#16026 (comment). The change introduces transactional pipelining by creating a new `txnReqInterceptor` that hooks into the `TxnCoordSender` called `txnPipeliner`. txnPipeliner is a txnInterceptor that pipelines transactional writes by using asynchronous consensus. The interceptor then tracks all writes that have been asynchronously proposed through Raft and ensures that all interfering requests chain on to them by first proving that the async writes succeeded. The interceptor also ensures that when committing a transaction all writes that have been proposed but not proven to have succeeded are first checked before committing. These async writes are referred to as "outstanding writes" and this process of proving that an outstanding write succeeded is called "resolving" the write. Chaining on to in-flight async writes is important for two main reasons to txnPipeliner: 1. requests proposed to Raft will not necessarily succeed. For any number of reasons, the request may make it through Raft and be discarded or fail to ever even be replicated. A transaction must check that all async writes succeeded before committing. However, when these proposals do fail, their errors aren't particularly interesting to a transaction. This is because these errors are not deterministic Transaction-domain errors that a transaction must adhere to for correctness such as conditional-put errors or other symptoms of constraint violations. These kinds of errors are all discovered during write *evaluation*, which an async write will perform synchronously before consensus. Any error during consensus is outside of the Transaction-domain and can always trigger a transaction retry. 2. transport layers beneath the txnPipeliner do not provide strong enough ordering guarantees between concurrent requests in the same transaction to avoid needing explicit chaining. For instance, DistSender uses unary gRPC requests instead of gRPC streams, so it can't natively expose strong ordering guarantees. Perhaps more importantly, even when a command has entered the command queue and evaluated on a Replica, it is not guaranteed to be applied before interfering commands. This is because the command may be retried outside of the serialization of the command queue for any number of reasons, such as leaseholder changes. When the command re-enters the command queue, it's possible that interfering commands may jump ahead of it. To combat this, the txnPipeliner uses chaining to throw an error when these re-orderings would have affected the order that transactional requests evaluate in. The first sanity check benchmark was to run the change against a geo-distributed cluster running TPCC. A node was located in each of us-east1-b, us-west1-b, europe-west2-b. The load generator was located in us-east1-b and all leases were moved to this zone as well. The test first limited all operations to newOrders, but soon expanded to full TPCC after seeing desirable results. Without txn pipelining ``` _elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 180.0s 198.3 154.2% 916.5 486.5 2818.6 4160.7 5637.1 5905.6 ``` With txn pipelining ``` _elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 180.0s 200.0 155.5% 388.2 184.5 1208.0 1946.2 2684.4 2818.6 ``` One big caveat is that all foreign key checks needed to be disabled for this experiment to show a significant improvement. The reason for this is subtle. TPCC's schema is structured as a large hierarchy of tables, and most of its transactions insert down a lineage of tables. With FKs enabled, we see a query to a table immediately after it is inserted into when its child table is inserted into next. This effectively causes a pipeline stall immediately after each write, which eliminated a lot of the benefit that we expect to see here. Luckily, this is a problem that can be avoided by being a little smarter about foreign keys at the SQL-level. We should not be performing these scans over all possible column families for a row just to check for existence (resulting kv count > 0) if we know that we just inserted into that row. We need some kind of row-level existence cache in SQL to avoid these scans. This will be generally useful, but will be especially useful to avoid these pipeline stalls. Release note: None
This commit adds a new `kv.transaction.write_pipelining_max_batch_size` setting which can be used to bound the number of writes we permit in a batch that uses async consensus. This is useful because we'll have to resolve each write that uses async consensus using a QueryIntent, so there's a point where it makes more sense to just perform consensus for the entire batch synchronously and avoid all of the overhead of pipelining. This crossover point is hard to determine experimentally and depends on a lot of factors. It's better just to provide this as an option. Release note: None
8a6e9ec
to
11ca84e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/kv/txn_interceptor_pipeliner.go, line 34 at r27 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I was actually shooting to get this into the next alpha so that it gets as much immediate testing as possible. You think I should wait a month to see if there's any fallout?
The next alpha SHA isn't until the following Monday, 7/23. So this will have the entire stability week as well.
bors r+ |
26599: kv: pipeline transactional writes r=nvanbenschoten a=nvanbenschoten This change pipelines transactional writes using the approach presented in #16026 (comment). ## Approach The change introduces transactional pipelining by creating a new `txnReqInterceptor` that hooks into the `TxnCoordSender` called `txnPipeliner`. `txnPipeliner` is a `txnReqInterceptor` that pipelines transactional writes by using asynchronous consensus. The interceptor then tracks all writes that have been asynchronously proposed through Raft and ensures that all interfering requests chain on to them by first proving that the async writes succeeded. The interceptor also ensures that when committing a transaction all outstanding writes that have been proposed but not proven to have succeeded are first checked before committing. Chaining on to in-flight async writes is important for two main reasons to txnPipeliner: 1. requests proposed to Raft will not necessarily succeed. For any number of reasons, the request may make it through Raft and be discarded or fail to ever even be replicated. A transaction must check that all async writes succeeded before committing. However, when these proposals do fail, their errors aren't particularly interesting to a transaction. This is because these errors are not deterministic Transaction-domain errors that a transaction must adhere to for correctness such as conditional-put errors or other symptoms of constraint violations. These kinds of errors are all discovered during write *evaluation*, which an async write will perform synchronously before consensus. Any error during consensus is outside of the Transaction-domain and can always trigger a transaction retry. 2. transport layers beneath the txnPipeliner do not provide strong enough ordering guarantees between concurrent requests in the same transaction to avoid needing explicit chaining. For instance, DistSender uses unary gRPC requests instead of gRPC streams, so it can't natively expose strong ordering guarantees. Perhaps more importantly, even when a command has entered the command queue and evaluated on a Replica, it is not guaranteed to be applied before interfering commands. This is because the command may be retried outside of the serialization of the command queue for any number of reasons, such as leaseholder changes. When the command re-enters the command queue, it's possible that interfering commands may jump ahead of it. To combat this, the txnPipeliner uses chaining to throw an error when these re-orderings would have affected the order that transactional requests evaluate in. ## Testing/Benchmarking This change will require a lot of testing (warning: very little unit testing is included so far) and benchmarking. The first batch of benchmarking has been very promising in terms of reducing transactional latency, but it hasn't shown much effect on transactional throughput. This is somewhat expected, as the approach to transactional pipelining is hits a tradeoff between a transaction performing slightly more work (in parallel) while also having a significantly smaller contention footprint. The first sanity check benchmark was to run the change against a geo-distributed cluster running TPCC. A node was located in each of us-east1-b, us-west1-b, europe-west2-b. The load generator was located in us-east1-b and all leases were moved to this zone as well. The test first limited all operations to newOrders, but soon expanded to full TPCC after seeing desirable results. #### Without txn pipelining ``` _elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 180.0s 198.3 154.2% 916.5 486.5 2818.6 4160.7 5637.1 5905.6 ``` #### With txn pipelining ``` _elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 180.0s 200.0 155.5% 388.2 184.5 1208.0 1946.2 2684.4 2818.6 ``` One big caveat is that all foreign key checks needed to be disabled for this experiment to show much of an improvement. The reason for this is subtle. TPCC's schema is structured as a large hierarchy of tables, and most of its transactions insert down a lineage of tables. With FKs enabled, we see a query to a table immediately after it is inserted into when its child table is inserted into next. This effectively causes a pipeline stall immediately after each write, which eliminated a lot of the benefit that we expect to see here. Luckily, this is a problem that can be avoided by being a little smarter about foreign keys at the SQL-level. We should not be performing these scans over all possible column families for a row just to check for existence (resulting kv count > 0) if we know that we just inserted into that row. We need some kind of row-level existence cache in SQL to avoid these scans. This will be generally useful, but will be especially useful to avoid these pipeline stalls. Co-authored-by: Nathan VanBenschoten <[email protected]>
Build succeeded |
With you on support rotation and a cluster variable to turn the thing off
in place, you're right that we should include this in the alpha. Really
exciting to see this land!
On Fri, Jul 13, 2018 at 9:17 PM craig[bot] ***@***.***> wrote:
Merged #26599 <#26599>.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#26599 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AE135PlGttqjIBbCldFvGOlLMAJefrgkks5uGQ5UgaJpZM4Ui-yN>
.
--
…-- Tobias
|
+1 to enabling this by default for the alpha, but none of these commits had release notes. Make sure that this gets a mention (both the performance improvement and the off switch) when the alpha release notes are written. |
Thanks for catching that @bdarnell. Added in cockroachdb/docs#3422 (comment). |
This change pipelines transactional writes using the approach presented
in #16026 (comment).
Approach
The change introduces transactional pipelining by creating a new
txnReqInterceptor
that hooks into theTxnCoordSender
calledtxnPipeliner
.txnPipeliner
is atxnReqInterceptor
that pipelines transactional writes byusing asynchronous consensus. The interceptor then tracks all writes that
have been asynchronously proposed through Raft and ensures that all
interfering requests chain on to them by first proving that the async writes
succeeded. The interceptor also ensures that when committing a transaction
all outstanding writes that have been proposed but not proven to have
succeeded are first checked before committing.
Chaining on to in-flight async writes is important for two main reasons to
txnPipeliner:
reasons, the request may make it through Raft and be discarded or fail to
ever even be replicated. A transaction must check that all async writes
succeeded before committing. However, when these proposals do fail, their
errors aren't particularly interesting to a transaction. This is because
these errors are not deterministic Transaction-domain errors that a
transaction must adhere to for correctness such as conditional-put errors or
other symptoms of constraint violations. These kinds of errors are all
discovered during write evaluation, which an async write will perform
synchronously before consensus. Any error during consensus is outside of the
Transaction-domain and can always trigger a transaction retry.
ordering guarantees between concurrent requests in the same transaction to
avoid needing explicit chaining. For instance, DistSender uses unary gRPC
requests instead of gRPC streams, so it can't natively expose strong ordering
guarantees. Perhaps more importantly, even when a command has entered the
command queue and evaluated on a Replica, it is not guaranteed to be applied
before interfering commands. This is because the command may be retried
outside of the serialization of the command queue for any number of reasons,
such as leaseholder changes. When the command re-enters the command queue,
it's possible that interfering commands may jump ahead of it. To combat
this, the txnPipeliner uses chaining to throw an error when these
re-orderings would have affected the order that transactional requests
evaluate in.
Testing/Benchmarking
This change will require a lot of testing (warning: very little unit testing is
included so far) and benchmarking. The first batch of benchmarking has been very
promising in terms of reducing transactional latency, but it hasn't shown much
effect on transactional throughput. This is somewhat expected, as the approach
to transactional pipelining is hits a tradeoff between a transaction performing
slightly more work (in parallel) while also having a significantly smaller
contention footprint.
The first sanity check benchmark was to run the change against a geo-distributed
cluster running TPCC. A node was located in each of us-east1-b, us-west1-b,
europe-west2-b. The load generator was located in us-east1-b and all leases were
moved to this zone as well. The test first limited all operations to
newOrders, but soon expanded to full TPCC after seeing desirable results.
Without txn pipelining
With txn pipelining
One small caveat is that all foreign key checks needed to be disabled for this
experiment to show quiet as big of an improvement. The reason for this is subtle.
TPCC's schema is structured as a large hierarchy of tables, and most of its
transactions insert down a lineage of tables. With FKs enabled, we see a query
to a table immediately after it is inserted into when its child table is
inserted into next. This effectively causes a pipeline stall immediately after
each write, which eliminated a lot of the benefit that we expect to see here.
Luckily, this is a problem that can be avoided by being a little smarter about
foreign keys at the SQL-level. We should not be performing these scans over all
possible column families for a row just to check for existence (resulting kv
count > 0) if we know that we just inserted into that row. We need some kind of
row-level existence cache in SQL to avoid these scans. This will be generally
useful, but will be especially useful to avoid these pipeline stalls.