Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: pipeline transactional writes #26599

Merged
merged 4 commits into from
Jul 13, 2018

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Jun 11, 2018

This change pipelines transactional writes using the approach presented
in #16026 (comment).

Approach

The change introduces transactional pipelining by creating a new
txnReqInterceptor that hooks into the TxnCoordSender called txnPipeliner.
txnPipeliner is a txnReqInterceptor that pipelines transactional writes by
using asynchronous consensus. The interceptor then tracks all writes that
have been asynchronously proposed through Raft and ensures that all
interfering requests chain on to them by first proving that the async writes
succeeded. The interceptor also ensures that when committing a transaction
all outstanding writes that have been proposed but not proven to have
succeeded are first checked before committing.

Chaining on to in-flight async writes is important for two main reasons to
txnPipeliner:

  1. requests proposed to Raft will not necessarily succeed. For any number of
    reasons, the request may make it through Raft and be discarded or fail to
    ever even be replicated. A transaction must check that all async writes
    succeeded before committing. However, when these proposals do fail, their
    errors aren't particularly interesting to a transaction. This is because
    these errors are not deterministic Transaction-domain errors that a
    transaction must adhere to for correctness such as conditional-put errors or
    other symptoms of constraint violations. These kinds of errors are all
    discovered during write evaluation, which an async write will perform
    synchronously before consensus. Any error during consensus is outside of the
    Transaction-domain and can always trigger a transaction retry.
  2. transport layers beneath the txnPipeliner do not provide strong enough
    ordering guarantees between concurrent requests in the same transaction to
    avoid needing explicit chaining. For instance, DistSender uses unary gRPC
    requests instead of gRPC streams, so it can't natively expose strong ordering
    guarantees. Perhaps more importantly, even when a command has entered the
    command queue and evaluated on a Replica, it is not guaranteed to be applied
    before interfering commands. This is because the command may be retried
    outside of the serialization of the command queue for any number of reasons,
    such as leaseholder changes. When the command re-enters the command queue,
    it's possible that interfering commands may jump ahead of it. To combat
    this, the txnPipeliner uses chaining to throw an error when these
    re-orderings would have affected the order that transactional requests
    evaluate in.

Testing/Benchmarking

This change will require a lot of testing (warning: very little unit testing is
included so far) and benchmarking. The first batch of benchmarking has been very
promising in terms of reducing transactional latency, but it hasn't shown much
effect on transactional throughput. This is somewhat expected, as the approach
to transactional pipelining is hits a tradeoff between a transaction performing
slightly more work (in parallel) while also having a significantly smaller
contention footprint.

The first sanity check benchmark was to run the change against a geo-distributed
cluster running TPCC. A node was located in each of us-east1-b, us-west1-b,
europe-west2-b. The load generator was located in us-east1-b and all leases were
moved to this zone as well. The test first limited all operations to
newOrders, but soon expanded to full TPCC after seeing desirable results.

Without txn pipelining

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  180.0s      198.3 154.2%    916.5    486.5   2818.6   4160.7   5637.1   5905.6

With txn pipelining

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  180.0s      200.0 155.5%    388.2    184.5   1208.0   1946.2   2684.4   2818.6

One small caveat is that all foreign key checks needed to be disabled for this
experiment to show quiet as big of an improvement. The reason for this is subtle.
TPCC's schema is structured as a large hierarchy of tables, and most of its
transactions insert down a lineage of tables. With FKs enabled, we see a query
to a table immediately after it is inserted into when its child table is
inserted into next. This effectively causes a pipeline stall immediately after
each write, which eliminated a lot of the benefit that we expect to see here.
Luckily, this is a problem that can be avoided by being a little smarter about
foreign keys at the SQL-level. We should not be performing these scans over all
possible column families for a row just to check for existence (resulting kv
count > 0) if we know that we just inserted into that row. We need some kind of
row-level existence cache in SQL to avoid these scans. This will be generally
useful, but will be especially useful to avoid these pipeline stalls.

@nvanbenschoten nvanbenschoten requested a review from a team June 11, 2018 16:09
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnPipeliner branch from d4d02c7 to ef17628 Compare June 22, 2018 21:16
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 26, 2018
This change simplifies the process of releasing quota back into the
quota pool on failed proposals. It does so by pushing all quota
management into Raft instead of having the request's goroutine
try to determine whether it needs to release quota based on the
error in the proposal result. Quota management used to only be
handled by Raft for successful proposals.

In doing so, the change removes a case where quota could be double
freed: proposals that apply and then return an error. This isn't
particularly important though, because this pattern is never
used anymore.

Besides being a simplification, this is important for cockroachdb#26599 because
transaction pipelining depends on requests pushing proposals into
Raft and then returning immediately. In that model (asyncConsensus)
the request's goroutine is not around to release quota resources on
failed proposals.

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnPipeliner branch 2 times, most recently from e89c2d1 to f6ebac7 Compare June 26, 2018 22:39
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnPipeliner branch from f6ebac7 to 2e6b47b Compare June 26, 2018 22:40
@nvanbenschoten nvanbenschoten changed the title [prototype] kv: pipeline transactional writes kv: pipeline transactional writes Jun 28, 2018
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnPipeliner branch from 2e6b47b to b77ce75 Compare June 28, 2018 22:07
@nvanbenschoten nvanbenschoten requested review from a team June 28, 2018 22:07
@nvanbenschoten
Copy link
Member Author

nvanbenschoten commented Jun 28, 2018

This is now rebased on top of #27066. All commits other than the first are part of this PR and it is ready to be reviewed!

I'm starting to perform a more in-depth performance comparison with this change. It's pretty clear that this is beneficial for any geo-distributed cluster where consensus latency dominates all other costs. However, until now it hasn't been clear how this will fare on more standard deployments. I went through each workload in our workload generator and performed a series of benchmarks where I turned pipelining on and off using the new kv.transaction.write_pipelining_enabled cluster setting. The results were very promising. I wasn't expecting this much of a throughput improvement, especially in single DC settings.

Note: all tests were run with n1-standard-16 machines in us-east-1b

kv

Non-transactional, not affected.

bank

Non-transactional, not affected

tpcc

1 node with fks

tpcc-1node-fks

3 nodes with fks

screen shot 2018-06-28 at 3 46 00 pm

screen shot 2018-06-28 at 3 45 51 pm

screen shot 2018-06-28 at 3 47 44 pm

ledger

3 nodes

screen shot 2018-06-28 at 2 10 43 am

Note: the periodic dips (every 70 seconds) corresponded exactly with RocksDB compactions and were present with and without pipelining

As the various benchmark runs show, even without a distributed cluster with high consensus round-trip latency, transactional pipelining still significantly improves throughput. For tpcc, this improvement in throughput was around 35%. For ledger, it was around 30%. This is almost certainly because it allows all syncs to disk in a transaction to be pipelined.

The next thing I want to test is the impact transactional pipelining has when run in a pathological deployment where a SQL gateway is very far away from the leaseholder but the leaseholder is able to establish a quick quorum. For instance, a SQL gateway could be in us-east but all replicas for a given table could be in us-west. In that case, transaction pipelining might be a pessimization because it requires an extra round trip to the leaseholder for each transaction when performing all QueryIntent requests immediately before committing. This will be the worst for transactions that only perform a single write without hitting the 1PC fast-path. It's unclear whether this will be noticeable and whether the extra round-trip will dominate the savings in latency elsewhere.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnPipeliner branch from b77ce75 to b30f70f Compare June 28, 2018 22:10
@tbg
Copy link
Member

tbg commented Jun 28, 2018

These results look great.

@nvanbenschoten
Copy link
Member Author

For instance, a SQL gateway could be in us-east but all replicas for a given table could be in us-west. In that case, transaction pipelining might be a pessimization because it requires an extra round trip to the leaseholder for each transaction when performing all QueryIntent requests immediately before committing. This will be the worst for transactions that only perform a single write without hitting the 1PC fast-path.

@benesch brought up a good point that in the case of a single write that doesn't hit the 1PC path the QueryIntent request can be tacked on to the front of the EndTransaction request instead of being split by DistSender, so even in that case, this should be a win.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnPipeliner branch 2 times, most recently from 1349d93 to 990d5bc Compare June 28, 2018 23:31
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 28, 2018
This change simplifies the process of releasing quota back into the
quota pool on failed proposals. It does so by pushing all quota
management into Raft instead of having the request's goroutine
try to determine whether it needs to release quota based on the
error in the proposal result. Quota management used to only be
handled by Raft for successful proposals.

In doing so, the change removes a case where quota could be double
freed: proposals that apply and then return an error. This isn't
particularly important though, because this pattern is never
used anymore.

Besides being a simplification, this is important for cockroachdb#26599 because
transaction pipelining depends on requests pushing proposals into
Raft and then returning immediately. In that model (asyncConsensus)
the request's goroutine is not around to release quota resources on
failed proposals.

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnPipeliner branch from 990d5bc to 2e80b45 Compare June 29, 2018 00:20
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


Gopkg.lock, line 13 at r16 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

I think there is additional churn in whatever commit r16 is. It disappears again in r18. Amended the wrong commit maybe?

I'm only seeing this file 5de54ff7d4c9cff805c0b029c11765d3feb63ae3. Perhaps reviewable is showing the result of a rebase where I fixed some churn in here due to an old dep version?


pkg/kv/txn_interceptor_pipeliner.go, line 33 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

nit: settings descriptions (typically, we're not consistent) start with a lowercase letter

Done.


pkg/kv/txn_interceptor_pipeliner.go, line 102 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

"its"

Done.


pkg/kv/txn_interceptor_pipeliner.go, line 108 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

"its"

Done.


pkg/kv/txn_interceptor_pipeliner.go, line 111 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

"So far, none of these approaches have been integrated."

Done.


pkg/kv/txn_interceptor_pipeliner.go, line 225 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

This seems oddly unidiomatic. Is there some reason not to just do ba.Requests = append([]RequestUnion(nil), ba.Requests...)?

I guess your way is a lot shorter. shrug

Done.


pkg/kv/txn_interceptor_pipeliner.go, line 283 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

"batch's"

Done.


pkg/kv/txn_interceptor_pipeliner.go, line 315 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Can this ever be false? That is, since we always use QueryIntentRequest_RETURN_ERROR shouldn't we always see either an error or FoundIntent=true? Asking mostly for my own understanding.

Your understanding is correct, this should never be false. Added an assertion.


pkg/kv/txn_interceptor_pipeliner.go, line 482 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Heh, this is clever but deep in the weeds. Curious if you benchmarked to determine that this was necessary.

I was trying to avoid having this change introduce a new allocation per write. Probably in the weeds,


pkg/kv/txn_interceptor_pipeliner.go, line 489 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Do you need to clear? Can't you just truncate? AIUI it's all going to be invisible to the GC anyway.

The GC doesn't have an understanding of which elements in a backing array aren't referenced by any slices. In other words, it doesn't make a distinction between len and cap. As long as any of the array is referenced, the entire array along with all of the objects that its element pointers point to will be kept in memory.

Here's a forum post discussing this. I feel like I've seen a more authoritative one as well, but couldn't easily find it.

EDIT: here it is!


pkg/kv/txn_interceptor_pipeliner.go, line 38 at r20 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Ditto here.

Done.


pkg/roachpb/batch.go, line 202 at r18 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

optional nit: stray

Done.


pkg/roachpb/batch.go, line 385 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

"other commands". or maybe "other requests" for consistency with language in the next sentence?

Pulled this all into #27309.


pkg/roachpb/data.proto, line 393 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Maybe: "the sequence number of the request that created this write"?

Done.


pkg/sql/distsql_running.go, line 397 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

"types"

Fixing in #27066.


pkg/storage/replica.go, line 3369 at r18 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Does this belong in an earlier commit?

Done.


pkg/storage/replica.go, line 3372 at r18 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

I think it's clearer if you don't detach:if len(proposal.Local.EndTxns) != 0. Totally up to you.

(It made me nervous initially to see you detaching endTxns and throwing them away.)

Tried to make this change but ran into annoyances with proposal.Local.EndTxns being a pointer. Pulled into the conditional block to make it clearer that we're not accidentally forgetting about the variable.


pkg/storage/replica.go, line 3376 at r18 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

"asynchronous consensus" or "consensus asynchronously"

Done.


pkg/storage/replica.go, line 3385 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Does this belong in an earlier commit?

Done.


pkg/storage/replica_test.go, line 9266 at r18 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

"AsyncConsensus"

Done.


pkg/storage/replica_test.go, line 9283 at r18 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Is this preferred to a testContext? Legitimately asking, but seems like a testContext would save you a lot of trouble in finding the replica you need below.

Yep, thanks!


pkg/storage/replica_test.go, line 9317 at r18 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

"expected"

Done.

Copy link
Contributor

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 the only major question I have is about the testing of outstandingWriteAlloc. Everything else looks good to go.

Reviewed 32 of 32 files at r21, 2 of 2 files at r22, 2 of 2 files at r23, 9 of 9 files at r24, 18 of 18 files at r25, 5 of 5 files at r26.
Dismissed @bdarnell and @vivekmenezes from 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


Gopkg.lock, line 13 at r16 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm only seeing this file 5de54ff7d4c9cff805c0b029c11765d3feb63ae3. Perhaps reviewable is showing the result of a rebase where I fixed some churn in here due to an old dep version?

Agreed. Seems OK now!


docs/generated/settings/settings.html, line 35 at r26 (raw file):

<tr><td><code>kv.transaction.max_intents_bytes</code></td><td>integer</td><td><code>256000</code></td><td>maximum number of bytes used to track write intents in transactions</td></tr>
<tr><td><code>kv.transaction.max_refresh_spans_bytes</code></td><td>integer</td><td><code>256000</code></td><td>maximum number of bytes used to track refresh spans in serializable transactions</td></tr>
<tr><td><code>kv.transaction.write_pipelining_enabled</code></td><td>boolean</td><td><code>true</code></td><td>if enabled, transactional writes are pipelined through Raft consensus</td></tr>

This line should be in an earlier commit if you happen to do another rebase.


pkg/kv/txn_interceptor_pipeliner.go, line 482 at r19 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I was trying to avoid having this change introduce a new allocation per write. Probably in the weeds,

Think you forgot to finish this thought. :)

My concern is that this logic seems entirely untested. I'd prefer that we either omit this complexity until proven necessary—though happy to defer to you if your experience suggests that this path is hot enough that every allocation counts, as my intuition here is spotty—or add a test that has a transaction with a massive write set.


pkg/kv/txn_interceptor_pipeliner.go, line 489 at r19 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The GC doesn't have an understanding of which elements in a backing array aren't referenced by any slices. In other words, it doesn't make a distinction between len and cap. As long as any of the array is referenced, the entire array along with all of the objects that its element pointers point to will be kept in memory.

Here's a forum post discussing this. I feel like I've seen a more authoritative one as well, but couldn't easily find it.

EDIT: here it is!

Oh, I see, there's a byte slice inside an outstandingWrite that you want to stop referencing. Sorry, thought it was all value types. Transactions are short-lived enough that I can't imagine it'd be particularly problematic to hold onto those byte slices until the end of the transaction in the worst case, but also the cost of zeroing this memory is probably negligible. Carry on.


pkg/kv/txn_interceptor_pipeliner.go, line 225 at r25 (raw file):

				// We don't want to modify the batch's request slice directly,
				// so fork it before modifying it. Truncating the capacity will
				// force all future appends to re-allocate a new slice.

This comment is stale now. Drop the second sentence?


pkg/kv/txn_interceptor_pipeliner.go, line 33 at r26 (raw file):

var pipelinedWritesEnabled = settings.RegisterBoolSetting(
	"kv.transaction.write_pipelining_enabled",
	"if enabled, transactional writes are pipelined through Raft consensus",

This line should also be in an earlier commit if you happen to do another rebase. Seriously don't go out of your way.


pkg/kv/txn_interceptor_pipeliner_test.go, line 121 at r25 (raw file):

	})

	br, pErr := tp.SendLocked(context.Background(), ba)

Consider extracting this context, here and below, to the top of the test.


pkg/kv/txn_interceptor_pipeliner_test.go, line 175 at r25 (raw file):

	br, pErr = tp.SendLocked(context.Background(), ba)
	require.NotNil(t, br)
	require.Equal(t, 4, len(br.Responses)) // QueryIntent response stripped

Consider checking the types of the four responses to make sure it's really the QueryIntent that gets stripped.


pkg/kv/txn_interceptor_pipeliner_test.go, line 238 at r25 (raw file):

	txn := makeTxnProto()
	key, key2 := roachpb.Key("a"), roachpb.Key("c")

This code would be slightly easier to read if you named these variables keyA and keyC. No pressure.


pkg/kv/txn_interceptor_pipeliner_test.go, line 368 at r25 (raw file):

	// chains onto the first outstanding write.
	tp.maybeInsertOutstandingWriteLocked(roachpb.Key("b"), 10)
	tp.maybeInsertOutstandingWriteLocked(roachpb.Key("d"), 11)

Please also test boundary conditions here! I.e., keys "a" (incl) and "c" (excl). That's been such a common source of bugs in my life recently.


pkg/kv/txn_interceptor_pipeliner_test.go, line 409 at r25 (raw file):

	tp.epochBumpedLocked()
	require.Equal(t, 0, tp.outstandingWritesLen())

Theses tests are so readable. How did you sneak this require package in so quietly? I've wanted something like this for so long. 🙌


pkg/storage/replica.go, line 3372 at r18 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Tried to make this change but ran into annoyances with proposal.Local.EndTxns being a pointer. Pulled into the conditional block to make it clearer that we're not accidentally forgetting about the variable.

Ah, nice. Thanks!


pkg/storage/engine/mvcc.go, line 1556 at r12 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Done. #27302.

Since #27302 was merged with @bdarnell's and @vivekmenzes's approvals, I've dismissed them from this discussion. I think that makes Reviewable hide this comment away. I think. Who really knows.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnPipeliner branch from b29fe50 to 3a630e5 Compare July 12, 2018 05:52
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/generated/settings/settings.html, line 35 at r26 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

This line should be in an earlier commit if you happen to do another rebase.

I think it's in the right commit. With 26 revisions on reviewable it's hard to see but it's certainly in 9908cd8.


pkg/kv/txn_interceptor_pipeliner.go, line 482 at r19 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Think you forgot to finish this thought. :)

My concern is that this logic seems entirely untested. I'd prefer that we either omit this complexity until proven necessary—though happy to defer to you if your experience suggests that this path is hot enough that every allocation counts, as my intuition here is spotty—or add a test that has a transaction with a massive write set.

Yeah, that's a good point. We were never forcing more than one re-alloc in the tests. Added a new TestTxnPipelinerManyWrites test that adds a lot more writes to stress this.

As to whether this will move the needle on perf in absolute terms, I doubt it. That said, I set a goal of making this class as allocation-friendly as possible because the overhead here is all completely new and I'm trying to avoid any regressions for workloads that won't benefit from txn pipelining. In other words, I don't want there to be any reason to ever turn it off. Along those lines, maintaining an entirely new tree structure per txn seemed problematic enough that I felt it deserved a bit of optimization.

I didn't quite get everything down to zero-alloc per batch, but right now it allocates with respect to batch count, not individual write count (mod some amortized alloc costs). That's a pretty nice win and it makes me more confident that we're not introducing anything that might hurt perf for certain workloads and cause this entire thing to be a pessimization. Since I'm pretty confident that the outstandingWriteAlloc is doing the right thing, I'd like to keep it.


pkg/kv/txn_interceptor_pipeliner.go, line 225 at r25 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

This comment is stale now. Drop the second sentence?

Done. Good catch.


pkg/kv/txn_interceptor_pipeliner.go, line 33 at r26 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

This line should also be in an earlier commit if you happen to do another rebase. Seriously don't go out of your way.

Same as above, I'm pretty sure it's in the correct commit already. I see it in r25.


pkg/kv/txn_interceptor_pipeliner_test.go, line 121 at r25 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Consider extracting this context, here and below, to the top of the test.

Done.


pkg/kv/txn_interceptor_pipeliner_test.go, line 175 at r25 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Consider checking the types of the four responses to make sure it's really the QueryIntent that gets stripped.

Done.


pkg/kv/txn_interceptor_pipeliner_test.go, line 238 at r25 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

This code would be slightly easier to read if you named these variables keyA and keyC. No pressure.

Done.


pkg/kv/txn_interceptor_pipeliner_test.go, line 368 at r25 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Please also test boundary conditions here! I.e., keys "a" (incl) and "c" (excl). That's been such a common source of bugs in my life recently.

Done.


pkg/kv/txn_interceptor_pipeliner_test.go, line 409 at r25 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Theses tests are so readable. How did you sneak this require package in so quietly? I've wanted something like this for so long. 🙌

We've actually had it imported for a while. However, we have traditionally avoided libraries like these because someone on the Go team decided one day that assertions are bad style and that Go is better than them. The more I work with require, the more ridiculous this becomes to me.

For tests that are self-contained and sufficently "unit testy" like these, having a single line assertion makes it significantly easier to read and write, and empirically it results in much more thoroughly tested code. For instance, this file has 174 assertions at the moment. It would probably have about 25 if I needed to write out a 3 line assertion with an intelligible description of what went wrong for each.

And it's not like we don't just build these specialized assertions ourselves anyway. git grep 'func assert' \*_test.go shows 18 instances of custom assertions, and git grep 'assert.* :=' \*_test.go shows another 17. I bet a good chunk of these could be replaced with a single line call to require and save everyone involved time and effort.

I also call bullshit on the "Time invested writing a good error message now pays off later when the test breaks." point. How much more useful is "expected a to equal b, found 2, got 3" than assertion failed: 2 (expected) != 3 (actual) with a line number pointing directly at the failing line? Either way, you need to visit the line of the assertion to figure out what wen't wrong.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: mod the problems you're already aware of. Good work, this has come out really nice!

Reviewed 7 of 13 files at r2, 1 of 33 files at r13, 32 of 32 files at r21, 2 of 2 files at r22, 2 of 2 files at r23, 9 of 9 files at r24, 15 of 18 files at r25, 5 of 5 files at r27, 5 of 5 files at r28.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/kv/dist_sender_server_test.go, line 2446 at r27 (raw file):

		// send them again after refreshing. We also could avoid the
		// MixedSuccessError if we knew that the QueryIntent requests don't
		// write anything, so they're free to send again. Something needs

Do you think this blocks this PR? I agree that we're going to want to introduce the limited idempotency that would mostly do away with MixedSuccessError, but that seems to big a project to do in this PR. Is making QueryIntent exempt from foiling the retry as easy as not setting the boolean here for applicable batches?


pkg/kv/txn_coord_sender.go, line 390 at r27 (raw file):

// TransactionalSender is part of the TxnSenderFactory interface.
func (tcf *TxnCoordSenderFactory) TransactionalSender(
	typ client.TxnType, meta roachpb.TxnCoordMeta,

Nice reduction.


pkg/kv/txn_coord_sender.go, line 428 at r27 (raw file):

		wrapped: tcs.wrapped,
	}
	tcs.interceptorStack = [...]txnInterceptor{

At some point we should add a comment here that explains the ordering of the stack.


pkg/kv/txn_coord_sender.go, line 441 at r27 (raw file):

	}

	tcs.augmentMetaLocked(meta)

Was this simply missing before?


pkg/kv/txn_interceptor_pipeliner.go, line 34 at r27 (raw file):

	"kv.transaction.write_pipelining_enabled",
	"if enabled, transactional writes are pipelined through Raft consensus",
	true,

If this merges before next Monday (7/16, at which point I think we're picking the next alpha SHA), this should be false (and set to true once we have the SHA).


pkg/kv/txn_interceptor_pipeliner.go, line 45 at r27 (raw file):

// before committing. These async writes are referred to as "outstanding writes"
// and this process of proving that an outstanding write succeeded is called
// "resolving" the write.

This may be the wrong place to make this suggestion, but why don't you call this "proving" the write? "Resolving a write" already has a meaning in the context of intent resolution.


pkg/kv/txn_interceptor_pipeliner.go, line 74 at r27 (raw file):

//    evaluate in.
//
// The interceptor queries all outstanding writes before committing a

either resolves, or, if I had my way, proves.


pkg/kv/txn_interceptor_pipeliner.go, line 117 at r27 (raw file):

//     to to be sent by the DistSender in parallel. This would help with this
//     issue by hiding the cost of the QueryIntent requests behind the cost of the
//     "staging" EndTransaction request.

It still wouldn't be optimal to defer the proof of all of them until the last minute (you're catching the tail latency that way, though hopefully that's in the noise compared to the EndTransaction consensus).


pkg/kv/txn_interceptor_pipeliner.go, line 118 at r27 (raw file):

//     issue by hiding the cost of the QueryIntent requests behind the cost of the
//     "staging" EndTransaction request.
//

Stray line?


pkg/kv/txn_interceptor_pipeliner.go, line 130 at r27 (raw file):
Suggestion:

outstandingWrites represent a commitment to proving (via QueryIntent) that the write succeeded in replicating an intent


pkg/kv/txn_interceptor_pipeliner.go, line 395 at r27 (raw file):

	// Clear out the outstandingWrites set and free associated memory.
	if tp.outstandingWrites != nil {
		tp.outstandingWrites.Clear(true /* addNodesToFreelist */)

It's not clear to me when we pass true here.


pkg/kv/txn_interceptor_pipeliner.go, line 449 at r27 (raw file):

	w := item.(*outstandingWrite)
	if seq < w.Sequence {
		// The sequence might have change, which means that a new write was

changed


pkg/kv/txn_interceptor_pipeliner_test.go, line 409 at r25 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We've actually had it imported for a while. However, we have traditionally avoided libraries like these because someone on the Go team decided one day that assertions are bad style and that Go is better than them. The more I work with require, the more ridiculous this becomes to me.

For tests that are self-contained and sufficently "unit testy" like these, having a single line assertion makes it significantly easier to read and write, and empirically it results in much more thoroughly tested code. For instance, this file has 174 assertions at the moment. It would probably have about 25 if I needed to write out a 3 line assertion with an intelligible description of what went wrong for each.

And it's not like we don't just build these specialized assertions ourselves anyway. git grep 'func assert' \*_test.go shows 18 instances of custom assertions, and git grep 'assert.* :=' \*_test.go shows another 17. I bet a good chunk of these could be replaced with a single line call to require and save everyone involved time and effort.

I also call bullshit on the "Time invested writing a good error message now pays off later when the test breaks." point. How much more useful is "expected a to equal b, found 2, got 3" than assertion failed: 2 (expected) != 3 (actual) with a line number pointing directly at the failing line? Either way, you need to visit the line of the assertion to figure out what wen't wrong.

Yeah, I drank that Go cool-aid for a while and I've also come around to realizing that a compact sane amount of helpers (as used here) is the right thing to do.


pkg/kv/txn_interceptor_pipeliner_test.go, line 427 at r27 (raw file):

// TestTxnPipelinerManyWrites tests that a txnPipeliner behaves correctly even
// when it's outstanding write tree grows to a very large size.

its


pkg/kv/txn_interceptor_pipeliner_test.go, line 522 at r27 (raw file):

	})
}

cas


pkg/kv/txn_interceptor_pipeliner_test.go, line 536 at r27 (raw file):

	require.Equal(t, 0, tp.outstandingWritesLen())
}

Could you add a test in which the cluster setting is flipped from off to on and back to off while the transaction processes? I don't think anything will go wrong there, but it would be good to cover that case as well.

Similarly, I don't think you're testing that a non-transactional request flushes the whole pipeline (why does it do that, by the way?)


pkg/roachpb/api.proto, line 1414 at r24 (raw file):

  // TODO(nvanbenschoten): Handling cases where consensus fails would
  // be much more straightforward if all transactional requests were
  // idempotent. We could just re-issue requests.

Don't want to link #26915?


pkg/storage/replica_command.go, line 474 at r27 (raw file):

		// commits, we'll write this data to the left-hand range in the merge
		// trigger.
		b = txn.NewBatch()

This is pretty subtle, did you see spurious failures here due to the intent missing? We do want to make sure to flush the pipeline before calling GetSnapshotForMergeRequest but relying on the fact that it's run through the txn is too subtle for my taste, especially since at the interceptor level you rely on the fact that it doesn't have the IsTransactional flag set. This looks like it's just exploiting a bug; you shouldn't even be able to run a "non-transactioal request" through a transaction, right? I don't have an immediate suggestion here but this looks like a bit of a wart to me. It doesn't have to block this PR, though please add a test that verifies that GetSnapshotForMerge has the right behavior (in case we break it).

Copy link
Contributor

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)


docs/generated/settings/settings.html, line 35 at r26 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think it's in the right commit. With 26 revisions on reviewable it's hard to see but it's certainly in 9908cd8.

Gah, thanks.


pkg/kv/txn_interceptor_pipeliner.go, line 482 at r19 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, that's a good point. We were never forcing more than one re-alloc in the tests. Added a new TestTxnPipelinerManyWrites test that adds a lot more writes to stress this.

As to whether this will move the needle on perf in absolute terms, I doubt it. That said, I set a goal of making this class as allocation-friendly as possible because the overhead here is all completely new and I'm trying to avoid any regressions for workloads that won't benefit from txn pipelining. In other words, I don't want there to be any reason to ever turn it off. Along those lines, maintaining an entirely new tree structure per txn seemed problematic enough that I felt it deserved a bit of optimization.

I didn't quite get everything down to zero-alloc per batch, but right now it allocates with respect to batch count, not individual write count (mod some amortized alloc costs). That's a pretty nice win and it makes me more confident that we're not introducing anything that might hurt perf for certain workloads and cause this entire thing to be a pessimization. Since I'm pretty confident that the outstandingWriteAlloc is doing the right thing, I'd like to keep it.

Ah, that makes sense. SGTM.


pkg/kv/txn_interceptor_pipeliner.go, line 45 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This may be the wrong place to make this suggestion, but why don't you call this "proving" the write? "Resolving a write" already has a meaning in the context of intent resolution.

Agreed. I like "proving" too.


pkg/storage/replica_command.go, line 474 at r27 (raw file):

I don't have an immediate suggestion here but this looks like a bit of a wart to me.

Perhaps it doesn't make sense in general for a non-transactional request to stall the pipeline. In that case we could manually add the QueryIntents to the GetSnapshotForMerge batch here and switch back to the non-txnal sender.

craig bot pushed a commit that referenced this pull request Jul 12, 2018
27443: vendor: update github.com/google/btree r=nvanbenschoten a=nvanbenschoten

This picks up a `Clear` method on btree.

Pulled out of #26599 because it was annoying during rebases.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnPipeliner branch 3 times, most recently from 6a86d58 to 8a6e9ec Compare July 13, 2018 18:56
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs! The biggest remaining question is whether we want txn pipelining making it into the next alpha fully enabled. I really want to get it in so that it gets an extra month of testing, but can understand reasons to not enable it yet. Perhaps a good compromise would be to get it in with a patch already prepared to disable it at a moment's notice if we see any issues in the next two weeks of prep. I'm on support next week, so I can be extra dilligent about detecting fallout from the change.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/kv/txn_coord_sender.go, line 390 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Nice reduction.

Done.


pkg/kv/txn_coord_sender.go, line 428 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

At some point we should add a comment here that explains the ordering of the stack.

Agreed. I'll add one in #27113.


pkg/kv/txn_coord_sender.go, line 441 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Was this simply missing before?

We used to only set the proto, but now that we're inflating txnCoordSenders with TxnCoordMeta's, we need a bit more. This was added in #27066.


pkg/kv/txn_interceptor_pipeliner.go, line 34 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

If this merges before next Monday (7/16, at which point I think we're picking the next alpha SHA), this should be false (and set to true once we have the SHA).

I was actually shooting to get this into the next alpha so that it gets as much immediate testing as possible. You think I should wait a month to see if there's any fallout?


pkg/kv/txn_interceptor_pipeliner.go, line 45 at r27 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Agreed. I like "proving" too.

Done.


pkg/kv/txn_interceptor_pipeliner.go, line 74 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

either resolves, or, if I had my way, proves.

Done.


pkg/kv/txn_interceptor_pipeliner.go, line 117 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

It still wouldn't be optimal to defer the proof of all of them until the last minute (you're catching the tail latency that way, though hopefully that's in the noise compared to the EndTransaction consensus).

Yeah that's definitely true.


pkg/kv/txn_interceptor_pipeliner.go, line 118 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Stray line?

No, I've gotten in a habit of adding an extra line at the end of an extra long comments so they don't "swallow" the type def. I guess it's a matter of taste.


pkg/kv/txn_interceptor_pipeliner.go, line 130 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Suggestion:

outstandingWrites represent a commitment to proving (via QueryIntent) that the write succeeded in replicating an intent

Done.


pkg/kv/txn_interceptor_pipeliner.go, line 395 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

It's not clear to me when we pass true here.

Done.


pkg/kv/txn_interceptor_pipeliner.go, line 449 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

changed

Done.


pkg/kv/txn_interceptor_pipeliner_test.go, line 427 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

its

Done.


pkg/kv/txn_interceptor_pipeliner_test.go, line 522 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

cas

?


pkg/kv/txn_interceptor_pipeliner_test.go, line 536 at r27 (raw file):

Could you add a test in which the cluster setting is flipped from off to on and back to off while the transaction processes? I don't think anything will go wrong there, but it would be good to cover that case as well.

Done.

Similarly, I don't think you're testing that a non-transactional request flushes the whole pipeline (why does it do that, by the way?)

Done. Added a comment.


pkg/roachpb/api.proto, line 1414 at r24 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Don't want to link #26915?

Done.


pkg/storage/replica_command.go, line 474 at r27 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

I don't have an immediate suggestion here but this looks like a bit of a wart to me.

Perhaps it doesn't make sense in general for a non-transactional request to stall the pipeline. In that case we could manually add the QueryIntents to the GetSnapshotForMerge batch here and switch back to the non-txnal sender.

Added a test and a comment here. The reason why I think it makes sense for non-transaction requests to stall the txn's pipeline is because their request headers are often insufficient to determine all of the keys that they will interact with.

I agree that this is a little subtle, but also don't want it to stall the entire PR. I'll add a TODO.


pkg/kv/dist_sender_server_test.go, line 2446 at r27 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Do you think this blocks this PR? I agree that we're going to want to introduce the limited idempotency that would mostly do away with MixedSuccessError, but that seems to big a project to do in this PR. Is making QueryIntent exempt from foiling the retry as easy as not setting the boolean here for applicable batches?

Done.

This change introduces an assertion that a successful batch with
at least one transactional write request always results in a Raft
command. This is used to ensure that no writing request silently
no-ops. The assertion would have fired had the InitPut optimization
not been removed in the previous commit.

Release note: None
This change adds a new AsyncConsensus flag to the BatchRequest Header.
If set, the request will return to the client before proposing the
request into Raft. All consensus processing will be performed
asynchronously. Because consensus may fail, this means that the
request cannot be expected to succeed. Instead, its success must
be verified.

This will be used by the upcoming txnPipeliner to pipeline transactional
writes.

Release note: None
This change pipelines transactional writes using the approach presented
in cockroachdb#16026 (comment).

The change introduces transactional pipelining by creating a new
`txnReqInterceptor` that hooks into the `TxnCoordSender` called `txnPipeliner`.
txnPipeliner is a txnInterceptor that pipelines transactional writes by using
asynchronous consensus. The interceptor then tracks all writes that have been
asynchronously proposed through Raft and ensures that all interfering
requests chain on to them by first proving that the async writes succeeded.
The interceptor also ensures that when committing a transaction all writes
that have been proposed but not proven to have succeeded are first checked
before committing. These async writes are referred to as "outstanding writes"
and this process of proving that an outstanding write succeeded is called
"resolving" the write.

Chaining on to in-flight async writes is important for two main reasons to
txnPipeliner:
1. requests proposed to Raft will not necessarily succeed. For any number of
   reasons, the request may make it through Raft and be discarded or fail to
   ever even be replicated. A transaction must check that all async writes
   succeeded before committing. However, when these proposals do fail, their
   errors aren't particularly interesting to a transaction. This is because
   these errors are not deterministic Transaction-domain errors that a
   transaction must adhere to for correctness such as conditional-put errors or
   other symptoms of constraint violations. These kinds of errors are all
   discovered during write *evaluation*, which an async write will perform
   synchronously before consensus. Any error during consensus is outside of the
   Transaction-domain and can always trigger a transaction retry.
2. transport layers beneath the txnPipeliner do not provide strong enough
   ordering guarantees between concurrent requests in the same transaction to
   avoid needing explicit chaining. For instance, DistSender uses unary gRPC
   requests instead of gRPC streams, so it can't natively expose strong ordering
   guarantees. Perhaps more importantly, even when a command has entered the
   command queue and evaluated on a Replica, it is not guaranteed to be applied
   before interfering commands. This is because the command may be retried
   outside of the serialization of the command queue for any number of reasons,
   such as leaseholder changes. When the command re-enters the command queue,
   it's possible that interfering commands may jump ahead of it. To combat
   this, the txnPipeliner uses chaining to throw an error when these
   re-orderings would have affected the order that transactional requests
   evaluate in.

The first sanity check benchmark was to run the change against a geo-distributed
cluster running TPCC. A node was located in each of us-east1-b, us-west1-b,
europe-west2-b. The load generator was located in us-east1-b and all leases were
moved to this zone as well. The test first limited all operations to
newOrders, but soon expanded to full TPCC after seeing desirable results.

Without txn pipelining
```
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  180.0s      198.3 154.2%    916.5    486.5   2818.6   4160.7   5637.1   5905.6
```

With txn pipelining
```
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  180.0s      200.0 155.5%    388.2    184.5   1208.0   1946.2   2684.4   2818.6
```

One big caveat is that all foreign key checks needed to be disabled for this
experiment to show a significant improvement. The reason for this is subtle.
TPCC's schema is structured as a large hierarchy of tables, and most of its
transactions insert down a lineage of tables. With FKs enabled, we see a query
to a table immediately after it is inserted into when its child table is
inserted into next. This effectively causes a pipeline stall immediately after
each write, which eliminated a lot of the benefit that we expect to see here.
Luckily, this is a problem that can be avoided by being a little smarter about
foreign keys at the SQL-level. We should not be performing these scans over all
possible column families for a row just to check for existence (resulting kv
count > 0) if we know that we just inserted into that row. We need some kind of
row-level existence cache in SQL to avoid these scans. This will be generally
useful, but will be especially useful to avoid these pipeline stalls.

Release note: None
This commit adds a new `kv.transaction.write_pipelining_max_batch_size`
setting which can be used to bound the number of writes we permit in a
batch that uses async consensus. This is useful because we'll have to
resolve  each write that uses async consensus using a QueryIntent, so
there's a point where it makes more sense to just perform consensus for
the entire batch synchronously and avoid all of the overhead of pipelining.

This crossover point is hard to determine experimentally and depends on
a lot of factors. It's better just to provide this as an option.

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/txnPipeliner branch from 8a6e9ec to 11ca84e Compare July 13, 2018 19:00
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/kv/txn_interceptor_pipeliner.go, line 34 at r27 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I was actually shooting to get this into the next alpha so that it gets as much immediate testing as possible. You think I should wait a month to see if there's any fallout?

The next alpha SHA isn't until the following Monday, 7/23. So this will have the entire stability week as well.

@nvanbenschoten
Copy link
Member Author

bors r+

craig bot pushed a commit that referenced this pull request Jul 13, 2018
26599: kv: pipeline transactional writes r=nvanbenschoten a=nvanbenschoten

This change pipelines transactional writes using the approach presented
in #16026 (comment).

## Approach

The change introduces transactional pipelining by creating a new
`txnReqInterceptor` that hooks into the `TxnCoordSender` called `txnPipeliner`.
`txnPipeliner` is a `txnReqInterceptor` that pipelines transactional writes by
using asynchronous consensus. The interceptor then tracks all writes that
have been asynchronously proposed through Raft and ensures that all
interfering requests chain on to them by first proving that the async writes
succeeded. The interceptor also ensures that when committing a transaction
all outstanding writes that have been proposed but not proven to have
succeeded are first checked before committing.

Chaining on to in-flight async writes is important for two main reasons to
txnPipeliner:
1. requests proposed to Raft will not necessarily succeed. For any number of
   reasons, the request may make it through Raft and be discarded or fail to
   ever even be replicated. A transaction must check that all async writes
   succeeded before committing. However, when these proposals do fail, their
   errors aren't particularly interesting to a transaction. This is because
   these errors are not deterministic Transaction-domain errors that a
   transaction must adhere to for correctness such as conditional-put errors or
   other symptoms of constraint violations. These kinds of errors are all
   discovered during write *evaluation*, which an async write will perform
   synchronously before consensus. Any error during consensus is outside of the
   Transaction-domain and can always trigger a transaction retry.
2. transport layers beneath the txnPipeliner do not provide strong enough
   ordering guarantees between concurrent requests in the same transaction to
   avoid needing explicit chaining. For instance, DistSender uses unary gRPC
   requests instead of gRPC streams, so it can't natively expose strong ordering
   guarantees. Perhaps more importantly, even when a command has entered the
   command queue and evaluated on a Replica, it is not guaranteed to be applied
   before interfering commands. This is because the command may be retried
   outside of the serialization of the command queue for any number of reasons,
   such as leaseholder changes. When the command re-enters the command queue,
   it's possible that interfering commands may jump ahead of it. To combat
   this, the txnPipeliner uses chaining to throw an error when these
   re-orderings would have affected the order that transactional requests
   evaluate in.

## Testing/Benchmarking

This change will require a lot of testing (warning: very little unit testing is
included so far) and benchmarking. The first batch of benchmarking has been very
promising in terms of reducing transactional latency, but it hasn't shown much
effect on transactional throughput. This is somewhat expected, as the approach
to transactional pipelining is hits a tradeoff between a transaction performing
slightly more work (in parallel) while also having a significantly smaller
contention footprint.

The first sanity check benchmark was to run the change against a geo-distributed
cluster running TPCC. A node was located in each of us-east1-b, us-west1-b,
europe-west2-b. The load generator was located in us-east1-b and all leases were
moved to this zone as well. The test first limited all operations to
newOrders, but soon expanded to full TPCC after seeing desirable results.

#### Without txn pipelining
```
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  180.0s      198.3 154.2%    916.5    486.5   2818.6   4160.7   5637.1   5905.6
```

#### With txn pipelining
```
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  180.0s      200.0 155.5%    388.2    184.5   1208.0   1946.2   2684.4   2818.6
```

One big caveat is that all foreign key checks needed to be disabled for this
experiment to show much of an improvement. The reason for this is subtle. TPCC's
schema is structured as a large hierarchy of tables, and most of its
transactions insert down a lineage of tables. With FKs enabled, we see a query
to a table immediately after it is inserted into when its child table is
inserted into next. This effectively causes a pipeline stall immediately after
each write, which eliminated a lot of the benefit that we expect to see here.
Luckily, this is a problem that can be avoided by being a little smarter about
foreign keys at the SQL-level. We should not be performing these scans over all
possible column families for a row just to check for existence (resulting kv
count > 0) if we know that we just inserted into that row. We need some kind of
row-level existence cache in SQL to avoid these scans. This will be generally
useful, but will be especially useful to avoid these pipeline stalls.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jul 13, 2018

Build succeeded

@craig craig bot merged commit 11ca84e into cockroachdb:master Jul 13, 2018
@tbg
Copy link
Member

tbg commented Jul 14, 2018 via email

@bdarnell
Copy link
Contributor

+1 to enabling this by default for the alpha, but none of these commits had release notes. Make sure that this gets a mention (both the performance improvement and the off switch) when the alpha release notes are written.

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/txnPipeliner branch July 19, 2018 03:31
@nvanbenschoten
Copy link
Member Author

Thanks for catching that @bdarnell. Added in cockroachdb/docs#3422 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants