distsql: elide RowChannel when connecting local processors #20550

petermattis · 2017-12-07T15:40:35Z

RowChannel is currently used for connecting local distsql Processors. A RowChannel is essentially a channel with a fixed size. #19288 identified RowChannel as a throughput bottleneck in distsql processing.

RowChannel implements both the RowSource and RowReceiver interfaces. Each Processor is run within a separate goroutine and the Processor.Run method loops, pulling rows from its inputs (RowSource), processing them and emitting them via ProcOutputHelper to a RowReceiver.
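As a rough illustration of the current model, here is a minimal sketch in which each processor runs in its own goroutine and neighbors are joined by a fixed-size channel. `Row`, `RowSource`, `RowReceiver`, `rowChannel`, and `doubler` are simplified stand-ins invented for this example (no metadata, types, or consumer status); they are not the real distsqlrun definitions.

```go
package main

import (
	"fmt"
	"sync"
)

// Row is a stand-in for a distsql row; the real type is richer.
type Row []int

// RowSource and RowReceiver are simplified, hypothetical versions of the
// interfaces named above.
type RowSource interface {
	Next() (Row, bool) // ok == false means the source is exhausted
}

type RowReceiver interface {
	Push(Row)
	ProducerDone()
}

// rowChannel wraps a fixed-size buffered channel and implements both
// interfaces, mirroring the role RowChannel plays between processors.
type rowChannel struct{ c chan Row }

func newRowChannel(size int) *rowChannel { return &rowChannel{c: make(chan Row, size)} }

func (rc *rowChannel) Push(r Row)        { rc.c <- r }
func (rc *rowChannel) ProducerDone()     { close(rc.c) }
func (rc *rowChannel) Next() (Row, bool) { r, ok := <-rc.c; return r, ok }

// doubler is a toy processor: it pulls rows from in, doubles every value,
// and pushes the result to out.
type doubler struct {
	in  RowSource
	out RowReceiver
}

// Run loops until the input is exhausted, then signals the consumer.
func (d *doubler) Run(wg *sync.WaitGroup) {
	defer wg.Done()
	defer d.out.ProducerDone()
	for {
		row, ok := d.in.Next()
		if !ok {
			return
		}
		res := make(Row, len(row))
		for i, v := range row {
			res[i] = 2 * v
		}
		d.out.Push(res)
	}
}

// runPipeline feeds rows through two doublers, each in its own goroutine,
// with a rowChannel on every edge, and collects the final output.
func runPipeline(rows []Row) []Row {
	a, b, c := newRowChannel(16), newRowChannel(16), newRowChannel(16)
	var wg sync.WaitGroup
	wg.Add(2)
	go (&doubler{in: a, out: b}).Run(&wg)
	go (&doubler{in: b, out: c}).Run(&wg)
	go func() {
		for _, r := range rows {
			a.Push(r)
		}
		a.ProducerDone()
	}()
	var out []Row
	for {
		r, ok := c.Next()
		if !ok {
			break
		}
		out = append(out, r)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(runPipeline([]Row{{1}, {2}, {3}})) // prints [[4] [8] [12]]
}
```

Every row crosses a channel (and usually a goroutine switch) at each edge, which is the overhead this issue is about.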

With a modest amount of refactoring, I think this could be restructured by having processor implementations also implement the RowSource interface. The internals of such a processor would be reorganized so that Next() would pull from its input(s), process and return the next row. This would be akin to the planNode interface. A processor which implements RowSource would either be run via a call to Run(), in which case Next() would never be called by a consumer, or be consumed via zero or more calls to Next(). That is, a processor would either act as a RowSource or as a Processor within a given flow, never both.
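A minimal sketch of what such a dual-mode processor could look like, assuming simplified `RowSource` and `RowReceiver` interfaces. All names here (`evenFilter`, `sliceSource`, `collector`) are hypothetical and exist only for illustration; the point is that Run() can be written as a thin loop over Next().

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the distsql types.
type Row []int

type RowSource interface {
	Next() (Row, bool) // ok == false means no more rows
}

type RowReceiver interface {
	Push(Row)
	ProducerDone()
}

// sliceSource serves rows from an in-memory slice.
type sliceSource struct{ rows []Row }

func (s *sliceSource) Next() (Row, bool) {
	if len(s.rows) == 0 {
		return nil, false
	}
	r := s.rows[0]
	s.rows = s.rows[1:]
	return r, true
}

// collector is a RowReceiver that simply stores what it is given.
type collector struct {
	rows []Row
	done bool
}

func (c *collector) Push(r Row)    { c.rows = append(c.rows, r) }
func (c *collector) ProducerDone() { c.done = true }

// evenFilter is a toy processor that keeps rows whose first column is even.
type evenFilter struct {
	input RowSource
	out   RowReceiver // only used when the processor is driven via Run
}

// Next pulls from the input until it finds a row that passes the filter.
func (f *evenFilter) Next() (Row, bool) {
	for {
		row, ok := f.input.Next()
		if !ok {
			return nil, false
		}
		if row[0]%2 == 0 {
			return row, true
		}
	}
}

// Run is the push-mode entry point, implemented in terms of Next.
func (f *evenFilter) Run() {
	for {
		row, ok := f.Next()
		if !ok {
			break
		}
		f.out.Push(row)
	}
	f.out.ProducerDone()
}

func main() {
	rows := []Row{{1}, {2}, {3}, {4}}

	// Mode 1: the consumer calls Next directly; Run is never called.
	pull := &evenFilter{input: &sliceSource{rows: rows}}
	for r, ok := pull.Next(); ok; r, ok = pull.Next() {
		fmt.Println("pulled", r)
	}

	// Mode 2: the processor is run and pushes to its output.
	sink := &collector{}
	push := &evenFilter{input: &sliceSource{rows: rows}, out: sink}
	push.Run()
	fmt.Println("pushed", sink.rows)
}
```

A given instance is used in exactly one of the two modes, matching the "never both" constraint.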

We can incrementally move towards this new API. Currently, Flow.setup() always joins processors together with a RowChannel or MultiplexedRowChannel. If a processor implements RowSource, creation of the RowChannel can be elided. We would have to mark processors joined in this way and only call Processor.Run on processors that are not acting as RowSources.
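The incremental wiring described here could be sketched roughly as follows. This is not the real Flow.setup(); `connect`, `producer`, `legacyScan`, and `fusedScan` are hypothetical names, and the interfaces are simplified. The idea is a type assertion at setup time: if the upstream processor is a RowSource, hand it to the consumer directly and remember not to call Run on it; otherwise fall back to a channel.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins; not the real distsqlrun definitions.
type Row []int

type RowSource interface {
	Next() (Row, bool) // ok == false means no more rows
}

type RowReceiver interface {
	Push(Row)
	ProducerDone()
}

type producer interface {
	SetOutput(RowReceiver)
	Run()
}

// rowChannel is a fixed-size channel usable as both source and receiver.
type rowChannel chan Row

func (c rowChannel) Push(r Row)        { c <- r }
func (c rowChannel) ProducerDone()     { close(c) }
func (c rowChannel) Next() (Row, bool) { r, ok := <-c; return r, ok }

// legacyScan only knows how to push rows to its output from Run.
type legacyScan struct {
	n   int
	out RowReceiver
}

func (s *legacyScan) SetOutput(out RowReceiver) { s.out = out }
func (s *legacyScan) Run() {
	for i := 0; i < s.n; i++ {
		s.out.Push(Row{i})
	}
	s.out.ProducerDone()
}

// fusedScan additionally implements RowSource, so it can be pulled from.
type fusedScan struct {
	legacyScan
	i int
}

func (s *fusedScan) Next() (Row, bool) {
	if s.i >= s.n {
		return nil, false
	}
	s.i++
	return Row{s.i - 1}, true
}

// connect returns the input the consumer should read from and whether the
// producer still needs to be started with Run in its own goroutine.
func connect(p producer) (input RowSource, needsRun bool) {
	if src, ok := p.(RowSource); ok {
		return src, false // channel elided: the consumer calls Next directly
	}
	ch := make(rowChannel, 16)
	p.SetOutput(ch)
	return ch, true
}

// count plays the role of a downstream processor that counts its input.
func count(p producer) (rows int, fused bool) {
	input, needsRun := connect(p)
	if needsRun {
		go p.Run()
	}
	for {
		if _, ok := input.Next(); !ok {
			return rows, !needsRun
		}
		rows++
	}
}

func main() {
	fmt.Println(count(&legacyScan{n: 5}))                        // prints 5 false
	fmt.Println(count(&fusedScan{legacyScan: legacyScan{n: 5}})) // prints 5 true
}
```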

See also #19134 which mentions "fusing" multiple processors into a single goroutine. The above proposal would achieve this.

This proposal needs to be validated by benchmarking before significant work is done. There are undoubtedly complexities with regard to cancellation and the particulars of the RowSource interface that will make this challenging.

@arjunravinarayan, @asubiotto You've likely already been thinking in this direction, but I couldn't find an issue discussing this.


Add BenchmarkRowChannelPipeline which benchmarks throughput and latency through a pipeline of RowChannels. The results are disappointing:

    name                    time/op
    RowChannelPipeline/1-8  110ns ± 2%
    RowChannelPipeline/2-8  272ns ± 1%
    RowChannelPipeline/3-8  377ns ± 4%
    RowChannelPipeline/4-8  421ns ± 1%

    name                    speed
    RowChannelPipeline/1-8  72.8MB/s ± 3%
    RowChannelPipeline/2-8  29.3MB/s ± 1%
    RowChannelPipeline/3-8  21.2MB/s ± 4%
    RowChannelPipeline/4-8  19.0MB/s ± 1%

Release note: None

See cockroachdb#20550
See cockroachdb#20553
See cockroachdb#20568
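For readers who want to reproduce the general shape of this measurement, here is a rough sketch that pushes rows through a chain of plain buffered Go channels. It is not BenchmarkRowChannelPipeline itself; `channelPipeline` and the buffer size of 16 are assumptions made for this example, and the timings it prints will not match the numbers above.

```go
package main

import (
	"fmt"
	"time"
)

// Row is a simplified stand-in for a distsql row.
type Row []int

// channelPipeline sends n rows through `stages` buffered channels chained by
// relay goroutines and returns how many rows arrived at the end.
// stages must be at least 1.
func channelPipeline(stages, n int) int {
	chans := make([]chan Row, stages)
	for i := range chans {
		chans[i] = make(chan Row, 16)
	}
	// Each relay copies rows from one channel to the next.
	for i := 1; i < stages; i++ {
		go func(in <-chan Row, out chan<- Row) {
			for r := range in {
				out <- r
			}
			close(out)
		}(chans[i-1], chans[i])
	}
	// Producer feeding the first channel.
	go func() {
		row := Row{1}
		for i := 0; i < n; i++ {
			chans[0] <- row
		}
		close(chans[0])
	}()
	// Consumer draining the last channel.
	count := 0
	for range chans[stages-1] {
		count++
	}
	return count
}

func main() {
	const n = 100000
	for stages := 1; stages <= 4; stages++ {
		start := time.Now()
		got := channelPipeline(stages, n)
		perRow := time.Since(start) / time.Duration(n)
		fmt.Printf("stages=%d rows=%d ~%v/row\n", stages, got, perRow)
	}
}
```

A proper measurement would use the `testing` package's benchmark support and benchstat rather than wall-clock timing in main.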

rjnn · 2017-12-11T03:56:05Z

@andreimatei brought up in #20584 the idea of "synchronous scheduling of co-located processors work". I'm not entirely sure what that means, but it seems to be an alternate design to this. I'd like to see it spelled out more explicitly.

My thoughts: I prefer the design outlined in this issue over maintaining/improving the current design, which uses different goroutines for different processors, for the following reasons:

- Direct function calls from one processor to the next have great cache locality.
- When we know for sure that two processors are one after the other, we can use a single buffer to pass a pre-allocated batch of rows so that we don't create extra garbage when passing data between them.
- In the longer run, we want to fuse the processors by JITting them into a single pipeline, to avoid interface indirection. The performance benefit of this is less important once we already have row batches, but in the limit case JITting relatively small batches has the best cache locality[1]. JITting is probably not something we will do in 2018, but it bears keeping in mind what the optimal design looks like according to the fastest dataflow designs out there.

We definitely need benchmarks to show how direct function calls compare to using goroutines for separate processors. @danhhz has already done some preliminary work in this area, and I'd appreciate his thoughts here as well.

[1]: http://db.csail.mit.edu/pubs/abadi-column-stores.pdf section 4.1

petermattis · 2017-12-11T13:40:46Z

@arjunravinarayan I'd like to reiterate that fusing distsql processors together via JIT'ing is definitely not going to happen in 2018. There are lots of bits of lower hanging performance fruit to pluck first. And we also need to get distsql to feature parity with local sql. Definitely worthwhile to keep such fusing in mind, but I want to make sure any other spectators don't get too far ahead of the work immediately ahead.

Refactor tableReader to implement the RowSource interface. Refactor tableReader.Run() to be implemented in terms of tableReader.Next() (i.e. the RowSource interface). See cockroachdb#20550 Release note: None

Refactor tableReader to implement the RowSource interface. Refactor tableReader.Run() to be implemented in terms of tableReader.Next() (i.e. the RowSource interface).

Adjusted BenchmarkTableReader to avoid using a RowBuffer. This shows the benefit that can be achieved by using TableReader as a RowSource ("old" below is with the benchmark modified to use a RowChannel).

    name           old time/op  new time/op  delta
    TableReader-8  11.6ms ± 5%  9.4ms ± 3%   -18.81%  (p=0.000 n=10+10)

See cockroachdb#20550

Release note: None

petermattis · 2017-12-16T01:42:35Z

After a day spent looking at profiles of distsql processors, it is clear that the lowest hanging fruit, other than the overhead of RowChannel, is eliminating allocations. As @arjunravinarayan mentioned above, one of the advantages of eliding RowChannel is that we can avoid allocations between processors.

Once processors implement RowSource, another smaller opportunity is to elide the "post-processing" steps if they are not present. The filter/limit/projection post-processing could be implemented via separate nodes that also model RowSource. Experimentation would be necessary to determine if this is beneficial or not.
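One way to picture this is a set of small wrapper nodes, each a RowSource, that are only inserted when the corresponding step is actually requested. The names below (`postProcessSpec`, `wrap`, `filterNode`, `projectNode`, `limitNode`) and the filter, projection, limit ordering are assumptions for this sketch, not the existing ProcOutputHelper behavior.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the distsql types.
type Row []int

type RowSource interface {
	Next() (Row, bool) // ok == false means no more rows
}

type sliceSource struct{ rows []Row }

func (s *sliceSource) Next() (Row, bool) {
	if len(s.rows) == 0 {
		return nil, false
	}
	r := s.rows[0]
	s.rows = s.rows[1:]
	return r, true
}

// filterNode drops rows for which pred returns false.
type filterNode struct {
	input RowSource
	pred  func(Row) bool
}

func (f *filterNode) Next() (Row, bool) {
	for {
		r, ok := f.input.Next()
		if !ok || f.pred(r) {
			return r, ok
		}
	}
}

// projectNode keeps only the listed column indexes.
type projectNode struct {
	input RowSource
	cols  []int
}

func (p *projectNode) Next() (Row, bool) {
	r, ok := p.input.Next()
	if !ok {
		return nil, false
	}
	out := make(Row, len(p.cols))
	for i, c := range p.cols {
		out[i] = r[c]
	}
	return out, true
}

// limitNode stops after a fixed number of rows.
type limitNode struct {
	input     RowSource
	remaining int
}

func (l *limitNode) Next() (Row, bool) {
	if l.remaining <= 0 {
		return nil, false
	}
	r, ok := l.input.Next()
	if ok {
		l.remaining--
	}
	return r, ok
}

// postProcessSpec is hypothetical; zero values mean "step not present".
type postProcessSpec struct {
	filter func(Row) bool
	cols   []int
	limit  int
}

// wrap adds a node only for the steps that are present, so a processor with
// no post-processing pays nothing extra.
func wrap(src RowSource, spec postProcessSpec) RowSource {
	if spec.filter != nil {
		src = &filterNode{input: src, pred: spec.filter}
	}
	if spec.cols != nil {
		src = &projectNode{input: src, cols: spec.cols}
	}
	if spec.limit > 0 {
		src = &limitNode{input: src, remaining: spec.limit}
	}
	return src
}

func main() {
	base := &sliceSource{rows: []Row{{1, 10}, {2, 20}, {3, 30}, {4, 40}}}
	src := wrap(base, postProcessSpec{
		filter: func(r Row) bool { return r[0] > 1 },
		cols:   []int{1},
		limit:  2,
	})
	for r, ok := src.Next(); ok; r, ok = src.Next() {
		fmt.Println(r) // prints [20] then [30]
	}
}
```

Whether the extra interface call per node costs more than the branches it removes is exactly the kind of question the experimentation mentioned above would need to answer.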


Elide RowChannel when connecting local processors in simple cases.

A simple `SELECT COUNT(*) FROM test.kv` query where `test.kv` contains 5m rows is reduced from 5.4s to 4.3s (a 20% speedup). When using distsql, this query utilizes a `tableReader` connected to an `aggregator`. For comparison, this query takes 4.5s when running with `set distsql=off`. These numbers are from a local single-node cluster.

See cockroachdb#20550

Release note (performance improvement): Speed up distsql query execution by "fusing" processors executing on the same node together.

jordanlewis · 2018-01-25T16:36:22Z

Can this be closed, or are we waiting for a full solution?

petermattis assigned rjnn and asubiotto on Dec 7, 2017.

This was referenced on Dec 7, 2017 by "distsql: investigate modeling an inbound stream as a RowSource" #20553 (Closed) and "distsql: investigate modeling outbox directly as a RowReceiver" #20568 (Closed).

petermattis added the C-performance label ("Perf of queries or internals. Solution not expected to change functional behavior.") on Dec 7, 2017.

petermattis mentioned this issue on Dec 8, 2017 in "distsqlrun: add RowChannel pipeline benchmark" #20575 (Merged).

petermattis mentioned this issue on Dec 8, 2017 in "sql/distsqlrun: refactor tableReader to implement RowSource" #20584 (Merged).

petermattis mentioned this issue on Dec 11, 2017 in "perf: distsql row batching" #20555 (Closed).

petermattis mentioned this issue on Jan 4, 2018 in "distsql: investigate reducing memory allocations between processors" #21222 (Closed).

petermattis mentioned this issue on Jan 5, 2018 in "sql/distsqlrun: elide RowChannel in simple cases" #21254 (Merged).

asubiotto mentioned this issue on Feb 7, 2018 in "perf: reduce allocations in ProcOutputHelper.ProcessRow in some cases" #22462 (Closed).

petermattis added this to the 2.1 milestone on Feb 21, 2018.

jordanlewis closed this as completed on May 24, 2018.
