ARROW-11928: [C++] Execution engine API #10204
Conversation
@wesm @bkietz @westonpace This is still quite draft-ish but you may want to look out for potential gotchas.
Just a few questions, I suspect I'll have more later
cpp/src/arrow/engine/exec_plan.h
```cpp
/// is started before all of its inputs.
Status StartProducing();

// XXX should we also have `void StopProducing()`?
```
Or you could use a stop token.
Right. I think a StopToken would be used at a higher level, but I'm not sure.
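To illustrate the stop-token idea: a minimal sketch of cooperative cancellation via a shared atomic flag. The class names and the `ProduceUntilStopped` helper are illustrative stand-ins, not Arrow's actual API (Arrow's real `StopToken`/`StopSource` live in `arrow/util/cancel.h` and have a richer interface).

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Illustrative sketch only, not Arrow's arrow/util/cancel.h API.
class StopSource {
 public:
  std::shared_ptr<std::atomic<bool>> flag =
      std::make_shared<std::atomic<bool>>(false);
  void RequestStop() { flag->store(true); }
};

class StopToken {
 public:
  explicit StopToken(const StopSource& source) : flag_(source.flag) {}
  bool IsStopRequested() const { return flag_->load(); }

 private:
  std::shared_ptr<std::atomic<bool>> flag_;
};

// A producer loop polls the token instead of exposing StopProducing():
inline int ProduceUntilStopped(const StopToken& token, int max_batches) {
  int produced = 0;
  while (produced < max_batches && !token.IsStopRequested()) {
    ++produced;  // stand-in for emitting a batch downstream
  }
  return produced;
}
```

The trade-off hinted at above: a token is polled by the producer, whereas `StopProducing()` is pushed from the consumer side, so the token fits better at the level that owns the whole plan.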
cpp/src/arrow/engine/exec_plan.h
```cpp
virtual ~ExecNode();

virtual const char* kind_name() = 0;
// The number of inputs and outputs expected by this node
```
Expected? Or current? What's the purpose of these fields?
Expected. They're mostly there for validation and debugging. I'm not sure we'll keep them, but introspectability can be useful.
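As a sketch of how expected counts can back validation: a hypothetical `ValidateEdges` helper (the `NodeShape` struct and string stand-ins for nodes are invented for illustration) that checks a node's wired-up edges against its declared arity.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Invented types for illustration; not the proposed ExecNode API.
struct NodeShape {
  int num_inputs;
  int num_outputs;
};

// The expected counts exist purely for this kind of sanity check.
inline bool ValidateEdges(const NodeShape& shape,
                          const std::vector<std::string>& inputs,
                          const std::vector<std::string>& outputs) {
  return static_cast<int>(inputs.size()) == shape.num_inputs &&
         static_cast<int>(outputs.size()) == shape.num_outputs;
}
```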
cpp/src/arrow/engine/exec_plan.h
```cpp
/// The datatypes for each input
// XXX Should it be std::vector<DataType>?
const std::vector<BatchDescr>& input_descrs() const { return input_descrs_; }
```
When starting a scan Arrow doesn't know the physical schema at all. Reading through this it seems like the execution plan cannot be created without knowing the physical schema. Is the plan to do some initial scanning before creating an execution plan?
Ah! I hadn't thought about that... Like num_inputs() and num_outputs(), those should be mostly useful for validation, so perhaps we can remove them.
On the other hand, even if the physical dataset schema is not known, the logical expected schema should be known (and the dataset scanner should cast implicitly). @bkietz What do you think?
The dataset_schema (reader schema in Avro parlance) is always known when we construct a ScanNode and add it to a plan. Inferring dataset schema from physical schema is deferred to the dataset factories in discovery.h. I believe it's safe to rely on having a known schema along all edges when constructing these nodes.
cpp/src/arrow/engine/exec_plan.h
```cpp
/// - StopProducing() should recurse into the inputs
/// - StopProducing() must be idempotent

// XXX What happens if StartProducing() calls an output's InputReceived()
```
Might be good to create a map node to test these rules out. Right now you have a source node which doesn't have to worry about input callbacks and a sink node which doesn't have to worry about output callbacks but no intermediate nodes.
Oh, yes, I plan to add a bunch of other test nodes to stress the execution model (including, hopefully, intermediate nodes that split or combine their input(s), that introduce random delays, that shuffle output order, that limit output size).
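A toy version of such an intermediate "map" node, assuming a drastically simplified push interface where an `int` stands in for a `compute::ExecBatch`; `MapNode` and its callback wiring are hypothetical, not part of the proposed API. The point is that it exercises both an input side (`InputReceived`) and an output side (the downstream push).

```cpp
#include <cassert>
#include <functional>

// Hypothetical intermediate node: receives a batch, transforms it,
// and forwards it downstream with its sequence number preserved.
struct MapNode {
  std::function<int(int)> fn;            // per-batch transform
  std::function<void(int, int)> output;  // (seq_num, batch) push downstream

  void InputReceived(int /*input_index*/, int seq_num, int batch) {
    output(seq_num, fn(batch));  // forward the transformed batch in order
  }
};
```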
Also, there are no examples of a node with 2 inputs. For example, what happens if one input arrives faster than the other? I suppose you would queue up to some limit and then apply backpressure?
That depends on whether the 2 inputs are both needed to produce some output; that's probably node-dependent. But there's certainly some complication when a node has more than one input or output.
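One possible shape for that node-dependent handling, sketched with invented types (`int`s stand in for ExecBatches, and `PairingNode` is not part of the proposed API): buffer the faster input, emit an output whenever both sides have a batch, and raise a backpressure flag once a queue reaches its limit.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Illustrative two-input node: pairs one batch from each input.
struct PairingNode {
  std::deque<int> left, right;  // queued batches per input
  std::size_t limit;            // queue size that triggers backpressure
  bool paused_left = false, paused_right = false;
  int matched = 0;              // stand-in for produced output batches

  explicit PairingNode(std::size_t limit) : limit(limit) {}

  void InputReceived(int input_index, int batch) {
    (input_index == 0 ? left : right).push_back(batch);
    // Produce output whenever both inputs have a batch queued.
    while (!left.empty() && !right.empty()) {
      left.pop_front();
      right.pop_front();
      ++matched;
    }
    // Ask the faster producer to pause once its queue is full;
    // a real node would call PauseProducing() on that input here.
    paused_left = left.size() >= limit;
    paused_right = right.size() >= limit;
  }
};
```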
cpp/src/arrow/engine/exec_plan.h
```cpp
// (or PauseProducing()) because it received enough data?
//
// Right now, since synchronous calls happen in both directions (input to
// output and then output to input), a node must be careful to be reentrant
```
What is the output -> input direction? Backpressure?
Yes, backpressure. Is "downstream" and "upstream" more readable than "input to output" and "output to input"?
cpp/src/arrow/engine/exec_plan.h
```cpp
virtual void InputReceived(int input_index, int seq_num, compute::ExecBatch batch) = 0;

/// Signal error to ExecNode
virtual void ErrorReceived(int input_index, Status error) = 0;
```
Maybe too magic, but these two:

```cpp
virtual void InputReceived(int input_index, int seq_num, compute::ExecBatch batch) = 0;

/// Signal error to ExecNode
virtual void ErrorReceived(int input_index, Status error) = 0;
```

could be folded into a single:

```cpp
virtual void Receive(int input_index, int seq_num, Result<compute::ExecBatch> batch) = 0;
```
Hmm, no preference from me.
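A toy illustration of how the folded signature would dispatch, using an invented `ToyResult` in place of `arrow::Result<compute::ExecBatch>` (the `Counter` node and field names are made up for the sketch):

```cpp
#include <cassert>
#include <string>

// Stand-in for arrow::Result<compute::ExecBatch>: either a batch or an error.
struct ToyResult {
  bool ok;
  int batch;          // stand-in for compute::ExecBatch
  std::string error;  // stand-in for Status
};

// A node with the single entry point, dispatching to the two
// behaviors that InputReceived/ErrorReceived would otherwise split.
struct Counter {
  int batches = 0;
  int errors = 0;
  void Receive(int /*input_index*/, int /*seq_num*/, const ToyResult& result) {
    if (result.ok) {
      ++batches;  // handle the batch
    } else {
      ++errors;   // propagate or record the error
    }
  }
};
```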
cpp/src/arrow/engine/exec_plan.h
```cpp
/// and StopProducing()

/// Transfer input batch to ExecNode
virtual void InputReceived(int input_index, int seq_num, compute::ExecBatch batch) = 0;
```
Would it be possible to identify inputs and outputs by pointer rather than index?
i.e. replacing

```cpp
virtual void InputReceived(int input_index, int seq_num, compute::ExecBatch batch) = 0;
```

with

```cpp
virtual void InputReceived(ExecNode* input, int seq_num, compute::ExecBatch batch) = 0;
```
I think we won't gain a huge performance benefit by using indices at the top level, and they require much more cognitive overhead: each node must track its identifying indices within other nodes. If we identified by pointer instead, a node could push output with:
```cpp
auto batch = ...;
// push to outputs
for (ExecNode* output : outputs_) {
  output->InputReceived(this, seq_num, batch);
}
```
Instead of:
```cpp
auto batch = ...;
// push to outputs
for (OutputNode output : outputs_) {
  output.output->InputReceived(output.input_index, seq_num, batch);
}
```
That would certainly work, API-wise, but I'm a bit wary of introducing hash lookups out of convenience (unless you have another implementation in mind?).
I was thinking of a simple linear search. I wouldn't expect nodes to deal with a large number of inputs, and for vectors of integers shorter than ~256 I'd expect a linear search to be faster than a hash lookup anyway.
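The linear search in question is essentially a one-liner; this sketch (with a stand-in `Node` type and a hypothetical `FindInputIndex` helper) shows resolving an input's index from its pointer:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Node {};  // stand-in for ExecNode, for illustration only

// Resolve a node's index within an inputs vector, or -1 if absent.
inline int FindInputIndex(const std::vector<Node*>& inputs, Node* input) {
  auto it = std::find(inputs.begin(), inputs.end(), input);
  return it == inputs.end() ? -1 : static_cast<int>(it - inputs.begin());
}
```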
I think we can defer support for multiple consumers for the moment; I'm refactoring to singular output nodes.
Hmm, I'm not convinced about this. Say you have a RecordBatchReader with several columns of interest. You want each column to feed a separate execution node (say, a different scalar function per column). How do you do that if you can only have singular output nodes? Implement a "supernode" that applies multiple scalar functions to multiple columns?
I would work around this with an API for producing conjoined split nodes:

```cpp
/// Split the output of `to_be_split` for consumption by multiple downstream nodes
///
/// Each split node will list `to_be_split` as its only input, though `to_be_split` will only
/// consider the split node as its output. Whenever `to_be_split` pushes to the first split
/// node, that data will be replicated to the outputs of each split node. Back pressure
/// on any split node will also be felt by `to_be_split`.
std::vector<std::unique_ptr<ExecNode>> MakeSplit(ExecNode* to_be_split, int n_splits);
```

I think that the multiple consumer case will be sufficiently uncommon that making it …
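A toy sketch of the replication behavior MakeSplit describes: one upstream push fans out to every split output. The `Split` struct and callback wiring are invented for illustration (an `int` stands in for an ExecBatch); this is not the proposed Arrow API.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// One push from upstream is replicated to every registered consumer,
// mirroring how each split node would forward to its own output.
struct Split {
  std::vector<std::function<void(int)>> consumers;

  void Push(int batch) {
    for (auto& consumer : consumers) {
      consumer(batch);  // replicate the batch to each split output
    }
  }
};
```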
Regardless of whether splitting is delegated to a specialized node (which I agree is a good idea), I'm not sure what is gained by removing multiple outputs from the API. Nodes with a single output don't have to deal with the complication in either case.
This article has an interesting description of how DAGs (which imply multiple outputs) are used by Materialize to optimize query plans: https://scattered-thoughts.net/writing/materialize-decorrelation I don't know nearly enough to say how common or essential this is.

As for complications, multiple outputs introduce buffering (in both pull and push models). While you are delivering a result to consumer 1 you have to buffer the result so you can later deliver it to consumer 2. If your query plan's bottleneck is down the consumer 1 path, you could accumulate results in the multicasting operator and need to trigger backpressure. That's the main complication that jumps to mind.

That being said, this "multicasting" is one of the more confusing points of Rx (Reactive). But that may just come from the dynamic and linear way in which observers are chained. Since you're already building a graph (that presumably is unchanging for the duration of the execution), that shouldn't be a problem.
It's worth repeating that multiple consumers will exist one way or another. They can be exposed in the API, or they can be buried inside the implementation of a special node. Personally, I would find it slightly more logical to expose them in the API, and it would help give the user a better view of what happens (think visualizing the execution graph, for instance).
Closes apache#10204 from pitrou/ARROW-11928-engine-hierarchy-v2

Lead-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>