[C++][Acero] Add the ability to merge already-sorted input nodes #38381

JerAguilon · 2023-10-20T22:48:39Z

Describe the enhancement requested

I have a use case wherein I want to asof join two datasets. However, each dataset on the left and right hand side is sharded across N files. Each file is individually sorted from top to bottom. Imagine a dataset composed of two files:

file A

ts,col
1,"foo"
3,"foo"
5,"foo"

file B

ts,col
2,"bar"
3,"bar"
6,"bar"

After merging

ts,col
1,"foo"
2,"bar"
3,"bar"
3,"foo"
5,"foo"
6,"bar"

Sorting using order_by isn't tenable for huge datasets, since it has to accumulate all of its input before it can emit anything. It'd be more efficient to stream each input table and emit batches in sorted order via a heap.
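To make the heap idea concrete, here is a minimal standalone sketch of a k-way merge over inputs that are each sorted ascending by `ts`. It is not the Acero implementation: `Row`, `Cursor`, and `MergeSorted` are hypothetical names invented for illustration, everything lives in memory as plain `std::vector`s rather than streamed batches, and the order of rows with equal `ts` is left unspecified.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical row type for illustration; not an Arrow or Acero type.
struct Row {
  std::int64_t ts;
  std::string col;
};

// Merge inputs that are each already sorted ascending by ts.
// Assumption: inputs fit in memory; a real node would stream batches instead.
std::vector<Row> MergeSorted(const std::vector<std::vector<Row>>& inputs) {
  // One cursor per input: which input it reads and the next unread position.
  struct Cursor {
    std::size_t input;
    std::size_t pos;
  };

  // std::priority_queue is a max-heap, so the comparison is inverted to make
  // the cursor pointing at the smallest ts come out first.
  auto later = [&inputs](const Cursor& a, const Cursor& b) {
    return inputs[a.input][a.pos].ts > inputs[b.input][b.pos].ts;
  };
  std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heap(later);

  // Seed the heap with the first row of every non-empty input.
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    if (!inputs[i].empty()) heap.push({i, 0});
  }

  std::vector<Row> out;
  while (!heap.empty()) {
    Cursor c = heap.top();
    heap.pop();
    out.push_back(inputs[c.input][c.pos]);
    // Advance this input and re-insert it if it still has rows left.
    if (++c.pos < inputs[c.input].size()) heap.push(c);
  }
  return out;
}
```

Calling `MergeSorted` with the rows of file A and file B above yields the `ts` sequence 1, 2, 3, 3, 5, 6. The heap never holds more than one cursor per input, so each emitted row costs O(log N) for N inputs, and only the current position of each input needs to be tracked.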

Merging N files this way is computationally similar to asof_join_node.cc: you can do it efficiently by buffering data from all of your input nodes and spawning a processing thread that emits rows in sorted order.

I propose refactoring some of the guts of asof_join_node.cc so that we can achieve the above computation. I think that this will unlock lots of potential that is hidden behind specialized databases like KDB.

I have a draft PR for the idea here: #38380 and have locally tested it for O(100GB) files.

Would be curious to get opinions from @icexelloss, @bkietz, and @westonpace on this approach!

Component(s)

C++


icexelloss · 2023-10-23T14:08:56Z

I think this is a useful operation to have

JerAguilon · 2023-10-25T23:27:39Z

PR is in a presentable state now--open to thoughts!

### Rationale for this change

This is an implementation of a node that can merge N sorted inputs (only in ascending order for a first pass). Where possible I have shared components with `asof_join_node.cc`. Full description/use case is described in #38381

### What changes are included in this PR?

* Take out relevant guts of asofjoin to stream data top to bottom/consume in a non blocking manner
* Implement a sorted merger

### Are these changes tested?

Basic test added. Locally I have tested this on 100+ gigabytes of parquet, sharded across 50+ files. Happy to add a benchmark test on top of the basic test, but submitting now for code feedback.

### Are there any user-facing changes?

Yes, `sorted_merge` is now an exposed declaration

Lead-authored-by: Jeremy Aguilon <[email protected]>
Co-authored-by: jeremy <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Signed-off-by: Weston Pace <[email protected]>

westonpace · 2023-11-06T17:48:44Z

Issue resolved by pull request 38380
#38380
JerAguilon added the Type: enhancement label Oct 20, 2023

github-actions bot added the Component: C++ label Oct 20, 2023

JerAguilon mentioned this issue Oct 25, 2023

GH-38381: [C++][Acero] Create a sorted merge node #38380

Merged

westonpace added this to the 15.0.0 milestone Nov 6, 2023

github-actions bot assigned JerAguilon Nov 6, 2023

westonpace closed this as completed Nov 6, 2023
