
GH-39565: [C++] Do not concatenate chunked values of fixed-width types to run "array_take" #41700

Open · wants to merge 20 commits into main from take_chunked_fixed

Conversation

@felipecrv (Contributor) commented May 17, 2024

Rationale for this change

Concatenating a chunked array into a single array before running the array_take kernels is very inefficient and can lead to out-of-memory crashes. See also #25822.

What changes are included in this PR?

  • Implementation of kernels for "array_take" that can receive a ChunkedArray as values and produce an output without concatenating those chunks (the core idea is sketched below)
  • Improvements to the dispatching logic of TakeMetaFunction ("take") so that "array_take" can have a chunked_exec kernel for all types (some specialized, some based on concatenation)
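
The key idea: instead of materializing one contiguous values array, each take index is resolved to a (chunk, in-chunk offset) pair. Below is a minimal sketch of that resolution, with all names invented for illustration (Arrow's kernels use internal helpers such as arrow::ChunkResolver for this):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustration only: map a logical index over a chunked array to
// (chunk number, offset within that chunk) without concatenating.
struct ChunkLocation {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

class SimpleChunkResolver {
 public:
  // `chunk_lengths` holds the length of each chunk, in order.
  explicit SimpleChunkResolver(const std::vector<int64_t>& chunk_lengths) {
    offsets_.reserve(chunk_lengths.size() + 1);
    offsets_.push_back(0);
    for (int64_t len : chunk_lengths) {
      offsets_.push_back(offsets_.back() + len);
    }
  }

  // Binary-search the cumulative offsets to locate `logical_index`.
  ChunkLocation Resolve(int64_t logical_index) const {
    auto it = std::upper_bound(offsets_.begin(), offsets_.end(), logical_index);
    int64_t chunk = static_cast<int64_t>(it - offsets_.begin()) - 1;
    return {chunk, logical_index - offsets_[chunk]};
  }

 private:
  std::vector<int64_t> offsets_;  // cumulative chunk offsets
};
```

Each index then costs a binary search over the chunk offsets rather than a share of one large concatenation; the memory/speed trade-off this implies is debated later in this thread.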

Are these changes tested?

By existing tests. Some tests were added in previous PRs that introduced some of the infrastructure to support this.

@felipecrv changed the title to GH-39565: [C++] Do not concatenate ChunkedArray values to run "array_take" (May 17, 2024)
@felipecrv changed the title to GH-39565: [C++] Do not concatenate chunked values of fixed-width types to run "array_take" (May 17, 2024)
@mapleFU (Member) commented May 17, 2024

May I ask an unrelated question: when would we call assert and when would we call DCHECK? I think they are likely to be the same.

@felipecrv (Contributor, Author) replied:

> May I ask an unrelated question: when would we call assert and when would we call DCHECK? I think they are likely to be the same.

We call assert in headers because we don't want to pay the cost of including logging.h everywhere. Think of assert as a lighter-weight debug check. But if you see an assert in a .cc file, tell me to change it to DCHECK*.
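
A minimal sketch of that convention (function names invented for illustration; the DCHECK_* macros come from arrow/util/logging.h):

```cpp
// In a header: use assert so that every file including this header does
// not also pay for arrow/util/logging.h.
#include <cassert>
#include <cstdint>

inline int64_t CheckedChunkIndex(int64_t chunk, int64_t num_chunks) {
  assert(chunk >= 0 && chunk < num_chunks);  // debug-only, zero-dependency
  return chunk;
}

// In a .cc file: prefer the DCHECK* macros; like assert, they compile
// away in release builds, but the logging include stays out of headers.
#include "arrow/util/logging.h"

void ProcessChunk(int64_t chunk, int64_t num_chunks) {
  DCHECK_GE(chunk, 0);
  DCHECK_LT(chunk, num_chunks);
  // ... actual work on the chunk ...
}
```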

@felipecrv force-pushed the take_chunked_fixed branch 2 times, most recently from fbd97a3 to f4b4e12 on June 10, 2024
felipecrv added a commit that referenced this pull request Jun 13, 2024
… make them private (#42127)

### Rationale for this change

Move TakeXXX free functions into `TakeMetaFunction` and make them private

### What changes are included in this PR?

Code move and some small refactorings in preparation for #41700.

### Are these changes tested?

By existing tests.
* GitHub Issue: #42126

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
@felipecrv force-pushed the take_chunked_fixed branch from f4b4e12 to df7de46 on June 13, 2024
@felipecrv felipecrv marked this pull request as ready for review June 15, 2024 15:12
@@ -60,6 +60,7 @@ void RegisterSelectionFunction(const std::string& name, FunctionDoc doc,
       {std::move(kernel_data.value_type), std::move(kernel_data.selection_type)},
       OutputType(FirstType));
   base_kernel.exec = kernel_data.exec;
+  base_kernel.exec_chunked = kernel_data.chunked_exec;
@felipecrv (Contributor, Author) commented:

The member variable is called exec_chunked but the type is called ChunkedExec (confusing). In this PR I ended up sticking to chunked_exec. Once everything is reviewed and merged, I could try to unify things in whichever direction people prefer.

@felipecrv force-pushed the take_chunked_fixed branch from 467f0f8 to d92da9f on June 16, 2024
@felipecrv force-pushed the take_chunked_fixed branch 4 times, most recently from 28da5e6 to 2ff6789 on June 20, 2024
@felipecrv (Contributor, Author) commented Oct 7, 2024

@pitrou wouldn't it make sense to keep the responsibility for concatenation at a layer above the kernels, like a query optimizer? That layer is in a better position to make memory/time trade-offs than the context-less kernel.

The worst regression (-81%) has the kernel still at 4G items/sec.

TakeChunkedFlatInt64FewRandomIndicesWithNulls: 4.173G items/sec → 761.551M items/sec (-81.75%)

I find it very inelegant to put these heuristics at the compute kernel level.

Imagine a pipeline trying to save on memory allocations by keeping the array chunked as much as possible, and then a simple filter operation requires allocating enough memory to hold everything contiguously.

Another case would be a pipeline where the caller is consolidating a big contiguous array for more operations than just array_take. In that case, the caller should be the one concatenating.

@pitrou (Member) commented Oct 7, 2024

> @pitrou wouldn't it make sense to keep the responsibility for concatenation at a layer above the kernels, like a query optimizer? That layer is in a better position to make memory/time trade-offs than the context-less kernel.

Ideally, perhaps. In practice this assumes that 1) there is a query optimizer, and 2) it has enough information about implementation details to make an informed decision.

In practice, take is probably often called directly in the context of PyArrow's sort methods, such as this snippet (arrow/python/pyarrow/table.pxi, lines 1139 to 1143 at e62fbaa):

    indices = _pc().sort_indices(
        self,
        options=_pc().SortOptions(sort_keys=[("", order)], **kwargs)
    )
    return self.take(indices)

> The worst regression (-81%) has the kernel still at 4G items/sec.

I might be misreading, but this is the worst regression on the new benchmarks, right (those with a small selection factor)? On the benchmarks with a 1.0 selection factor (such as when sorting), the worst absolute results are around 25 Mitems/sec AFAICT. Or are those results obsolete?

> Imagine a pipeline trying to save on memory allocations by keeping the array chunked as much as possible, and then a simple filter operation requires allocating enough memory to hold everything contiguously.

Well, currently "take" would always concatenate array chunks, so at least there is no regression in that regard.

Still, I understand the concern. We might want to expose an additional option to entirely disable concatenation when possible. But that might be overkill as well.
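
If such an option ever materialized, it might look like the sketch below. This is purely hypothetical: Arrow's real arrow::compute::TakeOptions only carries a boundscheck field, and the avoid_concatenation knob is invented here solely to illustrate the idea being floated.

```cpp
// Purely hypothetical sketch: `boundscheck` mirrors the field on the real
// arrow::compute::TakeOptions, but `avoid_concatenation` does NOT exist in
// Arrow. It only illustrates the option discussed above.
struct TakeOptionsSketch {
  bool boundscheck = true;  // mirrors the real TakeOptions field
  // If true, "take" would never concatenate chunked values, even when a
  // cost heuristic would otherwise favor concatenation.
  bool avoid_concatenation = false;
};
```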

Commit messages pushed to the branch (excerpts):

We will ensure "array_take" returns a ChunkedArray if at least one input is chunked, just like "take" does, even when the output fits in a single chunk.

…::exec_chunked

Before this commit, only the "take" meta function could handle CA (ChunkedArray) parameters. This is not a time-saver yet because in TakeCC kernels every call to TakeCA will create a new ValuesSpan instance, but this will change in the next commits.
@felipecrv (Contributor, Author) commented:

@pitrou what conditional checks should I add here to avoid regressions? I'm giving up on making the non-concatenation versions work well for integer arrays. I want to merge this PR sooner rather than later, and then start working on the string-array implementation, which is what will unlock the most user value in the first place.

@pitrou (Member) commented Dec 11, 2024

By building on this (arguably simplified) analysis:

> we're trading the concatenation of the chunked values (essentially allocating a new values array) against the resolution of many chunked indices (essentially allocating two new indices arrays). This is only beneficial if the value width is quite large (say, a 256-byte FSB) or the number of indices is much smaller than the number of values.

and assuming the following known values:

  • n_values: length of the values input
  • n_indices: length of the indices input (governing the output length)
  • value_width: byte width of the individual values

Then a simple heuristic could be to concatenate iff n_indices * 16 > n_values * value_width. This wouldn't take into account the larger computational cost associated with chunked indexing, but at least it would disable the chunked resolution approach when it doesn't make sense at all.

(btw, a moderate improvement could probably be achieved by using CompressedChunkLocation)
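
Concretely, the proposed heuristic is a one-liner. A minimal sketch (the function name is invented; n_indices, n_values, and value_width are as defined above, and 16 stands in for the assumed per-index byte cost of chunked resolution):

```cpp
#include <cstdint>

// Sketch of the heuristic proposed above: concatenate the chunked values
// iff resolving the indices chunk-by-chunk would cost more than copying
// the values into one contiguous array.
bool ShouldConcatenateValues(int64_t n_indices, int64_t n_values,
                             int64_t value_width) {
  return n_indices * 16 > n_values * value_width;
}
```

For example, taking 1M indices out of 1M int64 values (width 8) gives 16M > 8M, so concatenate; taking only 1K indices out of the same values gives 16K < 8M, so resolve chunk by chunk.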
