
GH-39565: [C++] Do not concatenate chunked values of fixed-width types to run "array_take" #41700

Open · wants to merge 20 commits into main from take_chunked_fixed

Conversation

@felipecrv (Contributor) commented May 17, 2024

Rationale for this change

Concatenating a chunked array into a single array before running the array_take kernels is very inefficient and can lead to out-of-memory crashes. See also #25822.

What changes are included in this PR?

  • Implementation of kernels for "array_take" that can receive a ChunkedArray as values and produce an output without concatenating those chunks (the core idea is sketched below)
  • Improvements to the dispatching logic of TakeMetaFunction ("take") so that "array_take" can have a chunked_exec kernel for all types (some specialized, some based on concatenation)
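
The key idea: instead of materializing one contiguous values array, each take index is resolved to a (chunk, in-chunk offset) pair. Below is a minimal sketch of that resolution, with all names invented for illustration (Arrow's kernels use internal helpers such as arrow::ChunkResolver for this):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustration only: map a logical index over a chunked array to
// (chunk number, offset within that chunk) without concatenating.
struct ChunkLocation {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

class SimpleChunkResolver {
 public:
  // `chunk_lengths` holds the length of each chunk, in order.
  explicit SimpleChunkResolver(const std::vector<int64_t>& chunk_lengths) {
    offsets_.reserve(chunk_lengths.size() + 1);
    offsets_.push_back(0);
    for (int64_t len : chunk_lengths) {
      offsets_.push_back(offsets_.back() + len);
    }
  }

  // Binary-search the cumulative offsets to locate `logical_index`.
  ChunkLocation Resolve(int64_t logical_index) const {
    auto it = std::upper_bound(offsets_.begin(), offsets_.end(), logical_index);
    int64_t chunk = static_cast<int64_t>(it - offsets_.begin()) - 1;
    return {chunk, logical_index - offsets_[chunk]};
  }

 private:
  std::vector<int64_t> offsets_;  // cumulative chunk offsets
};
```

Each index then costs a binary search over the chunk offsets rather than a share of one large concatenation; the memory/speed trade-off this implies is debated later in this thread.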

Are these changes tested?

By existing tests. Some tests were added in previous PRs that introduced some of the infrastructure to support this.

@felipecrv changed the title to GH-39565: [C++] Do not concatenate ChunkedArray values to run "array_take" (May 17, 2024)
@felipecrv changed the title to GH-39565: [C++] Do not concatenate chunked values of fixed-width types to run "array_take" (May 17, 2024)
@mapleFU (Member) commented May 17, 2024

May I ask an unrelated question: when would we call assert and when would we call DCHECK? I think they are likely to be the same.

@felipecrv (Contributor, Author) replied:

> May I ask an unrelated question: when would we call assert and when would we call DCHECK? I think they are likely to be the same.

We call assert in headers because we don't want to pay the cost of including logging.h everywhere. Think of assert as a lighter-weight debug check. But if you see an assert in a .cc file, tell me to change it to DCHECK*.
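
A minimal sketch of that convention (function names invented for illustration; the DCHECK_* macros come from arrow/util/logging.h):

```cpp
// In a header: use assert so that every file including this header does
// not also pay for arrow/util/logging.h.
#include <cassert>
#include <cstdint>

inline int64_t CheckedChunkIndex(int64_t chunk, int64_t num_chunks) {
  assert(chunk >= 0 && chunk < num_chunks);  // debug-only, zero-dependency
  return chunk;
}

// In a .cc file: prefer the DCHECK* macros; like assert, they compile
// away in release builds, but the logging include stays out of headers.
#include "arrow/util/logging.h"

void ProcessChunk(int64_t chunk, int64_t num_chunks) {
  DCHECK_GE(chunk, 0);
  DCHECK_LT(chunk, num_chunks);
  // ... actual work on the chunk ...
}
```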

@felipecrv force-pushed the take_chunked_fixed branch 2 times, most recently from fbd97a3 to f4b4e12 on June 10, 2024
felipecrv added a commit that referenced this pull request Jun 13, 2024
… make them private (#42127)

### Rationale for this change

Move TakeXXX free functions into `TakeMetaFunction` and make them private

### What changes are included in this PR?

Code move and some small refactorings in preparation for #41700.

### Are these changes tested?

By existing tests.
* GitHub Issue: #42126

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
@felipecrv force-pushed the take_chunked_fixed branch from f4b4e12 to df7de46 on June 13, 2024
@felipecrv felipecrv marked this pull request as ready for review June 15, 2024 15:12
@@ -60,6 +60,7 @@ void RegisterSelectionFunction(const std::string& name, FunctionDoc doc,
       {std::move(kernel_data.value_type), std::move(kernel_data.selection_type)},
       OutputType(FirstType));
   base_kernel.exec = kernel_data.exec;
+  base_kernel.exec_chunked = kernel_data.chunked_exec;
@felipecrv (Contributor, Author) commented:

The member variable is called exec_chunked but the type is called ChunkedExec (confusing). In this PR I ended up sticking to chunked_exec. Once everything is reviewed and merged, I could try to unify things in whichever direction people prefer.

@felipecrv force-pushed the take_chunked_fixed branch from 467f0f8 to d92da9f on June 16, 2024
@felipecrv force-pushed the take_chunked_fixed branch 4 times, most recently from 28da5e6 to 2ff6789 on June 20, 2024
@felipecrv (Contributor, Author) commented Oct 7, 2024

@pitrou wouldn't it make sense to keep the responsibility for concatenation at a layer above the kernels, like a query optimizer? That layer is in a better position to make memory/time trade-offs than the context-less kernel.

The worst regression (-81%) has the kernel still at 4G items/sec.

TakeChunkedFlatInt64FewRandomIndicesWithNulls: 4.173G items/sec → 761.551M items/sec (-81.75%)

I find it very inelegant to put these heuristics at the compute kernel level.

Imagine a pipeline trying to save on memory allocations by keeping the array chunked as much as possible, and then a simple filter operation requires allocating enough memory to hold everything contiguously.

Another case would be a pipeline where the caller is consolidating a big contiguous array for more operations than just array_take. In that case, the caller should be the one concatenating.

@pitrou (Member) commented Oct 7, 2024

> @pitrou wouldn't it make sense to keep the responsibility for concatenation at a layer above the kernels, like a query optimizer? That layer is in a better position to make memory/time trade-offs than the context-less kernel.

Ideally, perhaps. In practice this assumes that 1) there is a query optimizer, and 2) it has enough information about implementation details to make an informed decision.

In practice, take is probably often called directly in the context of PyArrow's sort methods, such as this snippet (arrow/python/pyarrow/table.pxi, lines 1139 to 1143 at e62fbaa):

    indices = _pc().sort_indices(
        self,
        options=_pc().SortOptions(sort_keys=[("", order)], **kwargs)
    )
    return self.take(indices)

> The worst regression (-81%) has the kernel still at 4G items/sec.

I might be misreading, but this is the worst regression on the new benchmarks, right (those with a small selection factor)? On the benchmarks with a 1.0 selection factor (such as when sorting), the worst absolute results are around 25 Mitems/sec AFAICT. Or are those results obsolete?

> Imagine a pipeline trying to save on memory allocations by keeping the array chunked as much as possible, and then a simple filter operation requires allocating enough memory to hold everything contiguously.

Well, currently "take" would always concatenate array chunks, so at least there is no regression in that regard.

Still, I understand the concern. We might want to expose an additional option to entirely disable concatenation when possible. But that might be overkill as well.
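
If such an option ever materialized, it might look like the sketch below. This is purely hypothetical: Arrow's real arrow::compute::TakeOptions only carries a boundscheck field, and the avoid_concatenation knob is invented here solely to illustrate the idea being floated.

```cpp
// Purely hypothetical sketch: `boundscheck` mirrors the field on the real
// arrow::compute::TakeOptions, but `avoid_concatenation` does NOT exist in
// Arrow. It only illustrates the option discussed above.
struct TakeOptionsSketch {
  bool boundscheck = true;  // mirrors the real TakeOptions field
  // If true, "take" would never concatenate chunked values, even when a
  // cost heuristic would otherwise favor concatenation.
  bool avoid_concatenation = false;
};
```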

Commit messages pushed to the branch (excerpts):

We will ensure "array_take" returns a ChunkedArray if at least one input is chunked, just like "take" does, even when the output fits in a single chunk.

…::exec_chunked

Before this commit, only the "take" meta function could handle CA (ChunkedArray) parameters. This is not a time-saver yet because in TakeCC kernels every call to TakeCA will create a new ValuesSpan instance, but this will change in the next commits.
@felipecrv (Contributor, Author) commented:

@pitrou what conditional checks should I add here to avoid regressions? I'm giving up on making the non-concatenation versions work well for integer arrays. I want to merge this PR sooner rather than later, and then start working on the string-array implementation, which is what will unlock the most user value in the first place.

@pitrou (Member) commented Dec 11, 2024

By building on this (arguably simplified) analysis:

> we're trading the concatenation of the chunked values (essentially allocating a new values array) against the resolution of many chunked indices (essentially allocating two new indices arrays). This is only beneficial if the value width is quite large (say, a 256-byte FSB) or the number of indices is much smaller than the number of values.

and assuming the following known values:

  • n_values: length of the values input
  • n_indices: length of the indices input (governing the output length)
  • value_width: byte width of the individual values

Then a simple heuristic could be to concatenate iff n_indices * 16 > n_values * value_width. This wouldn't take into account the larger computational cost associated with chunked indexing, but at least it would disable the chunked resolution approach when it doesn't make sense at all.

(btw, a moderate improvement could probably be achieved by using CompressedChunkLocation)
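
Concretely, the proposed heuristic is a one-liner. A minimal sketch (the function name is invented; n_indices, n_values, and value_width are as defined above, and 16 stands in for the assumed per-index byte cost of chunked resolution):

```cpp
#include <cstdint>

// Sketch of the heuristic proposed above: concatenate the chunked values
// iff resolving the indices chunk-by-chunk would cost more than copying
// the values into one contiguous array.
bool ShouldConcatenateValues(int64_t n_indices, int64_t n_values,
                             int64_t value_width) {
  return n_indices * 16 > n_values * value_width;
}
```

For example, taking 1M indices out of 1M int64 values (width 8) gives 16M > 8M, so concatenate; taking only 1K indices out of the same values gives 16K < 8M, so resolve chunk by chunk.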
