ARROW-1558: [C++] Implement boolean filter (selection) kernel, rename comparison kernel-related functions #4366
Conversation
Force-pushed from 16ace51 to 49fd4fa
Thanks for doing this. Here are some comments. Also, you should revert the cpp/submodules/parquet-testing changes (look at the GitHub diff view).
cpp/src/arrow/compute/kernels/mask.h (outdated)
/// \param[in] options options
/// \param[out] out resulting array
ARROW_EXPORT
Status Mask(FunctionContext* context, const Array& values, const Array& mask,
Shouldn't this be called "Select" or "Filter"?
Unfortunately Filter is already taken. Select would be fine; anyone else want to weigh in?
I actually wanted to make the same comment, but wasn't sure whether it was just nitpicking.
The doc comment now basically says "Mask an array with a boolean filter", while I would typically speak about "Filter an array with a boolean mask".
If it is not too late to rename Filter, I would rather call that Mask and the one in this PR Filter. (Although Mask is maybe also not ideal for the current Filter, as it does not "mask" (as a verb) but rather "creates a mask".)
@fsaintjacques Maybe we could rename FilterFunction -> CompareFunction, then Mask -> Filter.
I have no objection to the rename; just make sure to update the glib bindings.
cpp/src/arrow/compute/kernels/mask.h (outdated)
class FunctionContext;

struct ARROW_EXPORT MaskOptions {};
Do you have any options in mind that we may want to add here? Otherwise, this seems a bit superfluous.
Take also has an empty options struct; should I remove them both? They are both there just because I'm following the pattern of other kernels, which are complicated enough to have non-empty options.
Also, if you don't implement all types here, can you open a JIRA for the missing ones?
I'll open the follow-up JIRA once this one is merged.
I have some structural issues with the implementation. I think the filter kernel (

As motivation for why this matters: suppose that we are performing the same filter on 100 or 1000 arrays in a large table (think about the operation

Another design issue: I think it could be better to take an approach more similar to
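To illustrate that motivation with a self-contained sketch (the helpers SelectedIndices and Gather are hypothetical, not Arrow APIs): the boolean filter can be analyzed once into a list of selected indices, and that list can then be reused to gather each of the 100+ columns instead of re-walking the boolean filter per column.

```cpp
#include <cstdint>
#include <vector>

// Precompute the selected row indices from a boolean filter once.
std::vector<int64_t> SelectedIndices(const std::vector<bool>& filter) {
  std::vector<int64_t> indices;
  for (int64_t i = 0; i < static_cast<int64_t>(filter.size()); ++i) {
    if (filter[i]) indices.push_back(i);
  }
  return indices;
}

// Reuse the precomputed indices to gather every column of a wide table.
template <typename T>
std::vector<T> Gather(const std::vector<T>& column,
                      const std::vector<int64_t>& indices) {
  std::vector<T> out;
  out.reserve(indices.size());
  for (int64_t i : indices) out.push_back(column[i]);
  return out;
}
```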
Food for thought: for the sake of this discussion, let's call a Mask a (multi)set of indices that represents the rows of interest in an array. Masks can be the result of kernels, e.g. compute::Compare or gandiva::Filter (note that I used the FilterKernel name in the original compare PR to match the gandiva naming), or they could be given by the user as an explicit list of indices. Masks are found in various forms:

BooleanArray and SelectionVector have the properties that the integers are sorted, unique, and well defined (within bounds [0, Array.length())). Note that a bitmap is a selection vector with a different representation, but it can be interpreted as a list of integers. NumericArray given by users require more

Masks alone are not usually of interest: their goal is to defer materialization, for optimization purposes (e.g. chaining filter expressions) or practical reasons (e.g. filter on column A and aggregate on column B). ARROW-1558 implements the materialization of an array with a (boolean) mask, but we also have this feature in the Take kernel, with the mask only in the NumericArray form. Could we explore using the Take kernel to also support BooleanArray? Should we make the Mask type explicit via a new type in Datum, e.g. an array with various metadata flags (unique, sorted, ...)?
That is certainly doable. If we want to be explicit about the selection type, then another place that could go is
Take and Boolean-Filter are distinct computational operations, so we need kernels that implement both of them. I think the semantics of when they are invoked is a separate matter.
> BooleanArray and SelectionVector have the properties that the integers are sorted, unique, and well defined (within bounds [0, Array.length())).
Quick note here: as we introduce more functionality into Arrow, you'll see that we actually have several things happen with selection vectors that you should keep in mind:
- selection vectors are frequently unsorted, or organized in a way that coalesces other operations efficiently
- compound selection vectors are commonly used (e.g. 4 bytes for which batch of data and 2 bytes for which record in that batch of data)
- selection vectors are used to describe both when something is valid and where something should be sent (as in a shuffle)
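For illustration, one possible shape of the compound selection vector described in the second bullet above (purely a sketch; CompoundSelection is not an Arrow type):

```cpp
#include <cstdint>
#include <vector>

// One entry of a compound selection vector: 4 bytes select which record
// batch, 2 bytes select the row within that batch. Dereferencing an entry
// therefore needs the list of batches as context.
struct CompoundSelection {
  uint32_t batch_index;   // which batch of data
  uint16_t record_index;  // which record within that batch
};

using CompoundSelectionVector = std::vector<CompoundSelection>;
```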
Mandatory:
- Tests with more than 3 elements. Use the random array facility for arrays of roughly 2^16 elements.
Good to have:
- Benchmarks with each "core" kernel.
Otherwise, I'm also of the opinion that the indirection is non-trivial to follow in the kernel (visitor + 2 dynamic-to-compile-time dispatches, that's 3 layers of indirection). This is non-blocking, but reducing it would be better for long-term maintenance.
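As a rough illustration of the requested benchmark shape, here is a hedged Google Benchmark sketch parameterized by length and filtered-out percentage. FilterInt64Sketch and its plain-loop body are hypothetical stand-ins; the real benchmark would build arrays with the random array facility (random::RandomArrayGenerator) and call the Filter kernel instead.

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>
#include <random>
#include <vector>

// Regression-style benchmark: args are {number of elements, percent dropped}.
static void FilterInt64Sketch(benchmark::State& state) {
  const int64_t length = state.range(0);  // e.g. 1 << 16 elements
  const double drop_percent = static_cast<double>(state.range(1));

  // Build random input values and a random boolean filter up front.
  std::mt19937_64 rng(42);
  std::bernoulli_distribution keep(1.0 - drop_percent / 100.0);
  std::vector<int64_t> values(length);
  std::vector<bool> filter(length);
  for (int64_t i = 0; i < length; ++i) {
    values[i] = static_cast<int64_t>(rng());
    filter[i] = keep(rng);
  }

  for (auto _ : state) {
    std::vector<int64_t> out;
    out.reserve(length);
    for (int64_t i = 0; i < length; ++i) {
      if (filter[i]) out.push_back(values[i]);
    }
    benchmark::DoNotOptimize(out.data());
  }
  state.SetBytesProcessed(state.iterations() * length * sizeof(int64_t));
}

BENCHMARK(FilterInt64Sketch)->Args({1 << 16, 0})->Args({1 << 16, 50});
```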
Force-pushed from 7f86e37 to 0fd9088
Codecov Report
@@ Coverage Diff @@
## master #4366 +/- ##
=========================================
+ Coverage 88.27% 89.2% +0.92%
=========================================
Files 846 697 -149
Lines 103662 92030 -11632
Branches 1253 0 -1253
=========================================
- Hits 91512 82098 -9414
+ Misses 11903 9932 -1971
+ Partials 247 0 -247
Continue to review full report at Codecov.
Force-pushed from 10614e2 to 1dd6b0e
continue;
}
null_bitmap_builder.UnsafeAppend(true);
value_slices[filtered_i] = list_array.value_slice(i);
This is an extremely inefficient way of accomplishing this. It is untenable for large-ish arrays. Think of the number of shared_ptr constructions/destructions involved (I can count at least 3-4 per slice!), the ref-count on the type's shared_ptr, etc.

I would prefer we simply return unsupported for this type, and/or implement it for trivial static-sized types (numerics) where we don't need to flatmap onto a vector of slices (we can then remove ListArray.value_slice). This should catch at least a good chunk of use cases.

I suspect the proper solution (for non-trivial types) is to "expand" the bitmap by repeating the bits of the filter by the corresponding list length and recursing, like in the StructArray implementation.

I'd say remove this, and open a follow-up ticket.
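A minimal sketch of that "expanded bitmap" idea for list arrays (ExpandListFilter and the std::vector types are hypothetical simplifications; a real kernel would operate on Arrow offset and bitmap buffers): repeat each top-level filter bit once per child value owned by that list, then the flat child values can be filtered directly and the new offsets derived as a running sum of the kept lists' lengths.

```cpp
#include <cstdint>
#include <vector>

struct FilteredListLayout {
  std::vector<bool> child_filter;    // one bit per flat child value
  std::vector<int32_t> out_offsets;  // offsets of the filtered list array
};

// `offsets` has (number of lists + 1) entries; `filter` has one bit per list.
FilteredListLayout ExpandListFilter(const std::vector<int32_t>& offsets,
                                    const std::vector<bool>& filter) {
  FilteredListLayout out;
  out.out_offsets.push_back(0);
  const size_t num_lists = offsets.size() - 1;
  for (size_t i = 0; i < num_lists; ++i) {
    const int32_t list_length = offsets[i + 1] - offsets[i];
    // Repeat this list's filter bit once per child value it owns.
    out.child_filter.insert(out.child_filter.end(), list_length, filter[i]);
    if (filter[i]) {
      out.out_offsets.push_back(out.out_offsets.back() + list_length);
    }
  }
  return out;
}
```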
In the follow-up ticket, it'll be a good moment to implement benchmarks for Filter; I suggest you benchmark a FixedSizeList<Int32,1> vs Int32 to see the penalty.
I'll write the benchmark for FixedSizeList now, that'd be good for this PR. I disagree that inefficiency is a reason to fail to support lists; it seems more reasonable to provide a correct implementation and let optimization be driven by demand. Furthermore, if Concatenate is inefficient then that seems like a separate issue to benchmark and improve.
Indeed, filtering of fixed_size_list(int64(), 1) is significantly slower than int64():
------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
FilterInt64/32768/0 3 us 3 us 210578 null_percent=0 size=8.192k 2.25337GB/s
FilterInt64/32768/1 4 us 4 us 170663 null_percent=1 size=8.192k 1.86804GB/s
FilterInt64/32768/10 4 us 4 us 167087 null_percent=10 size=8.192k 1.8274GB/s
FilterInt64/32768/50 5 us 5 us 150915 null_percent=50 size=8.192k 1.65151GB/s
FilterFixedSizeList1Int64/32768/0 95 us 95 us 7292 null_percent=0 size=8.192k 82.5558MB/s
FilterFixedSizeList1Int64/32768/1 98 us 98 us 7148 null_percent=1 size=8.192k 79.5829MB/s
FilterFixedSizeList1Int64/32768/10 107 us 107 us 6398 null_percent=10 size=8.192k 72.7087MB/s
FilterFixedSizeList1Int64/32768/50 143 us 143 us 4862 null_percent=50 size=8.192k 54.4943MB/s
Not only is it slower, there's a chance there's an O(n^2) hidden in here. Try with an array of 1M elements.
I don't think it's O(n^2), at least. I'll try the expanded bitmap approach for comparison.
-------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------
FilterInt64/32768/0 3 us 3 us 223833 null_percent=0 size=8.192k 2.37242GB/s
FilterInt64/32768/1 4 us 4 us 180752 null_percent=1 size=8.192k 1.9616GB/s
FilterInt64/32768/10 4 us 4 us 172363 null_percent=10 size=8.192k 1.88143GB/s
FilterInt64/32768/50 5 us 5 us 153213 null_percent=50 size=8.192k 1.65837GB/s
FilterInt64/1048576/1 247 us 247 us 2836 null_percent=1 size=262.144k 1013.64MB/s
FilterInt64/8388608/1 1958 us 1953 us 358 null_percent=1 size=2.09715M 1024.14MB/s
FilterFixedSizeList1Int64/32768/0 94 us 94 us 7478 null_percent=0 size=8.192k 83.1916MB/s
FilterFixedSizeList1Int64/32768/1 96 us 96 us 7281 null_percent=1 size=8.192k 81.2395MB/s
FilterFixedSizeList1Int64/32768/10 104 us 104 us 6629 null_percent=10 size=8.192k 75.0306MB/s
FilterFixedSizeList1Int64/32768/50 144 us 144 us 4824 null_percent=50 size=8.192k 54.2683MB/s
FilterFixedSizeList1Int64/1048576/1 3196 us 3188 us 219 null_percent=1 size=262.144k 78.4131MB/s
FilterFixedSizeList1Int64/8388608/1 28612 us 28483 us 25 null_percent=1 size=2.09715M 70.2186MB/s
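A quick check on the scaling in these numbers: going from 32768 to 1048576 elements (32x), the FixedSizeList time grows from ~94 us to ~3188 us (~34x), and from 1048576 to 8388608 elements (8x) it grows to ~28483 us (~9x). So the cost per element is large but roughly constant rather than quadratic.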
Expanded bitmap is much faster:
-------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------
FilterInt64/32768/0 3 us 3 us 223626 null_percent=0 size=8.192k 2.37535GB/s
FilterInt64/32768/1 4 us 4 us 175575 null_percent=1 size=8.192k 1.9186GB/s
FilterInt64/32768/10 4 us 4 us 169822 null_percent=10 size=8.192k 1.85335GB/s
FilterInt64/32768/50 5 us 5 us 147421 null_percent=50 size=8.192k 1.61824GB/s
FilterInt64/1048576/1 249 us 249 us 2813 null_percent=1 size=262.144k 1003.18MB/s
FilterInt64/8388608/1 1997 us 1997 us 351 null_percent=1 size=2.09715M 1001.68MB/s
FilterFixedSizeList1Int64/32768/0 11 us 11 us 64630 null_percent=0 size=8.192k 711.425MB/s
FilterFixedSizeList1Int64/32768/1 12 us 12 us 56194 null_percent=1 size=8.192k 627.52MB/s
FilterFixedSizeList1Int64/32768/10 12 us 12 us 56525 null_percent=10 size=8.192k 635.185MB/s
FilterFixedSizeList1Int64/32768/50 17 us 17 us 41226 null_percent=50 size=8.192k 463.433MB/s
FilterFixedSizeList1Int64/1048576/1 498 us 498 us 1414 null_percent=1 size=262.144k 501.918MB/s
FilterFixedSizeList1Int64/8388608/1 4515 us 4515 us 164 null_percent=1 size=2.09715M 442.967MB/s
Benchmark runtimes under 10 microseconds are suspicious. Are those per-iteration times or total execution time? Might want to output in nanosecond units.
These times are per-iteration. Here they are with MinTime(1.0) and nanoseconds:
----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------------
FilterInt64/32768/0/min_time:1.000 4311 ns 4311 ns 327915 null_percent=0 size=8.192k 1.76959GB/s
FilterInt64/32768/1/min_time:1.000 5008 ns 5008 ns 277359 null_percent=1 size=8.192k 1.52345GB/s
FilterInt64/32768/10/min_time:1.000 5108 ns 5108 ns 274582 null_percent=10 size=8.192k 1.49373GB/s
FilterInt64/32768/50/min_time:1.000 5846 ns 5846 ns 242378 null_percent=50 size=8.192k 1.30513GB/s
FilterInt64/1048576/1/min_time:1.000 324842 ns 324841 ns 4316 null_percent=1 size=262.144k 769.606MB/s
FilterInt64/8388608/1/min_time:1.000 2618142 ns 2618130 ns 544 null_percent=1 size=2.09715M 763.904MB/s
FilterFixedSizeList1Int64/32768/0/min_time:1.000 11982 ns 11982 ns 114206 null_percent=0 size=8.192k 652.025MB/s
FilterFixedSizeList1Int64/32768/1/min_time:1.000 13540 ns 13540 ns 101108 null_percent=1 size=8.192k 576.992MB/s
FilterFixedSizeList1Int64/32768/10/min_time:1.000 13554 ns 13554 ns 101989 null_percent=10 size=8.192k 576.408MB/s
FilterFixedSizeList1Int64/32768/50/min_time:1.000 16951 ns 16951 ns 82888 null_percent=50 size=8.192k 460.877MB/s
FilterFixedSizeList1Int64/1048576/1/min_time:1.000 492454 ns 492453 ns 2832 null_percent=1 size=262.144k 507.663MB/s
FilterFixedSizeList1Int64/8388608/1/min_time:1.000 4483677 ns 4483659 ns 315 null_percent=1 size=2.09715M 446.064MB/s
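For reference, the settings mentioned above map to the following Google Benchmark registration (a sketch reusing the hypothetical FilterInt64Sketch function from the earlier example; the argument values are illustrative):

```cpp
// Run each case for at least 1 second and report per-iteration times in ns.
BENCHMARK(FilterInt64Sketch)
    ->Args({32768, 1})  // length, percent filtered out
    ->MinTime(1.0)
    ->Unit(benchmark::kNanosecond);
```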
}

private:
Status UnpackValuesNullCount(const ValueArray& values, const BooleanArray& filter,
@fsaintjacques currently I only use this optimization (don't check null mask when there are no nulls) for primitive types. Do you think it's worthwhile to add it into the nested types as well?
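For illustration, a minimal sketch of that fast path (not this PR's code; FilterValuesSketch and the std::vector output are hypothetical simplifications), using only the standard Array::null_count()/IsNull()/Value() accessors:

```cpp
#include <cstdint>
#include <vector>

#include "arrow/array.h"

// When the values array reports no nulls, the per-element IsNull() lookup
// can be skipped entirely.
std::vector<int64_t> FilterValuesSketch(const arrow::Int64Array& values,
                                        const arrow::BooleanArray& filter) {
  std::vector<int64_t> out;
  if (values.null_count() == 0) {
    // Fast path: no validity bitmap lookups.
    for (int64_t i = 0; i < values.length(); ++i) {
      if (filter.Value(i)) out.push_back(values.Value(i));
    }
  } else {
    // Slow path: consult the validity bitmap for every selected element.
    // (A real kernel would append a null to the output instead of 0.)
    for (int64_t i = 0; i < values.length(); ++i) {
      if (filter.Value(i)) out.push_back(values.IsNull(i) ? 0 : values.Value(i));
    }
  }
  return out;
}
```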
Force-pushed from 742ae2f to b21736d
NO_CHILD_CASE(Map);

case Type::STRUCT: {
  std::vector<std::unique_ptr<FilterKernel>> child_kernels;
I just realized you can't do this; you need to account for the top-level parent null bitmap.
@fsaintjacques I do; it's built alongside the child fields in FilterImpl<StructType>: https://github.com/apache/arrow/pull/4366/files/5a224b24ced09ed6692e0d0ef73a399dab072229#diff-4cce1db37d9827bc657384ea82da0d63R182
Couple of minor comments -- I haven't looked too closely at the implementation details, but I trust they've been reviewed adequately already. Having class FilterKernel as a public API seems to be the main thing that may need to get fixed (since it has ARROW_EXPORT on it, it seems that was the intention).
const double null_proportion;

explicit RegressionArgs(benchmark::State& state)
    : size(state.range(0) / 4),
What is state.range(0), and why divide it by 4?
This is copied from compare-benchmark.cc, and now that I look at it, I think it's an error in both locations: state.range(0) should be the byte size of the memory tested, and it's used that way in aggregate-benchmark.cc. @fsaintjacques can you confirm?
const int64_t memory_size = state.range(0) / 4;
const int64_t array_size = memory_size / sizeof(int64_t);
const double null_percent = static_cast<double>(state.range(1)) / 100.0;
auto rand = random::RandomArrayGenerator(0x94378165);
Could put this seed in a constexpr somewhere
will do
 protected:
  std::shared_ptr<DataType> type_;
};
I think you mean for this to be in filter.h? Can you add some docstrings?
I can move it; I was going to follow Cast() and make only the factory public, but I hadn't gotten to that yet.
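A hedged sketch of what "only the free function / factory public" could look like for filter.h; the names and the out-parameter type are assumptions based on the declarations quoted earlier in this thread, not the final API of this PR:

```cpp
// Hypothetical public header (filter.h): the FilterKernel class itself would
// live in an internal header / .cc file, constructed by a factory that the
// free function below calls.
#include <memory>

#include "arrow/status.h"

namespace arrow {
class Array;

namespace compute {
class FunctionContext;

/// Public entry point: materializes the rows of `values` where `filter` is true.
/// (The std::shared_ptr<Array>* out parameter is assumed from the
/// "\param[out] out resulting array" doc comment quoted above.)
Status Filter(FunctionContext* context, const Array& values, const Array& filter,
              std::shared_ptr<Array>* out);

}  // namespace compute
}  // namespace arrow
```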
Thanks for working on this -- will be very useful for applications.
Force-pushed from 5a224b2 to 3d92b6e
@bkietz can you enable Appveyor on bkietz/arrow? It can help get more timely builds sometimes.
+1
Materializes an array masked by a selection array (for example, one produced by the filter kernel).
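A hedged usage sketch of that entry point, assuming the renamed Filter keeps the (FunctionContext*, values, filter, out) shape quoted earlier in this thread; the helper name FilterExample and the exact header paths are hypothetical:

```cpp
#include <memory>

// Assumed headers: the context header from the 2019-era compute API and the
// filter.h discussed in this review.
#include "arrow/compute/context.h"
#include "arrow/compute/kernels/filter.h"
#include "arrow/memory_pool.h"

// Filter `values` down to the rows where `filter` is true.
arrow::Status FilterExample(const arrow::Array& values, const arrow::Array& filter,
                            std::shared_ptr<arrow::Array>* filtered) {
  arrow::compute::FunctionContext ctx(arrow::default_memory_pool());
  return arrow::compute::Filter(&ctx, values, filter, filtered);
}
```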