Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1558: [C++] Implement boolean filter (selection) kernel, rename comparison kernel-related functions #4366

Closed

Conversation

bkietz
Copy link
Member

@bkietz bkietz commented May 21, 2019

Materializes an array masked by a selection array (for example one produced by the filter kernel)

@bkietz bkietz force-pushed the 1558-Implement-boolean-selection-kernels branch from 16ace51 to 49fd4fa Compare May 22, 2019 13:33
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. Here are some comments.

Also you should revert the cpp/submodules/parquet-testing changes (look at the Github diff view).

/// \param[in] options options
/// \param[out] out resulting array
ARROW_EXPORT
Status Mask(FunctionContext* context, const Array& values, const Array& mask,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be called "Select" or "Filter"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately Filter is already taken. Select would be fine, anyone else want to weigh in?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to give the same comment actually, but not sure it was nitpicking.

The doc comment now basically says "Mask an array with a boolean filter", while I would typically speak about "Filter an array with a boolean mask".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is not too late to rename Filter, I would rather call that Mask and the one in this PR Filter.

(although Mask is maybe also not ideal for the current Filter, as it does not "mask" (as a verb) but it "creates a mask")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fsaintjacques Maybe we could rename FilterFunction -> CompareFunction, then Mask -> Filter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no objection with the rename, just make sure to update the glib bindings.


class FunctionContext;

struct ARROW_EXPORT MaskOptions {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any options in mind that we may want to add here? Otherwise, this seems a bit superfluous.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take also has an empty options struct, should I remove them both? They are both there just because I'm following the pattern of other kernels which are complicated enough to have non-empty options.

cpp/src/arrow/compute/kernels/mask.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/mask.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/mask-test.cc Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented May 22, 2019

Also, if you don't implement all types here, can you open a JIRA for the missing ones?

@bkietz
Copy link
Member Author

bkietz commented May 22, 2019

I'll open the follow up JIRA once this one is merged

@wesm
Copy link
Member

wesm commented May 22, 2019

I have some structural issues with the implementation. I think the filter kernel (FilterKernel seems better than MaskKernel -- MaskKernel might be a good name for a kernel that sets values to be null) needs to write into pre-allocated primitive memory. This means that the allocation needs to happen outside of the kernel.

As motivation for why this matters: suppose that we are performing the same filter on 100 or 1000 arrays in a large table (think about the operation df[selection] where df is a DataFrame). Why should we count the set bits more than once?

Another design issue: I think it could be better to take an approach more similar to CastKernel in generating functors for the types that are implemented. This issue is less serious than the preallocation issue. I'm not sure it is a good habit to be doing single- or multiple- dynamic dispatch on each invocation to the kernel's Call function

@fsaintjacques
Copy link
Contributor

fsaintjacques commented May 23, 2019

Food for though,

for the sake of this discussion, let's call a Mask a (multi)set of indices that represents row of interests in an array. Masks can be the results of kernels, e.g. compute::Compare or gandiva::Filter (note that I used the FilterKernel in the original compare PR to match with gandiva naming), or they could be given from a user explicit list of indices.

Masks are found in various forms:

  • An Arrow integer NumericArray (user given)
  • An Arrow boolean BooleanArray (compute::Compare and soon compute::IsIn kernel)
  • A gandiva::SelectionVector where T is an integer type (gandiva::Filter)

BooleanArray and SelectionVector have the properties that the integers are sorted, unique, and well defined (within bounds [0, Array.length())). Note that a bitmap is a selection vector with a different representation but can be interpreted as a list of integers. NumericArray given by users require more
care since they can be out-of-bound (undefined behavior in C/C++, can even trigger a segfault).

Masks alone are not usually of interest, their goal is to defer materialization for optimization purposes, e.g. chaining filter expressions, or practical reasons, e.g. filter on column A and aggregate on column B.

ARROW-1558 implements the materialization of an array with a (boolean) mask, but we also have this feature in the Take kernel, but with the mask only in the NumericArray form. Could we explore using the Take kernel to also support BooleanArray? Should we explicit the Mask type via a new type to Datum, e.g. an array with various meta-data flags (unique, sorted, ...)?

@bkietz
Copy link
Member Author

bkietz commented May 23, 2019

That is certainly doable. If we want to be explicit about selection type then another place that could go is TakeOptions.

@wesm
Copy link
Member

wesm commented May 23, 2019

Take and Boolean-Filter are distinct computational operations, so we need kernels that implement both of them. I think the semantics of when they are invoked is a separate matter

@jacques-n
Copy link
Contributor

jacques-n commented May 23, 2019 via email

Copy link
Contributor

@fsaintjacques fsaintjacques left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mandatory:

  • Tests with more than 3 elements. Use the random array facility for arrays of roughly 2^16 elements.

Good to have:

  • Benchmarks with each "core" kernel.

Otherwise, I'm also of the opinion that the indirection is non trivial to follow in the kernel (visitor + 2 dynamic-to-compile dispatch, that's 3 layer of indirection). This is non-blocking, but better for long term maintenance.

@bkietz bkietz force-pushed the 1558-Implement-boolean-selection-kernels branch from 7f86e37 to 0fd9088 Compare June 10, 2019 18:49
@codecov-io
Copy link

Codecov Report

Merging #4366 into master will increase coverage by 0.92%.
The diff coverage is 92.74%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #4366      +/-   ##
=========================================
+ Coverage   88.27%   89.2%   +0.92%     
=========================================
  Files         846     697     -149     
  Lines      103662   92030   -11632     
  Branches     1253       0    -1253     
=========================================
- Hits        91512   82098    -9414     
+ Misses      11903    9932    -1971     
+ Partials      247       0     -247
Impacted Files Coverage Δ
cpp/src/arrow/compute/kernels/take.cc 91.08% <100%> (ø) ⬆️
cpp/src/arrow/compute/kernels/take-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/compute/kernels/filter-test.cc 100% <100%> (+2.58%) ⬆️
cpp/src/arrow/compute/kernels/filter.h 66.66% <66.66%> (-27.09%) ⬇️
cpp/src/arrow/compute/kernels/filter.cc 79.24% <79.24%> (-16%) ⬇️
cpp/src/arrow/compute/kernels/compare.cc 88.63% <87.5%> (+1.24%) ⬆️
cpp/src/arrow/compute/kernels/compare.h 95.65% <93.75%> (-4.35%) ⬇️
cpp/src/arrow/compute/kernels/compare-test.cc 97.41% <97.41%> (ø)
go/arrow/ipc/writer.go
go/arrow/math/uint64_amd64.go
... and 149 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bed52ed...0fd9088. Read the comment docs.

@bkietz bkietz changed the title ARROW-1558: [C++] initial mask kernel impl ARROW-1558: [C++] initial boolean filter kernel impl Jun 11, 2019
@bkietz bkietz force-pushed the 1558-Implement-boolean-selection-kernels branch from 10614e2 to 1dd6b0e Compare June 13, 2019 13:38
cpp/src/arrow/compute/kernels/filter.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/filter.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/filter.cc Outdated Show resolved Hide resolved
continue;
}
null_bitmap_builder.UnsafeAppend(true);
value_slices[filtered_i] = list_array.value_slice(i);
Copy link
Contributor

@fsaintjacques fsaintjacques Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an extremely inefficient way of accomplishing this. It is untenable for large-ish arrays. Think of the number of shared_ptr construction/desctruction (I can count at least 3-4 per slices!) associated, and the ref-count on the type's shared_ptr, etc...

I prefer we simply return unsupported for this type, and/or implement it for trivial static-sized types (Numerics) where we don't need to flatmap on a vector of slices (we can then remove ListArray.value_slice). This should catch at least a good chunk of use cases.

Is suspect the proper solution (for non-trivial types) is to "expand" the bitmap by repeating the bits of the filter by the corresponding list length and recurse, like in the StructArray implementation.

I'd say remove this, and follow-up ticket.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the followup ticket, it'll be a good moment to implement benchmarks for Filter and I suggest you benchmark a FixedSizeList<Int32,1> vs Int32 to see the penalty.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll write the benchmark for FixedSizeList now, that'd be good for this PR. I disagree that inefficiency is a reason to fail to support lists; it seems more reasonable to provide a correct implementation and let optimization be driven by demand. Furthermore, if Concatenate is inefficient then that seems like a separate issue to benchmark and improve.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, filtering of fixed_size_list(int64(), 1) is significantly slower than int64():

------------------------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
FilterInt64/32768/0                         3 us          3 us     210578 null_percent=0 size=8.192k   2.25337GB/s
FilterInt64/32768/1                         4 us          4 us     170663 null_percent=1 size=8.192k   1.86804GB/s
FilterInt64/32768/10                        4 us          4 us     167087 null_percent=10 size=8.192k    1.8274GB/s
FilterInt64/32768/50                        5 us          5 us     150915 null_percent=50 size=8.192k   1.65151GB/s
FilterFixedSizeList1Int64/32768/0          95 us         95 us       7292 null_percent=0 size=8.192k   82.5558MB/s
FilterFixedSizeList1Int64/32768/1          98 us         98 us       7148 null_percent=1 size=8.192k   79.5829MB/s
FilterFixedSizeList1Int64/32768/10        107 us        107 us       6398 null_percent=10 size=8.192k   72.7087MB/s
FilterFixedSizeList1Int64/32768/50        143 us        143 us       4862 null_percent=50 size=8.192k   54.4943MB/s

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only is it slower, there's a chance there's a O(n^2) hidden in here. Try with an array of 1M elements.

Copy link
Member Author

@bkietz bkietz Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's O(n^2) at least. I'll try the expanded bitmap approach for comparison

-------------------------------------------------------------------------------------------
Benchmark                                    Time           CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------
FilterInt64/32768/0                          3 us          3 us     223833 null_percent=0 size=8.192k   2.37242GB/s
FilterInt64/32768/1                          4 us          4 us     180752 null_percent=1 size=8.192k    1.9616GB/s
FilterInt64/32768/10                         4 us          4 us     172363 null_percent=10 size=8.192k   1.88143GB/s
FilterInt64/32768/50                         5 us          5 us     153213 null_percent=50 size=8.192k   1.65837GB/s
FilterInt64/1048576/1                      247 us        247 us       2836 null_percent=1 size=262.144k   1013.64MB/s
FilterInt64/8388608/1                     1958 us       1953 us        358 null_percent=1 size=2.09715M   1024.14MB/s
FilterFixedSizeList1Int64/32768/0           94 us         94 us       7478 null_percent=0 size=8.192k   83.1916MB/s
FilterFixedSizeList1Int64/32768/1           96 us         96 us       7281 null_percent=1 size=8.192k   81.2395MB/s
FilterFixedSizeList1Int64/32768/10         104 us        104 us       6629 null_percent=10 size=8.192k   75.0306MB/s
FilterFixedSizeList1Int64/32768/50         144 us        144 us       4824 null_percent=50 size=8.192k   54.2683MB/s
FilterFixedSizeList1Int64/1048576/1       3196 us       3188 us        219 null_percent=1 size=262.144k   78.4131MB/s
FilterFixedSizeList1Int64/8388608/1      28612 us      28483 us         25 null_percent=1 size=2.09715M   70.2186MB/s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded bitmap is much faster:

-------------------------------------------------------------------------------------------
Benchmark                                    Time           CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------
FilterInt64/32768/0                          3 us          3 us     223626 null_percent=0 size=8.192k   2.37535GB/s
FilterInt64/32768/1                          4 us          4 us     175575 null_percent=1 size=8.192k    1.9186GB/s
FilterInt64/32768/10                         4 us          4 us     169822 null_percent=10 size=8.192k   1.85335GB/s
FilterInt64/32768/50                         5 us          5 us     147421 null_percent=50 size=8.192k   1.61824GB/s
FilterInt64/1048576/1                      249 us        249 us       2813 null_percent=1 size=262.144k   1003.18MB/s
FilterInt64/8388608/1                     1997 us       1997 us        351 null_percent=1 size=2.09715M   1001.68MB/s
FilterFixedSizeList1Int64/32768/0           11 us         11 us      64630 null_percent=0 size=8.192k   711.425MB/s
FilterFixedSizeList1Int64/32768/1           12 us         12 us      56194 null_percent=1 size=8.192k    627.52MB/s
FilterFixedSizeList1Int64/32768/10          12 us         12 us      56525 null_percent=10 size=8.192k   635.185MB/s
FilterFixedSizeList1Int64/32768/50          17 us         17 us      41226 null_percent=50 size=8.192k   463.433MB/s
FilterFixedSizeList1Int64/1048576/1        498 us        498 us       1414 null_percent=1 size=262.144k   501.918MB/s
FilterFixedSizeList1Int64/8388608/1       4515 us       4515 us        164 null_percent=1 size=2.09715M   442.967MB/s

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmark runtimes under 10 microseconds are suspicious. Are those per-iteration times or total execution time? Might want to output nanosecond unit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These times are per-iteration. Here they are with MinTime(1.0) and nanoseconds:

----------------------------------------------------------------------------------------------------------
Benchmark                                                   Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------------
FilterInt64/32768/0/min_time:1.000                       4311 ns       4311 ns     327915 null_percent=0 size=8.192k   1.76959GB/s
FilterInt64/32768/1/min_time:1.000                       5008 ns       5008 ns     277359 null_percent=1 size=8.192k   1.52345GB/s
FilterInt64/32768/10/min_time:1.000                      5108 ns       5108 ns     274582 null_percent=10 size=8.192k   1.49373GB/s
FilterInt64/32768/50/min_time:1.000                      5846 ns       5846 ns     242378 null_percent=50 size=8.192k   1.30513GB/s
FilterInt64/1048576/1/min_time:1.000                   324842 ns     324841 ns       4316 null_percent=1 size=262.144k   769.606MB/s
FilterInt64/8388608/1/min_time:1.000                  2618142 ns    2618130 ns        544 null_percent=1 size=2.09715M   763.904MB/s
FilterFixedSizeList1Int64/32768/0/min_time:1.000        11982 ns      11982 ns     114206 null_percent=0 size=8.192k   652.025MB/s
FilterFixedSizeList1Int64/32768/1/min_time:1.000        13540 ns      13540 ns     101108 null_percent=1 size=8.192k   576.992MB/s
FilterFixedSizeList1Int64/32768/10/min_time:1.000       13554 ns      13554 ns     101989 null_percent=10 size=8.192k   576.408MB/s
FilterFixedSizeList1Int64/32768/50/min_time:1.000       16951 ns      16951 ns      82888 null_percent=50 size=8.192k   460.877MB/s
FilterFixedSizeList1Int64/1048576/1/min_time:1.000     492454 ns     492453 ns       2832 null_percent=1 size=262.144k   507.663MB/s
FilterFixedSizeList1Int64/8388608/1/min_time:1.000    4483677 ns    4483659 ns        315 null_percent=1 size=2.09715M   446.064MB/s

cpp/src/arrow/compute/kernels/filter.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/filter.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/filter.cc Outdated Show resolved Hide resolved
}

private:
Status UnpackValuesNullCount(const ValueArray& values, const BooleanArray& filter,
Copy link
Member Author

@bkietz bkietz Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fsaintjacques currently I only use this optimization (don't check null mask when there are no nulls) for primitive types. Do you think it's worthwhile to add it into the nested types as well?

@bkietz bkietz force-pushed the 1558-Implement-boolean-selection-kernels branch from 742ae2f to b21736d Compare June 14, 2019 20:42
NO_CHILD_CASE(Map);

case Type::STRUCT: {
std::vector<std::unique_ptr<FilterKernel>> child_kernels;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized you can't do this, you need to account for the top-level parent null bitmap.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesm wesm changed the title ARROW-1558: [C++] initial boolean filter kernel impl ARROW-1558: [C++] Implement boolean filter (selection) kernel, rename comparison kernel-related functions Jun 18, 2019
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor comments -- I haven't looked too closely at the implementation details but I trust they've been reviewed adequately already. Having class FilterKernel as a public API seems to be the main thing that may need to get fixed (since it has ARROW_EXPORT on it, it seems that was the intention)

const double null_proportion;

explicit RegressionArgs(benchmark::State& state)
: size(state.range(0) / 4),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is state.range(0) and why divide it by 4?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copied from compare-benchmark.cc and now that I look at it I think it's an error in both locations; state.range(0) should be the byte size of the memory tested and it's used that way in aggregate-benchmark.cc. @fsaintjacques can you confirm?

const int64_t memory_size = state.range(0) / 4;
const int64_t array_size = memory_size / sizeof(int64_t);
const double null_percent = static_cast<double>(state.range(1)) / 100.0;
auto rand = random::RandomArrayGenerator(0x94378165);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could put this seed in a constexpr somewhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do


protected:
std::shared_ptr<DataType> type_;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean for this to be in filter.h? Can you add some docstrings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move it; I was going to follow Cast() and make only the factory public but I hadn't gotten to that yet

@wesm
Copy link
Member

wesm commented Jun 18, 2019

Thanks for working on this -- will be very useful for applications

@bkietz bkietz force-pushed the 1558-Implement-boolean-selection-kernels branch from 5a224b2 to 3d92b6e Compare June 19, 2019 13:34
@wesm
Copy link
Member

wesm commented Jun 19, 2019

@bkietz can you enable Appveyor on bkietz/arrow? It can help get more timely builds sometimes

@wesm
Copy link
Member

wesm commented Jun 20, 2019

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants