Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix filter UB and add fast path #341

Merged
merged 5 commits into from
May 26, 2021
Merged

Fix filter UB and add fast path #341

merged 5 commits into from
May 26, 2021

Conversation

ritchie46
Copy link
Contributor

@ritchie46 ritchie46 commented May 24, 2021

This also fixes UB for filter on RecordBatches. Still issue #295.

Besides this, I also added a fast path. We already do a popcount in the filter operation, it seems to me a missed opportunity to not just Arc clone the data when all values are true.

EDIT:

I also made ArrayData::new_empty public. If there is any objection to that I can make it private. IMO this should be public, as I think it should be easy to make an empty container of data structures.

@ritchie46 ritchie46 changed the title Filter Fix filter UB and add fast path May 24, 2021
@codecov-commenter
Copy link

codecov-commenter commented May 24, 2021

Codecov Report

Merging #341 (e88a3b1) into master (5ac771a) will increase coverage by 0.04%.
The diff coverage is 78.72%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   82.56%   82.60%   +0.04%     
==========================================
  Files         162      162              
  Lines       44063    44199     +136     
==========================================
+ Hits        36379    36510     +131     
- Misses       7684     7689       +5     
Impacted Files Coverage Δ
arrow/src/compute/kernels/filter.rs 91.90% <78.26%> (-1.08%) ⬇️
arrow/src/array/data.rs 71.47% <100.00%> (ø)
parquet/src/column/writer.rs 93.29% <0.00%> (ø)
arrow/src/array/array_boolean.rs 90.90% <0.00%> (ø)
parquet/src/arrow/arrow_writer.rs 98.11% <0.00%> (+0.03%) ⬆️
arrow/src/csv/reader.rs 89.91% <0.00%> (+1.54%) ⬆️
arrow/src/array/transform/boolean.rs 84.61% <0.00%> (+7.69%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ac771a...e88a3b1. Read the comment docs.

Ok(make_array(data))
if iter.filter_count == array.len() {
let data = array.data().clone();
Ok(make_array(data))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this just return array or array.clone()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dyn Array is a trait object and does not implement Sized

@alamb
Copy link
Contributor

alamb commented May 25, 2021

The MIRI failure is unrelated to this PR: #345

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ritchie46 -- This code makes sense to me. @nevi-me @Dandandan or @jorgecarleitao any thoughts?

@jorgecarleitao
Copy link
Member

Note that this is not undefined behavior as defined in Rust; the warning is just that when nulls exist, the value of the null slot will be used for filtering regardless of its validity. I.e. it is a semantic undefined behavior.

Do we have some benchmarks available? It would be good to verify that the performance improves here.

@ritchie46
Copy link
Contributor Author

Do we have some benchmarks available? It would be good to verify that the performance improves here.

Gnuplot not found, using plotters backend
all true str 10k        time:   [202.99 ns 203.24 ns 203.59 ns]                             
                        change: [-99.808% -99.807% -99.807%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

all true int 10k        time:   [187.96 ns 188.03 ns 188.09 ns]                             
                        change: [-98.627% -98.625% -98.623%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  12 (12.00%) high severe

all false str 10k       time:   [323.02 ns 323.13 ns 323.25 ns]                              
                        change: [-66.262% -64.457% -62.442%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

all false int 10k       time:   [194.24 ns 194.25 ns 194.25 ns]                              
                        change: [-58.900% -58.838% -58.787%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  3 (3.00%) high severe

Counting ns :).

Note that this benchmark is sensitive to scale, because the filter algorithm is O(n) and the fast path O(1)

@jorgecarleitao
Copy link
Member

yeap, that I would expect. :) I was thinking about the case where there is a null mask and non-null values.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! This has great consequences. E.g. it removes the need to constant remove filters in DataFusion,.

Thanks a lot @ritchie46 !

arrow/src/compute/kernels/filter.rs Outdated Show resolved Hide resolved
@Dandandan
Copy link
Contributor

Looks great! This has great consequences. E.g. it removes the need to constant remove filters in DataFusion,.

Thanks a lot @ritchie46 !

Curious, what do you mean by constant remove filters?

rename argument: 'filter' to 'predicate'
to reduce name collissions.
@jorgecarleitao
Copy link
Member

I mean that we no longer need to try to prune expressions of the form 1 == 1: they automatically get converted to trues, which amounts to a O(1) evaluation in the filter. (We may still do it to make the plan cleaner)

@ritchie46
Copy link
Contributor Author

ritchie46 commented May 26, 2021

I mean that we no longer need to try to prune expressions of the form 1 == 1: they automatically get converted to trues, which amounts to a O(1) evaluation in the filter. (We may still do it to make the plan cleaner)

This was the same reason for met doing the PR.

I paid a relatively heavy price for filtering null values on a DataFrame that did not have any nulls (so all filter values were true).

@Dandandan
Copy link
Contributor

Dandandan commented May 26, 2021

I mean that we no longer need to try to prune expressions of the form 1 == 1: they automatically get converted to trues, which amounts to a O(1) evaluation in the filter. (We may still do it to make the plan cleaner)

👍
I am not sure the call is O(1), the popcount/bitcount still has to be computed, no? Still faster than the other parts of course.

@ritchie46
Copy link
Contributor Author

I am not sure the call is O(1), the popcount/bitcount still has to be computed, no? Still faster than the other parts of course.

The call not indeed. Only the fast paths.

@alamb alamb merged commit e85dc98 into apache:master May 26, 2021
@alamb
Copy link
Contributor

alamb commented May 26, 2021

Thanks @ritchie46 !

alamb pushed a commit that referenced this pull request May 26, 2021
* fix ub in filter record_batch

* filter fast path

* add all false fast path

* use new_empty_array

* rename filter kernel argument

rename argument: 'filter' to 'predicate'
to reduce name collissions.
@Dandandan
Copy link
Contributor

Dandandan commented May 27, 2021

This PR and benchmark got me thinking - it is faster to clone an array than to create the equivalent null array from scratch (as can be seen in micro benchmarks above).
But we could clone the contents of the array and only create the zeroed null buffer.

It can increase memory usage though - as the contents might be big, for string arrays, which could be deallocated otherwise if it has a reference count of 0. On the other hand - this is something we do too for slice and other kernels.

@alamb
Copy link
Contributor

alamb commented May 27, 2021

But we could clone the contents of the array and only create the zeroed null buffer.

I think this makes a lot of sense personally (even if it might increase the peak memory usage of the system). Most filter masks are applied immediately (e.g. as part of some filtering operation) so I think so the amount of additional time that a large string array would be held on to is likely minimal

alamb added a commit that referenced this pull request May 27, 2021
* fix ub in filter record_batch

* filter fast path

* add all false fast path

* use new_empty_array

* rename filter kernel argument

rename argument: 'filter' to 'predicate'
to reduce name collissions.

Co-authored-by: Ritchie Vink <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants