Implement `lists::index_of()` to find positions in list rows #9510

mythrocks · 2021-10-23T00:32:20Z

Prelude

lists::contains() (introduced in #7039) returns a BOOL8 column, indicating whether the specified search_key(s) exist at all in each corresponding list row of an input LIST column. It does not return the actual position.

`index_of()`

This commit introduces lists::index_of(), to return the INT32 positions of the specified search_key(s) in a LIST column.

The search keys may be searched for using either FIND_FIRST (which finds the position of the first occurrence), or FIND_LAST (which finds the last occurrence). Both column_view and scalar search keys are supported.

As with lists::contains(), nested types are not supported as search keys in lists::index_of().

If the search_key cannot be found, that output row is set to -1. Additionally, the row output[i] is set to null if:

The search_key(scalar) or search_keys[i](column_view) is null.
The list row lists[i] is null

In all other cases, output[i] should contain a non-negative value.

Semantic changes for `lists::contains()`

This commit also modifies the semantics of lists::contains(): it will now return nulls only for the following cases:

The search_key(scalar) or search_keys[i](column_view) is null.
The list row lists[i] is null

In all other cases, a non-null bool is returned. Specifically lists::contains() no longer conforms to SQL semantics of returning NULL for list rows that don't contain the search key, while simultaneously containing nulls. In this case, false is returned.

`lists::contains_null_elements()`

A new function has been introduced to check if each list row contains null elements. The semantics are similar to lists::contains(), in that the column returned is BOOL8 typed:

If even 1 element in a list row is null, the returned row is true.
If no element is null, the returned row is false.
If the list row is null, the returned row is null.
If the list row is empty, the returned row is false.

The current implementation is an inefficient placeholder, to be replaced once (#9588) is available. It is included here to reconstruct the SQL semantics dropped from lists::contains().

mythrocks · 2021-10-23T00:37:50Z

index_of() and contains() share a backend implementation. There is a difference in semantics between the two in the following case:

The list row does not contain the search key. And...
The list row also contains null values

For any list row where both the above hold true:

contains() returns null for that row.
index_of() returns -1, indicating that the row wasn't found.

Care needed to be taken to support both cases in the same code.

codecov · 2021-10-23T01:55:07Z

Codecov Report

Merging #9510 (a5beb0f) into branch-22.02 (967a333) will decrease coverage by 0.08%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.02    #9510      +/-   ##
================================================
- Coverage         10.49%   10.40%   -0.09%     
================================================
  Files               119      119              
  Lines             20305    20507     +202     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18373     +198

Impacted Files	Coverage Δ
python/dask_cudf/dask_cudf/sorting.py	`92.30% <0.00%> (-0.61%)`	⬇️
python/cudf/cudf/__init__.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/index.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/parquet.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/series.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/utils.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/dtypes.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/ioutils.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/dataframe.py	`0.00% <0.00%> (ø)`
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c5a85a...a5beb0f. Read the comment docs.

jrhemstad · 2021-10-25T14:15:58Z

If the search_key cannot be found, that output row is set to -1.

Why is that? Shouldn't it be made null?

mythrocks · 2021-10-25T18:07:38Z

Why is that? Shouldn't it be made null?

This is so that we are able to support the SQL semantics of array_index() in Spark, (Teradata, and possibly others) which make the following distinction:

If list row is null, or the search key is null, return null.
If neither is null, return a non-null index indicating "valid, but not found".

It would be difficult disambiguate -1 from null if both conditions were to return null. Replacing -1 with null if necessary as an after step seemed easier.

The SQL semantics are annoying in various flavours:

The array_index() indices are 1-based. spark-rapids is going to have to add 1 to the result of lists::index_of().
array_contains() does not follow the semantics of array_index(), as described here.

mythrocks · 2021-10-25T18:14:27Z

The end goal includes being able to use lists::index_of() for key lookups in maps (i.e. LIST<STRUCT<key, value>>.
I hope to do this with:

auto indices = lists::index_of(key_list, key_scalar, FIND_LAST);
auto values  = lists::extract_list_element(value_list, *indices);

The snag is that indices will need to be preprocessed to replace nulls and -1 with MAXINT.

jrhemstad · 2021-10-25T21:37:01Z

There is a difference in semantics between the two in the following case:

The list row does not contain the search key. And...

The list row also contains null values

For any list row where both the above hold true:

contains() returns null for that row.

index_of() returns -1, indicating that the row wasn't found.

To be clear, is this the case when the list contains one or more null values? Or the list contains all null values?

In other words,

key: 5
list0 : [1, 2, 3, 4]
list1: [1, 2, NULL, 4]
list2: [ NULL, NULL]

What does contains return for these lists?

mythrocks · 2021-10-25T22:13:07Z

What does contains return for these lists?

Apologies for not making this clearer.
The described semantics are for when even one null is found in a LIST row. Ergo:

auto input             = [ [1,2,3,4], [1,2,NULL,4], [NULL,NULL], [1,NULL,5], [],    NULL ];
contains(input, 5)    == [   false,      NULL,       NULL,         true,     false, NULL ];
index_of(input, 5)    == [   -1,          -1,          -1,            2,     -1,    NULL ];

@revans2 will keep me honest here, in how this is congruent with the semantics of the equivalent SQL.

jrhemstad · 2021-10-25T23:32:52Z

Wow, these semantics are completely bonkers. There's no way we should repeat this behavior in C++.

I understand the need to differentiate between "key not found" and "list is null", but the behavior described is horribly non-intuitive and inconsistent.

Here's what I suggest the behavior should be:

auto input             = [ [1,2,3,4], [1,2,NULL,4], [NULL,NULL], [1,NULL,5], [],    NULL ];
contains(input, 5)    == [   false,      false,       false,         true,     false, NULL ];
index_of(input, 5)    == [   -1,          -1,          -1,            2,     -1,    NULL ];

The previously described null behavior of contains can be emulated by a contains_nulls API for lists that returns true/false if a list contains nulls. This result can be combined with the results from contains to give the desired output.

mythrocks · 2021-10-26T00:08:02Z

can be emulated by a contains_nulls API...

Hmm. Now that you point it out, it does sound logical to break it up this way. We can maintain the same semantics between contains() and index_of() (and contains_null). That makes sense.

mythrocks · 2021-10-28T22:57:56Z

The previously described null behavior of contains can be emulated...

I should call out that lists::contains() already has this null behaviour. spark-rapids depends on this, starting from NVIDIA/spark-rapids/pull/1565. Alteration of this behaviour would be a breaking change.

jrhemstad · 2021-10-28T23:28:17Z

I should call out that lists::contains() already has this null behaviour.

Had I noticed it originally, I wouldn't have let it be merged in the first place ;)

Alteration of this behaviour would be a breaking change.

Break away! Just use the right label.

mythrocks · 2021-11-04T00:09:43Z

I'm taking this PR out of 21.12, given that it'll break compat. I should have this early in the next release.

`lists::contains()` (introduced in rapidsai#7039) returns a `BOOL8` column, indicating whether the specified search_key(s) exist at all in each corresponding list row of an input LIST column. It does not return the actual position. This commit introduces `lists::index_of()`, to return the INT32 positions of the specified search_key(s) in a LIST column. The search keys may be searched for using either `FIND_FIRST` (which finds the position of the first occurrence), or `FIND_LAST` (which finds the last occurrence). Both column_view and scalar search keys are supported. As with `lists::contains()`, nested types are not supported as search keys is `lists::index_of()`. If the search_key cannot be found, that output row is set to `-1`. Additionally, the row `output[i]` is set to null if: 1. The search_key(scalar) or search_keys[i](column_view) is null. 2. The list row `lists[i]` is null In all other cases, `output[i]` should contain a non-negative value.

Signed-off-by: MithunR <[email protected]>

mythrocks · 2021-12-09T08:50:51Z

Rerun tests

mythrocks · 2021-12-09T19:31:46Z

#9870 seems to have fixed the cmake-format failures. It should now be possible to rerun tests.

java/src/main/java/ai/rapids/cudf/ColumnView.java

java/src/main/native/src/ColumnViewJni.cpp

1. `const` all the things. 2. Overload function names.

mythrocks · 2021-12-14T05:51:35Z

Applied DO NOT MERGE label. Currently, merging this would break spark-rapids.

I'm working on an additional JNI change, to allow setting a column's null-mask to the result of a boolean operation.

mythrocks · 2021-12-16T18:07:32Z

I'm working on an additional JNI change, to allow setting a column's null-mask to the result of a boolean operation.

Filed here.

Removed DO NOT MERGE label. The spark-rapids plugin code change to accept this breakage is ready.

@codereport, I wonder if you might have a moment to review.

codereport

This code looks really good 🔥 Very "expression oriented" and easy to read! Only comments are aesthetic

cpp/include/cudf/lists/contains.hpp

cpp/src/lists/contains.cu

mythrocks · 2021-12-20T16:26:20Z

@gpucibot merge

mythrocks · 2021-12-20T16:26:54Z

I've merged this change now.
Thank you all for the reviews.

* Accommodate altered semantics of `cudf::lists::contains()` rapidsai/cudf/pull/9510 changes the semantics of lists::contains(), with regard to rows containing nulls. Specifically, if a list row contains at least one null element, and is found not to contain the search key, libcudf will now return false instead of null. SparkSQL expects to return null in those cases. This commit accommodates the change in libcudf's semantics, to keep its own existing behaviour. Signed-off-by: MithunR <[email protected]>

mythrocks requested a review from a team as a code owner October 23, 2021 00:32

mythrocks requested review from harrism and codereport October 23, 2021 00:32

mythrocks self-assigned this Oct 23, 2021

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 23, 2021

mythrocks added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change labels Oct 23, 2021

mythrocks marked this pull request as draft October 26, 2021 00:08

mythrocks added 4 commits November 23, 2021 14:34

Change contains() semantics to index_of().

2c69cad

Removed is_found where not required.

5283b2b

Ignore unused variables in structured binding.

c961d14

Signed-off-by: MithunR <[email protected]>

mythrocks force-pushed the lists-index-of branch from bff7680 to c961d14 Compare November 23, 2021 23:25

github-actions bot added the CMake CMake build issue label Nov 23, 2021

mythrocks added 2 commits December 8, 2021 17:02

Collapse finder<> definitions.

53744ad

Fix function name in Java.

e07393f

mythrocks requested a review from harrism December 9, 2021 08:50

mythrocks requested a review from brandon-b-miller December 9, 2021 19:43

mythrocks added 2 commits December 10, 2021 11:43

Better variable names.

c304684

Merge remote-tracking branch 'origin/branch-22.02' into lists-index-of

6a744c8

jlowe reviewed Dec 13, 2021

View reviewed changes

java/src/main/java/ai/rapids/cudf/ColumnView.java Outdated Show resolved Hide resolved

java/src/main/native/src/ColumnViewJni.cpp Show resolved Hide resolved

java/src/main/native/src/ColumnViewJni.cpp Outdated Show resolved Hide resolved

JNI Review changes.

6a04e66

1. `const` all the things. 2. Overload function names.

jlowe approved these changes Dec 13, 2021

View reviewed changes

mythrocks requested a review from jlowe December 13, 2021 23:22

jlowe approved these changes Dec 13, 2021

View reviewed changes

harrism approved these changes Dec 14, 2021

View reviewed changes

mythrocks added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Dec 14, 2021

mythrocks mentioned this pull request Dec 14, 2021

Accommodate altered semantics of cudf::lists::contains() NVIDIA/spark-rapids#4361

Merged

mythrocks removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Dec 16, 2021

codereport approved these changes Dec 17, 2021

View reviewed changes

cpp/include/cudf/lists/contains.hpp Outdated Show resolved Hide resolved

cpp/src/lists/contains.cu Outdated Show resolved Hide resolved

mythrocks added 2 commits December 17, 2021 09:51

Merge remote-tracking branch 'origin/branch-22.02' into lists-index-of

043c111

Review: Formatting for comments, doxygen.

a5beb0f

mythrocks added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Dec 17, 2021

rapids-bot bot merged commit a4dc42d into rapidsai:branch-22.02 Dec 20, 2021

mythrocks mentioned this pull request Jun 7, 2022

Refactor lists::contains #11019

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `lists::index_of()` to find positions in list rows #9510

Implement `lists::index_of()` to find positions in list rows #9510

mythrocks commented Oct 23, 2021 •

edited

Loading

mythrocks commented Oct 23, 2021 •

edited

Loading

codecov bot commented Oct 23, 2021 •

edited

Loading

jrhemstad commented Oct 25, 2021

mythrocks commented Oct 25, 2021 •

edited

Loading

mythrocks commented Oct 25, 2021

jrhemstad commented Oct 25, 2021

mythrocks commented Oct 25, 2021

jrhemstad commented Oct 25, 2021

mythrocks commented Oct 26, 2021

mythrocks commented Oct 28, 2021 •

edited

Loading

jrhemstad commented Oct 28, 2021

mythrocks commented Nov 4, 2021

mythrocks commented Dec 9, 2021

mythrocks commented Dec 9, 2021

mythrocks commented Dec 14, 2021

mythrocks commented Dec 16, 2021 •

edited

Loading

codereport left a comment

mythrocks commented Dec 20, 2021

mythrocks commented Dec 20, 2021

Implement lists::index_of() to find positions in list rows #9510

Implement lists::index_of() to find positions in list rows #9510

Conversation

mythrocks commented Oct 23, 2021 • edited Loading

Prelude

index_of()

Semantic changes for lists::contains()

lists::contains_null_elements()

mythrocks commented Oct 23, 2021 • edited Loading

codecov bot commented Oct 23, 2021 • edited Loading

Codecov Report

jrhemstad commented Oct 25, 2021

mythrocks commented Oct 25, 2021 • edited Loading

mythrocks commented Oct 25, 2021

jrhemstad commented Oct 25, 2021

mythrocks commented Oct 25, 2021

jrhemstad commented Oct 25, 2021

mythrocks commented Oct 26, 2021

mythrocks commented Oct 28, 2021 • edited Loading

jrhemstad commented Oct 28, 2021

mythrocks commented Nov 4, 2021

mythrocks commented Dec 9, 2021

mythrocks commented Dec 9, 2021

mythrocks commented Dec 14, 2021

mythrocks commented Dec 16, 2021 • edited Loading

codereport left a comment

Choose a reason for hiding this comment

mythrocks commented Dec 20, 2021

mythrocks commented Dec 20, 2021

Implement `lists::index_of()` to find positions in list rows #9510

Implement `lists::index_of()` to find positions in list rows #9510

mythrocks commented Oct 23, 2021 •

edited

Loading

`index_of()`

Semantic changes for `lists::contains()`

`lists::contains_null_elements()`

mythrocks commented Oct 23, 2021 •

edited

Loading

codecov bot commented Oct 23, 2021 •

edited

Loading

mythrocks commented Oct 25, 2021 •

edited

Loading

mythrocks commented Oct 28, 2021 •

edited

Loading

mythrocks commented Dec 16, 2021 •

edited

Loading