Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if a map contains a specific key [skip ci] #8209

Merged
merged 13 commits into from
May 14, 2021

Conversation

wjxiz1992
Copy link
Member

@wjxiz1992 wjxiz1992 commented May 11, 2021

To close #8120

As required in Spark 3.1.1, when ANSI mode is enabled, GetMapValue should throw an exception when the key is not found in the map in a row.
So plugin side needs to check if a map column contains the specific key in all rows.

The new added method mapContains in this PR should return a column of boolean, where false means key is not found.

@firestarman firestarman added non-breaking Non-breaking change feature request New feature or request labels May 11, 2021
@firestarman firestarman changed the title [WIP] Check if a map contains a specific key [WIP] Check if a map contains a specific key [skip ci] May 11, 2021
@@ -127,6 +127,35 @@ get_gather_map_for_map_values(column_view const &input, string_scalar &lookup_ke
} // namespace

namespace jni {

std::unique_ptr<column> map_contains(column_view const &map_column, string_scalar lookup_key,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to reuse this function in map_lookup instead of duplicating codes.

* @param has_nulls Whether the input column might contain null list-rows, or null keys.
* @param stream The CUDA stream
* @param mr The device memory resource to be used for allocations
* @return A string_view column with the value from the first match in each list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for return is not correct.

@codecov
Copy link

codecov bot commented May 11, 2021

Codecov Report

Merging #8209 (bc1236f) into branch-0.20 (51336df) will decrease coverage by 0.00%.
The diff coverage is n/a.

❗ Current head bc1236f differs from pull request most recent head 141698e. Consider uploading reports for the commit 141698e to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.20    #8209      +/-   ##
===============================================
- Coverage        82.88%   82.88%   -0.01%     
===============================================
  Files              103      104       +1     
  Lines            17668    17907     +239     
===============================================
+ Hits             14645    14843     +198     
- Misses            3023     3064      +41     
Impacted Files Coverage Δ
python/cudf/cudf/core/tools/datetimes.py 80.42% <0.00%> (-4.11%) ⬇️
python/cudf/cudf/core/column/decimal.py 90.20% <0.00%> (-2.72%) ⬇️
python/cudf/cudf/core/column/datetime.py 88.03% <0.00%> (-1.88%) ⬇️
python/cudf/cudf/core/column/struct.py 94.73% <0.00%> (-1.56%) ⬇️
python/cudf/cudf/utils/dtypes.py 82.20% <0.00%> (-1.24%) ⬇️
python/dask_cudf/dask_cudf/groupby.py 91.28% <0.00%> (-0.88%) ⬇️
python/cudf/cudf/core/series.py 91.17% <0.00%> (-0.56%) ⬇️
python/cudf/cudf/core/index.py 92.52% <0.00%> (-0.55%) ⬇️
python/cudf/cudf/core/column/column.py 88.20% <0.00%> (-0.44%) ⬇️
python/cudf/cudf/core/column/lists.py 86.98% <0.00%> (-0.43%) ⬇️
... and 28 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2c67de...141698e. Read the comment docs.

@wjxiz1992 wjxiz1992 changed the title [WIP] Check if a map contains a specific key [skip ci] Check if a map contains a specific key [skip ci] May 11, 2021
@wjxiz1992 wjxiz1992 marked this pull request as ready for review May 11, 2021 11:48
@wjxiz1992 wjxiz1992 requested a review from a team as a code owner May 11, 2021 11:48
java/src/main/native/src/ColumnViewJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/map_lookup.cu Outdated Show resolved Hide resolved
java/src/main/native/src/map_lookup.cu Outdated Show resolved Hide resolved
java/src/main/native/src/map_lookup.cu Outdated Show resolved Hide resolved
java/src/main/native/src/map_lookup.cu Outdated Show resolved Hide resolved
java/src/main/native/src/map_lookup.hpp Outdated Show resolved Hide resolved
java/src/main/native/src/map_lookup.hpp Outdated Show resolved Hide resolved

auto gather_map = has_nulls ?
Copy link
Contributor

@nvdbaranec nvdbaranec May 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a better alternative here. We have a list<struct<key, value>> column and we want to find whether or not each row contains the incoming key. That to me sounds like cudf::contains() if we were passing a list(string) column where the strings were just the keys. It should be possible to construct a fake column_view here that gives us this structure. Roughly:

lists_column_view lcv(map_column);
structs_column_view scv(lcv.child());
  
std::vector<column_view> children;
children.push_back(lcv.offsets());  // offsets
children.push_back(scv.child(0));   // keys (a string column)

column_view list_of_keys(map_column.type(), map_column.size(),
  nullptr, map_column.null_mask(), map_column.null_count(), 0, children);

@mythrocks is my thinking here about contains() sound?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a better alternative here. We have a list<struct<key, value>> column and we want to find whether or not each row contains the incoming key. That to me sounds like cudf::contains() if we were passing a list(string) column where the strings were just the keys. It should be possible to construct a fake column_view here that gives us this structure. Roughly:

lists_column_view lcv(map_column);
structs_column_view scv(lcv.child());
  
std::vector<column_view> children;
children.push_back(lcv.offsets());  // offsets
children.push_back(scv.child(0));   // keys (a string column)

column_view list_of_keys(map_column.type(), map_column.size(),
  nullptr, map_column.null_mask(), map_column.null_count(), 0, children);

@mythrocks is my thinking here about contains() sound?

Thank you Dave, I've updated the code according to cudf::contains().
The only change here is I manully added a null_mask(0) for the contains_column. To make nulls count in all_aggregation.

Please help review, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nvdbaranec, that makes perfect sense. Thank you for the suggestion.

I'll need to examine the null mask issue more closely before I can comment.

List<HostColumnVector.StructData> list4 = Arrays.asList(new HostColumnVector.StructData("a", "g"));
List<HostColumnVector.StructData> list5 = Arrays.asList(new HostColumnVector.StructData("f", "h"));
List<HostColumnVector.StructData> list6 = Arrays.asList(new HostColumnVector.StructData("a", null));
List<HostColumnVector.StructData> list7 = Arrays.asList(new HostColumnVector.StructData(null, null));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/MapType.scala#L27
this case seems not useful, but still keep it here since we manually added a null_mask(0) for null values.

java/src/main/java/ai/rapids/cudf/ColumnView.java Outdated Show resolved Hide resolved
java/src/main/native/src/map_lookup.cu Outdated Show resolved Hide resolved
@wjxiz1992 wjxiz1992 requested a review from jlowe May 13, 2021 13:52
java/src/main/java/ai/rapids/cudf/ColumnView.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnView.java Outdated Show resolved Hide resolved
java/src/main/native/src/map_lookup.cu Outdated Show resolved Hide resolved
java/rmm_log.txt Outdated Show resolved Hide resolved
java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java Outdated Show resolved Hide resolved
@wjxiz1992
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 3d12d63 into rapidsai:branch-0.20 May 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Java Affects Java cuDF API. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] provide a method to check "if-contains-key" for a map column
6 participants