
[FEA] provide a method to check "if-contains-key" for a map column #8120

Closed
wjxiz1992 opened this issue Apr 30, 2021 · 5 comments · Fixed by #8209
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments


wjxiz1992 commented Apr 30, 2021

Is your feature request related to a problem? Please describe.
Spark 3.1.1 introduced ANSI-mode behaviour for the GetMapValue and ElementAt expressions.
When dealing with this case, I need to know if the map contains the specific key. This is needed for NVIDIA/spark-rapids#2272

The current cuDF API map_lookup returns null when the key is not found in the map. That makes the result ambiguous when the map does contain the key but maps it to a null value (e.g. {"a": null}): given a ColumnVector containing null, I cannot tell which case produced it.
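To make the ambiguity concrete, here is a minimal host-side sketch using a plain std::map rather than the real cuDF column types (the names lookup, map_contains, and MapRow are illustrative, not cuDF API): a lookup that returns null cannot distinguish {"a": null} from a missing key, while a contains-style method can.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical single-row stand-in for one entry of a map column;
// std::nullopt plays the role of a null value.
using MapRow = std::map<std::string, std::optional<int>>;

// Mimics the current map_lookup behaviour: a missing key and a null value
// both come back as "null" (std::nullopt), so the caller cannot tell them apart.
std::optional<int> lookup(MapRow const& row, std::string const& key) {
  auto it = row.find(key);
  if (it == row.end()) return std::nullopt;
  return it->second;
}

// The requested contains-style check: true iff the key exists,
// regardless of whether its value is null.
bool map_contains(MapRow const& row, std::string const& key) {
  return row.count(key) > 0;
}
```

With a row {"a": null}, lookup returns null for both "a" (null value) and "b" (missing key), while map_contains cleanly separates the two cases.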

Describe the solution you'd like
Provide a method, e.g. map_contains, that returns a column vector containing a boolean value for each row.

Describe alternatives you've considered
Any solution that helps me distinguish the two meanings of null is fine.

@wjxiz1992 wjxiz1992 added feature request New feature or request Needs Triage Need team to review and classify labels Apr 30, 2021

harrism commented May 4, 2021

@sameerz can you help clarify this issue?

@harrism harrism added libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS and removed Needs Triage Need team to review and classify labels May 4, 2021
@wjxiz1992
Member Author

Hi @harrism , this comes from one customer's use of the Spark expression "element_at".
The latest Spark version, 3.1.1, adds a new check on the key in a map for this expression:

  • if the key doesn't exist, throw exception.

For our current implementation, which is based on version 3.0.0:

  • if the key doesn't exist, return null as its value.

So the goal for this issue is to match the 3.1.1 Spark behaviour -- which requires the ability to detect if the key exists or not.

I'm new to cuDF and not familiar with the full API surface; the only map-related method I found is map_lookup, and it automatically returns null when the key is not there.

Normally I could just throw an exception whenever I get a null. However, Spark allows a map like {"a": null}, so when I have a null in hand I don't know whether to throw an exception, because the null can also be a legitimate value for the key.

I'd appreciate your suggestions.

@jrhemstad
Contributor

libcudf doesn't have a notion of a "map" column. I don't know how the current Java map_lookup is implemented.

Seems like the contains API is sufficient here.

bool contains(column_view const& col, scalar const& value);


wjxiz1992 commented May 11, 2021

Hi @nvdbaranec , I looked more into the map_lookup-related code and found that the basic logic is already implemented in get_gather_map_for_map_values: it calls search_each_list and returns the corresponding indexes as a new column, with -1 as the index when the key is not found. Given that column of ints, I can check each entry against -1 to tell whether the key is contained in that row's map.
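The index-to-boolean step described above can be sketched on the host with plain std::vector (the function name is illustrative; the real implementation would operate on device columns):

```cpp
#include <cassert>
#include <vector>

// Sketch of the approach described above: the gather map produced for
// map_lookup holds one index per row, with -1 meaning the key was not
// found. A mapContains-style method only needs to turn indices into bools.
std::vector<bool> indices_to_contains(std::vector<int> const& gather_map) {
  std::vector<bool> out;
  out.reserve(gather_map.size());
  for (int idx : gather_map) out.push_back(idx != -1);
  return out;
}
```

For a gather map {3, -1, 0}, this yields {true, false, true}: the key is present in rows 0 and 2 but missing from row 1.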

I am trying to extract that part and expose it as a new JNI method, e.g. "mapContains".

Does this approach make sense? I put the draft code in #8209; could you take a look and provide more insights? Thanks a lot!

@nvdbaranec
Contributor

Added some comments in #8209

@harrism harrism changed the title [FEA] provide a method to check "if-cotains-key" for a map column [FEA] provide a method to check "if-contains-key" for a map column May 11, 2021
rapids-bot bot pushed a commit that referenced this issue May 14, 2021
To close #8120

As required by Spark 3.1.1, when ANSI mode is enabled, GetMapValue should throw an exception when the key is not found in a row's map.
So the plugin side needs to check whether a map column contains the specific key in all rows.

The newly added method `mapContains` in this PR returns a boolean column, where _false_ means the key was not found.
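As a rough illustration of how a caller might consume that boolean column under ANSI mode (a hypothetical helper name, with a plain std::vector standing in for the real column; this is not the actual plugin code):

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Illustrative sketch only: under ANSI mode, the caller would scan the
// boolean column returned by mapContains and fail the query if any row's
// map is missing the key.
void check_ansi_map_access(std::vector<bool> const& contains_key) {
  for (bool found : contains_key) {
    if (!found) throw std::invalid_argument("key not found in map (ANSI mode)");
  }
}
```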

Authors:
  - Allen Xu (https://github.com/wjxiz1992)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #8209