Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] map key deduplication with last one wins #9124

Closed
revans2 opened this issue Aug 26, 2021 · 9 comments · Fixed by #9345
Closed

[FEA] map key deduplication with last one wins #9124

revans2 opened this issue Aug 26, 2021 · 9 comments · Fixed by #9345
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Aug 26, 2021

Is your feature request related to a problem? Please describe.
In Spark it is configurable what to do when duplicate keys are in a Map type. By default we need to throw an exception, and I filed #9123 for that. The other option is LAST_WINS. Which means if there are duplicate keys the last values should overwrite the previous ones. We store maps as a list of Struct(key, value). So we need a way to be able to do this kind of processing on a struct.

Describe the solution you'd like
Ideally we want to break this down into smaller-reusable parts that we can combine together into what we want.

I am coming up with a few new concepts here so please feel free to correct any confusion I might have/etc.

There is already a drop_list_duplicates API. I think what I would like to see is a version of this that returns a "list gather map".

I envision a list gather map as a column of lists of int values where each list is a gather map for the another list.

For example if I had input data like

Column A:
{
[10, 9, 8, 9],
[7, 6, 6],
null,
[5, 4, 5, 3, 2, 1]
}

And a list gather map of:
{
[0, 3, 2],
[0, 2],
null,
[2, 1, 3, 4, 5]
}

I could call a list gather on it to produce...
Gathered Column A:
Column A:
{
[10, 9, 8],
[7, 6],
null,
[5, 4, 3, 2, 1]
}

But I could also have a "map" column of:
{
[(10, value-0), (9, value-1), (8, value-2), (9, value-3)],
[(7, value-0), (6, value-1), (6, value-2)],
null,
[(5, value-0), (4, value-1), (5, value-2), (3, value-3), (2, value-4), (1, value-5)]
}

and use the same list gather map to produce:
{
[(10, value-0), (9, value-3), (8, value-2)],
[(7, value-0), (6, value-2)],
null,
[(5, value-2), (4, value-1), (3, value-3), (2, value-4), (1, value-5)]
}

I have not thought through the corner cases for this type of an operation, like nulls in the list column, but not in the list gather map. But I think it should not be too hard to work them out.

This would also open up a lot of possibilities as the back end for processing of lists and maps. Things like filtering the values in a list, which is something we will need to support. Or sorting values in a list by things other than just the values in that list.

To come full circle, we would then need a drop_list_duplicates that could respect LAST_WINS vs FIRST_WINS or something like that.

Describe alternatives you've considered
I honestly cannot think of a good way to do this without some help from cudf.

Additional context
None

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 26, 2021
@beckernick beckernick added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 26, 2021
@ttnghia ttnghia self-assigned this Sep 2, 2021
@jrhemstad
Copy link
Contributor

I envision a list gather map as a column of lists of int values where each list is a gather map for the another list.

I think what you want already exists: https://github.com/rapidsai/cudf/blob/branch-21.10/cpp/include/cudf/lists/gather.hpp#L30-L59

@ttnghia
Copy link
Contributor

ttnghia commented Sep 2, 2021

Oh yeah!!! Then the remaining here is to add an option "KEEP_FIST" or "KEEP_LAST" to drop_list_duplicates.

I confirmed with Bobby about this. So the second part of this issue has already been implemented (segmented_gather()). The first part as mentioned above is still relevant---We need an API that is similar to drop_list_duplicates but it returns a list gather map such that segmented_gather() operates on its output will produce the results exactly the same as drop_list_duplicates. Maybe we should call it unique_segmented_gather_map() or so.

@jrhemstad
Copy link
Contributor

Can't we just add a duplicate_keep_option argument to drop_list_duplicates?

@ttnghia
Copy link
Contributor

ttnghia commented Sep 3, 2021

No, the output of drop_list_duplicates is just lists of values. If the input is [1, 1, 1, 2], you get the output is [1, 2] regardless of the keep option.

On the other hand, if you have the output is a list gather map, you will get either [0, 3] or [2, 3] depending on you want KEEP_FIRST or KEEP_LAST.

I have no idea why stream compaction drop_duplicates ever needs the duplicate_keep_option if its output is just values? Can you provide any example?

The stream compaction API drop_duplicates operates on pair of input-keys so it can have an option duplicate_keep_option. Here, drop_list_duplicates just has one input.

@jrhemstad
Copy link
Contributor

Hm, I wonder if we can extend the key/payload idea to drop_list_duplicates. It's trickier given the nested structure would need to match.

@ttnghia
Copy link
Contributor

ttnghia commented Sep 3, 2021

This is another good approach. I just check drop_list_duplicates and see that I was implementing it by generating an index array for the unique entries then gathering list entries from the child column. If we pass in 2 inputs (keys + values) then just call gather on the child column of the values input. For the current usage, we pass in the same lists_column_view obj for both keys and values inputs.

I'm just not sure if we ever need to output the segmented_gather_map to use somewhere else. @revans2?

@revans2
Copy link
Contributor Author

revans2 commented Sep 7, 2021

I don't see another place where we would use it at this time. But if we ever do need it we can modify APIs to expose more of the underlying functionality.

@ttnghia
Copy link
Contributor

ttnghia commented Sep 7, 2021

Great. I'm going to implement that way (drop_list_duplicates will operate on 2 inputs lists columns of keys-values).
Nested types will be another topic and can always be added later.

mythrocks added a commit to mythrocks/cudf that referenced this issue Sep 10, 2021
Fixes rapidsai#9124.

Adds an overload of `extract_list_element()` where the indices
may be specified as a column_view.

This function returns a list element from a potentially different index,
for each list row. The semantics of the scalar-index version of the function
are retained. i.e.:

0. The index is 0-based.
1. if (list_row == null) return null;
2. if (index > list_row.size()) return null;
3. if (index == null) return null;
4. if (index < 0 || -index <= length) return list_row[length + index];
@mythrocks
Copy link
Contributor

(Yikes. Accidentally linked to this issue in my last PR. Please ignore.)

rapids-bot bot pushed a commit that referenced this issue Nov 11, 2021
…ns (#9345)

This PR changes the interface of `lists::drop_list_duplicates` such that it may accept a second (optional) input `values` lists column, and returns a pairs of lists columns containing the results of copying the input column without duplicate entries.

If the optional `values` column is given, the users are responsible to have the keys-values columns having the same number of entries in each row. Otherwise, the results will be undefined.

When copying the key entries, the corresponding value entries are also copied at the same time. A parameter `duplicate_keep_option` reused from stream compaction is used to specify which duplicate keys will be copying.

This closes #9124, and blocked by #9425.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - https://github.com/nvdbaranec

URL: #9345
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants