[FEA] map key deduplication with last one wins #9124

revans2 · 2021-08-26T14:12:28Z

Is your feature request related to a problem? Please describe.
In Spark it is configurable what to do when duplicate keys are in a Map type. By default we need to throw an exception, and I filed #9123 for that. The other option is LAST_WINS. Which means if there are duplicate keys the last values should overwrite the previous ones. We store maps as a list of Struct(key, value). So we need a way to be able to do this kind of processing on a struct.

Describe the solution you'd like
Ideally we want to break this down into smaller-reusable parts that we can combine together into what we want.

I am coming up with a few new concepts here so please feel free to correct any confusion I might have/etc.

There is already a drop_list_duplicates API. I think what I would like to see is a version of this that returns a "list gather map".

I envision a list gather map as a column of lists of int values where each list is a gather map for the another list.

For example if I had input data like

Column A:
{
[10, 9, 8, 9],
[7, 6, 6],
null,
[5, 4, 5, 3, 2, 1]
}

And a list gather map of:
{
[0, 3, 2],
[0, 2],
null,
[2, 1, 3, 4, 5]
}

I could call a list gather on it to produce...
Gathered Column A:
Column A:
{
[10, 9, 8],
[7, 6],
null,
[5, 4, 3, 2, 1]
}

But I could also have a "map" column of:
{
[(10, value-0), (9, value-1), (8, value-2), (9, value-3)],
[(7, value-0), (6, value-1), (6, value-2)],
null,
[(5, value-0), (4, value-1), (5, value-2), (3, value-3), (2, value-4), (1, value-5)]
}

and use the same list gather map to produce:
{
[(10, value-0), (9, value-3), (8, value-2)],
[(7, value-0), (6, value-2)],
null,
[(5, value-2), (4, value-1), (3, value-3), (2, value-4), (1, value-5)]
}

I have not thought through the corner cases for this type of an operation, like nulls in the list column, but not in the list gather map. But I think it should not be too hard to work them out.

This would also open up a lot of possibilities as the back end for processing of lists and maps. Things like filtering the values in a list, which is something we will need to support. Or sorting values in a list by things other than just the values in that list.

To come full circle, we would then need a drop_list_duplicates that could respect LAST_WINS vs FIRST_WINS or something like that.

Describe alternatives you've considered
I honestly cannot think of a good way to do this without some help from cudf.

Additional context
None

The text was updated successfully, but these errors were encountered:

jrhemstad · 2021-09-02T18:50:06Z

I envision a list gather map as a column of lists of int values where each list is a gather map for the another list.

I think what you want already exists: https://github.com/rapidsai/cudf/blob/branch-21.10/cpp/include/cudf/lists/gather.hpp#L30-L59

ttnghia · 2021-09-02T18:52:54Z

~~Oh yeah!!! Then the remaining here is to add an option "KEEP_FIST" or "KEEP_LAST" to drop_list_duplicates.~~

I confirmed with Bobby about this. So the second part of this issue has already been implemented (segmented_gather()). The first part as mentioned above is still relevant---We need an API that is similar to drop_list_duplicates but it returns a list gather map such that segmented_gather() operates on its output will produce the results exactly the same as drop_list_duplicates. Maybe we should call it unique_segmented_gather_map() or so.

jrhemstad · 2021-09-03T19:12:28Z

Can't we just add a duplicate_keep_option argument to drop_list_duplicates?

ttnghia · 2021-09-03T19:14:43Z

No, the output of drop_list_duplicates is just lists of values. If the input is [1, 1, 1, 2], you get the output is [1, 2] regardless of the keep option.

On the other hand, if you have the output is a list gather map, you will get either [0, 3] or [2, 3] depending on you want KEEP_FIRST or KEEP_LAST.

~~I have no idea why stream compaction drop_duplicates ever needs the duplicate_keep_option if its output is just values? Can you provide any example?~~

The stream compaction API drop_duplicates operates on pair of input-keys so it can have an option duplicate_keep_option. Here, drop_list_duplicates just has one input.

jrhemstad · 2021-09-03T19:19:52Z

Hm, I wonder if we can extend the key/payload idea to drop_list_duplicates. It's trickier given the nested structure would need to match.

ttnghia · 2021-09-03T21:29:04Z

This is another good approach. I just check drop_list_duplicates and see that I was implementing it by generating an index array for the unique entries then gathering list entries from the child column. If we pass in 2 inputs (keys + values) then just call gather on the child column of the values input. For the current usage, we pass in the same lists_column_view obj for both keys and values inputs.

I'm just not sure if we ever need to output the segmented_gather_map to use somewhere else. @revans2?

revans2 · 2021-09-07T16:38:41Z

I don't see another place where we would use it at this time. But if we ever do need it we can modify APIs to expose more of the underlying functionality.

ttnghia · 2021-09-07T18:03:22Z

Great. I'm going to implement that way (drop_list_duplicates will operate on 2 inputs lists columns of keys-values).
Nested types will be another topic and can always be added later.

Fixes rapidsai#9124. Adds an overload of `extract_list_element()` where the indices may be specified as a column_view. This function returns a list element from a potentially different index, for each list row. The semantics of the scalar-index version of the function are retained. i.e.: 0. The index is 0-based. 1. if (list_row == null) return null; 2. if (index > list_row.size()) return null; 3. if (index == null) return null; 4. if (index < 0 || -index <= length) return list_row[length + index];

mythrocks · 2021-09-10T14:36:48Z

(Yikes. Accidentally linked to this issue in my last PR. Please ignore.)

…ns (#9345) This PR changes the interface of `lists::drop_list_duplicates` such that it may accept a second (optional) input `values` lists column, and returns a pairs of lists columns containing the results of copying the input column without duplicate entries. If the optional `values` column is given, the users are responsible to have the keys-values columns having the same number of entries in each row. Otherwise, the results will be undefined. When copying the key entries, the corresponding value entries are also copied at the same time. A parameter `duplicate_keep_option` reused from stream compaction is used to specify which duplicate keys will be copying. This closes #9124, and blocked by #9425. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - https://github.com/nvdbaranec URL: #9345

revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 26, 2021

revans2 mentioned this issue Aug 26, 2021

[FEA] Make CreateMap duplicate key handling compatible with Spark and enable CreateMap by default NVIDIA/spark-rapids#3250

Closed

beckernick added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 26, 2021

ttnghia self-assigned this Sep 2, 2021

ttnghia mentioned this issue Sep 2, 2021

[FEA] Implement extract_list_element() with column_view indices #9172

Closed

mythrocks mentioned this issue Sep 10, 2021

extract_list_element() with column_view indices #9214

Closed

3 tasks

ttnghia mentioned this issue Sep 29, 2021

Reimplement lists::drop_list_duplicates for keys-values lists columns #9345

Merged

bdice mentioned this issue Oct 12, 2021

Fix stream compaction's drop_duplicates API to use stable sort #9417

Merged

rapids-bot bot closed this as completed in #9345 Nov 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] map key deduplication with last one wins #9124

[FEA] map key deduplication with last one wins #9124

revans2 commented Aug 26, 2021 •

edited

Loading

jrhemstad commented Sep 2, 2021

ttnghia commented Sep 2, 2021 •

edited

Loading

jrhemstad commented Sep 3, 2021

ttnghia commented Sep 3, 2021 •

edited

Loading

jrhemstad commented Sep 3, 2021

ttnghia commented Sep 3, 2021 •

edited

Loading

revans2 commented Sep 7, 2021

ttnghia commented Sep 7, 2021

mythrocks commented Sep 10, 2021

[FEA] map key deduplication with last one wins #9124

[FEA] map key deduplication with last one wins #9124

Comments

revans2 commented Aug 26, 2021 • edited Loading

jrhemstad commented Sep 2, 2021

ttnghia commented Sep 2, 2021 • edited Loading

jrhemstad commented Sep 3, 2021

ttnghia commented Sep 3, 2021 • edited Loading

jrhemstad commented Sep 3, 2021

ttnghia commented Sep 3, 2021 • edited Loading

revans2 commented Sep 7, 2021

ttnghia commented Sep 7, 2021

mythrocks commented Sep 10, 2021

revans2 commented Aug 26, 2021 •

edited

Loading

ttnghia commented Sep 2, 2021 •

edited

Loading

ttnghia commented Sep 3, 2021 •

edited

Loading

ttnghia commented Sep 3, 2021 •

edited

Loading