[FEA] map key deduplication with last one wins #9124
Comments
I think what you want already exists: https://github.com/rapidsai/cudf/blob/branch-21.10/cpp/include/cudf/lists/gather.hpp#L30-L59
I confirmed with Bobby about this. So the second part of this issue has already been implemented ( |
Can't we just add a |
No, the output of On the other hand, if you have the output is a list gather map, you will get either
The stream compaction API |
Hm, I wonder if we can extend the key/payload idea to |
This is another good approach. I just check I'm just not sure if we ever need to output the |
I don't see another place where we would use it at this time. But if we ever do need it, we can modify the APIs to expose more of the underlying functionality.
Great. I'm going to implement that way ( |
Fixes rapidsai#9124. Adds an overload of `extract_list_element()` where the indices may be specified as a column_view. This function returns a list element from a potentially different index, for each list row. The semantics of the scalar-index version of the function are retained, i.e.:
0. The index is 0-based.
1. if (list_row == null) return null;
2. if (index >= list_row.size()) return null;
3. if (index == null) return null;
4. if (index < 0 && -index <= length) return list_row[length + index];
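The per-row index rules can be sketched in plain Python. This is only an illustration, not the cudf API: the function name is borrowed for readability, columns are modelled as Python lists with `None` standing in for null, and it assumes that a negative index counts from the end of the row and that any index outside the row yields null.

```python
from typing import Any, List, Optional


def extract_list_element(
    lists: List[Optional[List[Any]]],
    indices: List[Optional[int]],
) -> List[Any]:
    """Plain-Python stand-in: pick one element per row using a per-row index."""
    if len(lists) != len(indices):
        raise ValueError("lists and indices must have the same number of rows")

    out = []
    for row, index in zip(lists, indices):
        # A null row or a null index produces a null result.
        if row is None or index is None:
            out.append(None)
            continue
        length = len(row)
        if 0 <= index < length:
            # In-range, 0-based index.
            out.append(row[index])
        elif index < 0 and -index <= length:
            # Negative index counts back from the end of the row.
            out.append(row[length + index])
        else:
            # Out of range in either direction (assumed to give null).
            out.append(None)
    return out


lists = [[1, 2, 3], [4, 5], None, [6], [7, 8]]
indices = [0, -1, 0, 3, None]
result = extract_list_element(lists, indices)
print(result)
```

Each row uses its own index, so one call covers a positive index, a negative index, a null row, an out-of-range index, and a null index.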
(Yikes. Accidentally linked to this issue in my last PR. Please ignore.)
…ns (#9345)
This PR changes the interface of `lists::drop_list_duplicates` so that it accepts a second (optional) input `values` lists column and returns a pair of lists columns containing the results of copying the input columns without duplicate entries. If the optional `values` column is given, users are responsible for ensuring that the keys and values columns have the same number of entries in each row; otherwise, the results are undefined. When a key entry is copied, the corresponding value entry is copied along with it. A `duplicate_keep_option` parameter, reused from stream compaction, specifies which of the duplicate keys will be copied. This closes #9124, and is blocked by #9425.
Authors:
- Nghia Truong (https://github.com/ttnghia)
Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- https://github.com/nvdbaranec
URL: #9345
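The keys/values behaviour described in this PR can be illustrated with a small plain-Python sketch. It is not the cudf implementation: `drop_duplicates_with_values` and its `keep` argument are hypothetical names, the output keeps keys in first-occurrence order (an assumption, the PR text does not state an ordering), and mismatched row lengths raise an error here whereas the PR leaves that case undefined.

```python
from typing import Any, Hashable, List, Optional, Tuple

Row = Optional[List[Any]]


def drop_duplicates_with_values(
    keys_column: List[Optional[List[Hashable]]],
    values_column: List[Row],
    keep: str = "first",  # hypothetical stand-in for a keep-first/keep-last option
) -> Tuple[List[Row], List[Row]]:
    """Drop duplicate keys per row and carry the matching values along."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    out_keys: List[Row] = []
    out_values: List[Row] = []
    for keys, values in zip(keys_column, values_column):
        if keys is None or values is None:
            # Null rows stay null in both outputs (assumption).
            out_keys.append(None)
            out_values.append(None)
            continue
        if len(keys) != len(values):
            # The PR leaves this case undefined; the sketch fails loudly instead.
            raise ValueError("keys and values rows must have the same length")
        kept = {}
        for key, value in zip(keys, values):
            # 'last' overwrites earlier values; 'first' keeps the earliest one.
            if keep == "last" or key not in kept:
                kept[key] = value
        out_keys.append(list(kept.keys()))
        out_values.append(list(kept.values()))
    return out_keys, out_values


keys = [[1, 2, 1], None, [3, 3]]
values = [["a", "b", "c"], None, ["x", "y"]]
first_keys, first_values = drop_duplicates_with_values(keys, values, keep="first")
last_keys, last_values = drop_duplicates_with_values(keys, values, keep="last")
print(last_keys, last_values)
```

With `keep="last"` the value attached to a repeated key is the one from its final occurrence, which is the behaviour the issue asks for.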
Is your feature request related to a problem? Please describe.
In Spark, what happens when a Map type contains duplicate keys is configurable. By default we need to throw an exception, and I filed #9123 for that. The other option is LAST_WINS, which means that if there are duplicate keys, the last value should overwrite the previous ones. We store maps as a list of Struct(key, value), so we need a way to do this kind of processing on a struct.
Describe the solution you'd like
Ideally we want to break this down into smaller, reusable parts that we can combine into what we want.
I am coming up with a few new concepts here, so please feel free to correct any confusion I might have.
There is already a `drop_list_duplicates` API. I think what I would like to see is a version of this that returns a "list gather map". I envision a list gather map as a column of lists of int values, where each list is a gather map for the corresponding list in another column.
For example if I had input data like
Column A:
{
[10, 9, 8, 9],
[7, 6, 6],
null,
[5, 4, 5, 3, 2, 1]
}
And a list gather map of:
{
[0, 3, 2],
[0, 2],
null,
[2, 1, 3, 4, 5]
}
I could call a list gather on it to produce...
Gathered Column A:
{
[10, 9, 8],
[7, 6],
null,
[5, 4, 3, 2, 1]
}
But I could also have a "map" column of:
{
[(10, value-0), (9, value-1), (8, value-2), (9, value-3)],
[(7, value-0), (6, value-1), (6, value-2)],
null,
[(5, value-0), (4, value-1), (5, value-2), (3, value-3), (2, value-4), (1, value-5)]
}
and use the same list gather map to produce:
{
[(10, value-0), (9, value-3), (8, value-2)],
[(7, value-0), (6, value-2)],
null,
[(5, value-2), (4, value-1), (3, value-3), (2, value-4), (1, value-5)]
}
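The two examples above can be reproduced with a minimal plain-Python sketch of a list gather. This is not the cudf API: `list_gather` is a hypothetical helper, columns are modelled as Python lists with `None` for null rows, and it assumes that a null in either the data row or the gather-map row yields a null output row.

```python
from typing import Any, List, Optional


def list_gather(
    column: List[Optional[List[Any]]],
    gather_map: List[Optional[List[int]]],
) -> List[Optional[List[Any]]]:
    """Apply a per-row gather map to a column of lists."""
    if len(column) != len(gather_map):
        raise ValueError("column and gather map must have the same number of rows")

    out = []
    for row, indices in zip(column, gather_map):
        if row is None or indices is None:
            # Assumed corner-case handling: null in, null out.
            out.append(None)
        else:
            out.append([row[i] for i in indices])
    return out


column_a = [[10, 9, 8, 9], [7, 6, 6], None, [5, 4, 5, 3, 2, 1]]
gather_map = [[0, 3, 2], [0, 2], None, [2, 1, 3, 4, 5]]
gathered_a = list_gather(column_a, gather_map)

# The same gather map applied to a "map" column of (key, value) structs.
map_column = [
    [(10, "value-0"), (9, "value-1"), (8, "value-2"), (9, "value-3")],
    [(7, "value-0"), (6, "value-1"), (6, "value-2")],
    None,
    [(5, "value-0"), (4, "value-1"), (5, "value-2"),
     (3, "value-3"), (2, "value-4"), (1, "value-5")],
]
gathered_map = list_gather(map_column, gather_map)
print(gathered_a)
print(gathered_map)
```

Because the gather map only holds positions, it works unchanged on any column whose rows have the same lengths as the column it was built from.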
I have not thought through the corner cases for this type of operation, like nulls in the list column but not in the list gather map. But I think it should not be too hard to work them out.
This would also open up a lot of possibilities as the back end for processing lists and maps, such as filtering the values in a list (something we will need to support) or sorting the values in a list by something other than the values themselves.
To come full circle, we would then need a drop_list_duplicates that could respect LAST_WINS vs FIRST_WINS or something like that.
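One way to picture such a variant is a function that returns the list gather map itself. The sketch below is plain Python, not a cudf API: `dedup_gather_map` and its `keep` argument are hypothetical, and it assumes each output row lists keys in first-occurrence order, which is what the example gather map earlier in this issue implies.

```python
from typing import Hashable, List, Optional


def dedup_gather_map(
    keys_column: List[Optional[List[Hashable]]],
    keep: str = "last",  # "last" ~ LAST_WINS, "first" ~ FIRST_WINS
) -> List[Optional[List[int]]]:
    """Build a per-row gather map that drops duplicate keys."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    result = []
    for keys in keys_column:
        if keys is None:
            # Null rows produce a null gather map row (assumption).
            result.append(None)
            continue
        chosen = {}  # key -> index to gather; dicts preserve insertion order
        for i, key in enumerate(keys):
            if keep == "last":
                chosen[key] = i            # later occurrences overwrite the index
            else:
                chosen.setdefault(key, i)  # the first occurrence is kept
        result.append(list(chosen.values()))
    return result


column_a = [[10, 9, 8, 9], [7, 6, 6], None, [5, 4, 5, 3, 2, 1]]
last_wins = dedup_gather_map(column_a, keep="last")
first_wins = dedup_gather_map(column_a, keep="first")
print(last_wins)
print(first_wins)
```

With `keep="last"` this produces the same gather map as the example above, so feeding it to a list gather over the map column gives the LAST_WINS result.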
Describe alternatives you've considered
I honestly cannot think of a good way to do this without some help from cudf.
Additional context
None