
[FEA] List duplication detection #9123

Closed

revans2 opened this issue Aug 26, 2021 · 7 comments

Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

revans2 (Contributor) commented Aug 26, 2021

Is your feature request related to a problem? Please describe.
In Spark we have a requirement to detect, and fail, if there are duplicate keys in a map. We store maps as a list of Struct(key, value). It is fairly simple to pull the keys out of this as a list, so what we really need is the ability to tell whether any list in a column contains duplicate values.
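For concreteness, here is a host-side sketch of the predicate in question, written against the usual offsets-plus-child layout of a list column (plain C++, not libcudf; the function name is illustrative only):

```cpp
// Host-side sketch: list i spans [offsets[i], offsets[i + 1]) in the flat
// `keys` array. Returns true if any list contains a repeated key.
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

bool any_list_has_duplicates(std::vector<int32_t> const& offsets,
                             std::vector<std::string> const& keys)
{
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    std::unordered_set<std::string> seen;
    for (int32_t j = offsets[i]; j < offsets[i + 1]; ++j) {
      if (!seen.insert(keys[j]).second) { return true; }  // key repeated in list i
    }
  }
  return false;
}
```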

Describe the solution you'd like
I am not 100% sure of the best solution here. For flexibility it might be nice to break the problem down into a list_contains_duplicates method that returns a boolean for each list indicating whether it contains a duplicate; we could then do an any reduction on that result to get the final answer. But I don't know whether anyone else has a similar requirement.

Describe alternatives you've considered
We could do a segmented sort of the values, then pull out just the data column and try to do a windowed lead 1 (though I am not sure we can do that, because we would need a way to use the offsets to set the window boundaries). With that we could check for equality between adjacent entries and finally do an any reduction (ignoring nulls). A sketch of that pipeline is below. It sounds overly complicated, and I am not 100% sure we can even do it without help from cudf.
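A host-side sketch of that alternative, again over an offsets-plus-child layout (plain C++, not libcudf; std::sort per segment stands in for the segmented sort, and the adjacent-equality check stands in for the windowed lead-1 comparison):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Sort each list's values, then look for two equal neighbors; OR the
// per-list results together (the final "any" reduction).
bool any_duplicate_via_segmented_sort(std::vector<int32_t> const& offsets,
                                      std::vector<std::string> keys)  // by value: sorted in place
{
  bool any = false;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    auto first = keys.begin() + offsets[i];
    auto last  = keys.begin() + offsets[i + 1];
    std::sort(first, last);  // one segment of the "segmented sort"
    any = any || (std::adjacent_find(first, last) != last);  // lead-1 equality
  }
  return any;
}
```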

Additional context
The behavior on duplicate keys is configurable, but this is the default, so it is the most important one for us to implement. I will file a separate issue for supporting last value wins.

revans2 added the feature request (New feature or request) and Needs Triage (Need team to review and classify) labels Aug 26, 2021
revans2 added the Spark (Functionality that helps Spark RAPIDS) label Aug 26, 2021
beckernick added the libcudf (Affects libcudf (C++/CUDA) code) label and removed the Needs Triage label Aug 26, 2021
ttnghia self-assigned this Sep 2, 2021
jrhemstad (Contributor)

You could do a count_elements + drop_list_duplicates + count_elements and then check if the results of the two count_elements are different.

I admit that's a little expensive, but the proposed list_contains_duplicates feels a little ad hoc.
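As a host-side sketch of the same idea (plain C++, not libcudf): a list contains a duplicate exactly when dropping duplicates shrinks its element count.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Per list: count, drop duplicates (sort + unique), count again, compare.
bool any_duplicate_via_counts(std::vector<int32_t> const& offsets,
                              std::vector<std::string> keys)  // by value: mutated
{
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    auto first = keys.begin() + offsets[i];
    auto last  = keys.begin() + offsets[i + 1];
    auto const count_before = std::distance(first, last);  // count_elements
    std::sort(first, last);
    auto const count_after =                                // count after dedup
      std::distance(first, std::unique(first, last));
    if (count_before != count_after) { return true; }
  }
  return false;
}
```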

ttnghia (Contributor) commented Sep 3, 2021

Well, for efficiency (in terms of performance) I'm going to just do a segmented sort, then a linear scan comparing each adjacent pair of elements within a list.

jrhemstad (Contributor)

> Well, for efficiency (in term of performance) I'm going to just do a linear scan of each list and compare the first entry to the other entries.

Wouldn't you need to compare the second element to the other entries, and the third, etc?

Furthermore, this would still necessitate adding a new function to libcudf for a very specific use case that does not seem to generalize well. That increases compile time, binary size, interface surface area, maintenance cost, etc. I would prefer we attempt to use existing APIs to achieve the desired end and only if benchmarking shows that approach to be a problem would we revisit trying to find a more efficient and general purpose primitive.

jrhemstad (Contributor) commented Sep 3, 2021

> we have a requirement to be able to detect and fail if there are duplicate keys in a map

I'm curious when/how this situation could arise. The map column can only come from reading a file (which I would assume would have checked for duplicate keys when it was written?) or from the result of some other operation that would have preserved the uniqueness of keys, right?

ttnghia (Contributor) commented Sep 3, 2021

> > Well, for efficiency (in term of performance) I'm going to just do a linear scan of each list and compare the first entry to the other entries.
>
> Wouldn't you need to compare the second element to the other entries, and the third, etc?

Sorry, I edited my post to say something else. Fuzzy brain in the morning after waking up, before food.

jrhemstad (Contributor)

If the proposed implementation requires a segmented sort anyway, I'm less concerned about the overhead of the count_elements/drop_list_duplicates solution I mentioned above, and we should try that first.

revans2 (Contributor, Author) commented Sep 3, 2021

Yeah, you are probably right. If a segmented sort is required anyway, then we should probably just do count_elements != count_elements(drop_list_duplicates).
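A rough sketch of how that might compose from the existing libcudf APIs named above; the signatures here are approximations recalled from this era of cudf and should be checked against the actual headers (treat it as pseudocode, not a drop-in snippet):

```cpp
// Hedged sketch of the agreed approach, composed from existing libcudf APIs.
#include <cudf/aggregation.hpp>
#include <cudf/binaryop.hpp>
#include <cudf/lists/count_elements.hpp>
#include <cudf/lists/drop_list_duplicates.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/reduction.hpp>
#include <cudf/scalar/scalar.hpp>

#include <memory>

// Given the keys pulled out of the map column as a list column, return a
// BOOL8 scalar that is true when any list contains a duplicate.
std::unique_ptr<cudf::scalar> any_list_has_duplicates(cudf::lists_column_view const& keys)
{
  auto counts_before = cudf::lists::count_elements(keys);
  auto deduped       = cudf::lists::drop_list_duplicates(keys);
  auto counts_after =
    cudf::lists::count_elements(cudf::lists_column_view{deduped->view()});

  // A list had duplicates exactly when dropping duplicates changed its count.
  auto has_dup = cudf::binary_operation(counts_before->view(),
                                        counts_after->view(),
                                        cudf::binary_operator::NOT_EQUAL,
                                        cudf::data_type{cudf::type_id::BOOL8});

  // ANY reduction (null-ignoring) collapses the per-list flags to one answer.
  return cudf::reduce(has_dup->view(),
                      *cudf::make_any_aggregation<cudf::reduce_aggregation>(),
                      cudf::data_type{cudf::type_id::BOOL8});
}
```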

revans2 closed this as completed Sep 3, 2021