
[FEA] List duplication detection #9123

Closed

revans2 opened this issue Aug 26, 2021 · 7 comments

Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

revans2 (Contributor) commented Aug 26, 2021

Is your feature request related to a problem? Please describe.
In Spark we have a requirement to detect, and fail, if there are duplicate keys in a map. We store maps as a list of Struct(key, value). It is fairly simple to pull the keys out of this as a list, so what we really need is the ability to tell whether any list in a column contains duplicate values.
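For concreteness, here is a host-side sketch of the predicate in question, written against the usual offsets-plus-child layout of a list column (plain C++, not libcudf; the function name is illustrative only):

```cpp
// Host-side sketch: list i spans [offsets[i], offsets[i + 1]) in the flat
// `keys` array. Returns true if any list contains a repeated key.
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

bool any_list_has_duplicates(std::vector<int32_t> const& offsets,
                             std::vector<std::string> const& keys)
{
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    std::unordered_set<std::string> seen;
    for (int32_t j = offsets[i]; j < offsets[i + 1]; ++j) {
      if (!seen.insert(keys[j]).second) { return true; }  // key repeated in list i
    }
  }
  return false;
}
```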

Describe the solution you'd like
I am not 100% sure of the best solution here. For flexibility it might be nice to break the problem down into a list_contains_duplicates method that returns a boolean for each list indicating whether it contains a duplicate; we could then do an any reduction on that result to get the final answer. But I don't know whether anyone else has a similar requirement.

Describe alternatives you've considered
We could do a segmented sort of the values, then pull out just the data column and try to do a windowed lead 1 (though I am not sure we can do that, because we would need a way to use the offsets to set the window boundaries). With that we could check for equality between adjacent entries and finally do an any reduction (ignoring nulls). A sketch of that pipeline is below. It sounds overly complicated, and I am not 100% sure we can even do it without help from cudf.
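A host-side sketch of that alternative, again over an offsets-plus-child layout (plain C++, not libcudf; std::sort per segment stands in for the segmented sort, and the adjacent-equality check stands in for the windowed lead-1 comparison):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Sort each list's values, then look for two equal neighbors; OR the
// per-list results together (the final "any" reduction).
bool any_duplicate_via_segmented_sort(std::vector<int32_t> const& offsets,
                                      std::vector<std::string> keys)  // by value: sorted in place
{
  bool any = false;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    auto first = keys.begin() + offsets[i];
    auto last  = keys.begin() + offsets[i + 1];
    std::sort(first, last);  // one segment of the "segmented sort"
    any = any || (std::adjacent_find(first, last) != last);  // lead-1 equality
  }
  return any;
}
```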

Additional context
The behavior on duplicate keys is configurable, but this is the default, so it is the most important one for us to implement. I will file a separate issue for supporting last value wins.

revans2 added the feature request (New feature or request) and Needs Triage (Need team to review and classify) labels Aug 26, 2021
revans2 added the Spark (Functionality that helps Spark RAPIDS) label Aug 26, 2021
beckernick added the libcudf (Affects libcudf (C++/CUDA) code) label and removed the Needs Triage label Aug 26, 2021
ttnghia self-assigned this Sep 2, 2021
jrhemstad (Contributor)

You could do a count_elements + drop_list_duplicates + count_elements and then check if the results of the two count_elements are different.

I admit that's a little expensive, but the proposed list_contains_duplicates feels a little ad hoc.
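As a host-side sketch of the same idea (plain C++, not libcudf): a list contains a duplicate exactly when dropping duplicates shrinks its element count.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Per list: count, drop duplicates (sort + unique), count again, compare.
bool any_duplicate_via_counts(std::vector<int32_t> const& offsets,
                              std::vector<std::string> keys)  // by value: mutated
{
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    auto first = keys.begin() + offsets[i];
    auto last  = keys.begin() + offsets[i + 1];
    auto const count_before = std::distance(first, last);  // count_elements
    std::sort(first, last);
    auto const count_after =                                // count after dedup
      std::distance(first, std::unique(first, last));
    if (count_before != count_after) { return true; }
  }
  return false;
}
```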

ttnghia (Contributor) commented Sep 3, 2021

Well, for efficiency (in terms of performance) I'm going to just do a segmented sort, then a linear scan comparing each adjacent pair of elements within a list.

jrhemstad (Contributor)

> Well, for efficiency (in term of performance) I'm going to just do a linear scan of each list and compare the first entry to the other entries.

Wouldn't you need to compare the second element to the other entries, and the third, etc?

Furthermore, this would still necessitate adding a new function to libcudf for a very specific use case that does not seem to generalize well. That increases compile time, binary size, interface surface area, maintenance cost, etc. I would prefer we attempt to use existing APIs to achieve the desired end and only if benchmarking shows that approach to be a problem would we revisit trying to find a more efficient and general purpose primitive.

jrhemstad (Contributor) commented Sep 3, 2021

> we have a requirement to be able to detect and fail if there are duplicate keys in a map

I'm curious when/how this situation could arise. The map column can only come from reading a file (which I would assume would have checked for duplicate keys when it was written?) or from the result of some other operation that would have preserved the uniqueness of keys, right?

ttnghia (Contributor) commented Sep 3, 2021

> > Well, for efficiency (in term of performance) I'm going to just do a linear scan of each list and compare the first entry to the other entries.
>
> Wouldn't you need to compare the second element to the other entries, and the third, etc?

Sorry, I edited my post to say something else. Fuzzy brain in the morning after waking up, before food.

jrhemstad (Contributor)

If the proposed implementation requires a segmented sort anyway, I'm less concerned about the overhead of the count_elements/drop_list_duplicates solution I mentioned above, and we should try that first.

revans2 (Contributor, Author) commented Sep 3, 2021

Yeah, you are probably right. If a segmented sort is required anyway, then we should probably just do count_elements != count_elements(drop_list_duplicates).
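A rough sketch of how that might compose from the existing libcudf APIs named above; the signatures here are approximations recalled from this era of cudf and should be checked against the actual headers (treat it as pseudocode, not a drop-in snippet):

```cpp
// Hedged sketch of the agreed approach, composed from existing libcudf APIs.
#include <cudf/aggregation.hpp>
#include <cudf/binaryop.hpp>
#include <cudf/lists/count_elements.hpp>
#include <cudf/lists/drop_list_duplicates.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/reduction.hpp>
#include <cudf/scalar/scalar.hpp>

#include <memory>

// Given the keys pulled out of the map column as a list column, return a
// BOOL8 scalar that is true when any list contains a duplicate.
std::unique_ptr<cudf::scalar> any_list_has_duplicates(cudf::lists_column_view const& keys)
{
  auto counts_before = cudf::lists::count_elements(keys);
  auto deduped       = cudf::lists::drop_list_duplicates(keys);
  auto counts_after =
    cudf::lists::count_elements(cudf::lists_column_view{deduped->view()});

  // A list had duplicates exactly when dropping duplicates changed its count.
  auto has_dup = cudf::binary_operation(counts_before->view(),
                                        counts_after->view(),
                                        cudf::binary_operator::NOT_EQUAL,
                                        cudf::data_type{cudf::type_id::BOOL8});

  // ANY reduction (null-ignoring) collapses the per-list flags to one answer.
  return cudf::reduce(has_dup->view(),
                      *cudf::make_any_aggregation<cudf::reduce_aggregation>(),
                      cudf::data_type{cudf::type_id::BOOL8});
}
```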

revans2 closed this as completed Sep 3, 2021