[FEA] List duplication detection #9123
Comments
You could do a I admit that's a little expensive, but the proposed
Well, for efficiency (in terms of performance) I'm going to just do a segmented sort, then a linear scan of each adjacent list pair.
Wouldn't you need to compare the second element to the other entries, and the third, etc.? Furthermore, this would still necessitate adding a new function to libcudf for a very specific use case that does not seem to generalize well. That increases compile time, binary size, interface surface area, maintenance cost, etc. I would prefer we attempt to use existing APIs to achieve the desired end, and only if benchmarking shows that approach to be a problem would we revisit trying to find a more efficient and general-purpose primitive.
I'm curious when/how this situation could arise. The map column can only come from reading a file (which I would assume would have been checked for duplicate keys when it was written?) or from the result of some other operation that would have preserved the uniqueness of keys, right?
Sorry, I edited my post to something else. Fuzzy brain in the morning after waking up before food.
If the proposed implementation requires a segmented sort anyways, I'm less concerned about the overhead of the
Ya, you are probably right. If a segmented sort is required, then we probably should just do a
Is your feature request related to a problem? Please describe.
In Spark we have a requirement to be able to detect and fail if there are duplicate keys in a map. We store maps as a list of Struct(key, value). It is fairly simple to pull the keys out of this as a list so really what we need is the ability to tell if there are duplicate values in any list in a column.
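To make the requirement concrete, here is a minimal sketch in plain Python of the data shape described above: a map column modeled as a list of `(key, value)` structs per row, with the keys pulled out as a list column. The variable names are illustrative, not cudf APIs.

```python
# Illustrative only: a map column modeled as a Python list of rows,
# each row being a list of (key, value) structs, as described above.
maps = [
    [("a", 1), ("b", 2)],  # row 0: unique keys
    [("a", 1), ("a", 3)],  # row 1: duplicate key "a" -> should fail
]

# "Pull the keys out of this as a list": one list of keys per row.
keys = [[k for k, _ in row] for row in maps]  # [["a", "b"], ["a", "a"]]

# What we ultimately need: does any list contain a duplicate value?
has_dup = any(len(row) != len(set(row)) for row in keys)  # True
```

In Spark's default mode, `has_dup` being true would trigger the failure described above.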
Describe the solution you'd like
I am not 100% sure of the best solution here. For flexibility, it might be nice to break the problem down into a `list_contains_duplicates` method that returns a boolean for each list that has a duplicate in it, and then we can do an `any` reduction on it to get the final answer. But I don't know if anyone else has a similar requirement.
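A sketch of the proposed semantics, in plain Python rather than libcudf (the function name `list_contains_duplicates` comes from the proposal above; everything else is an assumption):

```python
def list_contains_duplicates(lists):
    """One boolean per list row: True if that row holds a repeated value.

    Hypothetical semantics for the proposed primitive; null handling
    is deliberately left out of this sketch.
    """
    return [len(row) != len(set(row)) for row in lists]

keys = [[1, 2, 3], [4, 4], []]
per_row = list_contains_duplicates(keys)  # [False, True, False]
any_duplicate = any(per_row)              # the "any reduction" -> True
```

Splitting the per-row check from the reduction keeps the primitive composable: callers that need to know *which* rows are bad (e.g., for an error message) can use the boolean column directly.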
Describe alternatives you've considered
We could do a segmented sort of the values, then pull out just the data column and try to do a windowed lead-1 (but I am not sure we can do that, because we would need a way to use the offsets to set the window boundaries). With that we could check for equality between adjacent entries and finally do an `any` reduction (ignoring nulls). That sounds overly complicated, and I am not 100% sure we can even do it without help from cudf.
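The segmented-sort alternative can be sketched with numpy, assuming the list column is stored as a flat values array plus an offsets array (the usual list-column layout). This is a host-side illustration of the idea, not a libcudf implementation; the function name is made up.

```python
import numpy as np

def any_list_has_duplicates(values, offsets):
    """Sort each list segment, then compare adjacent entries for equality.

    `values` is the flattened child data; `offsets[i]:offsets[i+1]` bounds
    list i. Nulls are not modeled in this sketch.
    """
    values = np.asarray(values)
    for start, end in zip(offsets[:-1], offsets[1:]):
        seg = np.sort(values[start:end])   # the "segmented sort" of one list
        if np.any(seg[:-1] == seg[1:]):    # adjacent-equality scan (lead-1)
            return True                    # short-circuit "any" reduction
    return False

# Lists [[1, 2, 3], [4, 4]] flattened into values + offsets:
vals = [1, 2, 3, 4, 4]
offs = [0, 3, 5]
any_list_has_duplicates(vals, offs)  # True
```

After sorting, duplicates are always adjacent, so the lead-1 comparison only needs each element's immediate neighbor rather than every other entry in the list.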
Additional context
The behavior of what to do on duplicate keys is configurable, but this is the default so it is the most important for us to implement. I will file a separate issue to support last value wins.