Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite CollectionUtils dedup to work with any type #85352

Merged
merged 14 commits into from
Mar 28, 2022

Conversation

rjernst
Copy link
Member

@rjernst rjernst commented Mar 25, 2022

CollectionUtils contains a method for sorting and deduplicating an hppc
list of byte arrays. That is a very specific type. Yet the algorithm for
deduplicating a sorted list is very simple and does not need to be
specially typed.

This commit removes the sort portion of the method, as that is already
easily available (and timsort in Java should be just fine for this
purpose, we don't need introsort), renames to unique, and makes the
method take a generic List along with a Comparator.

relates #84735

CollectionUtils contains a method for sorting and deduplicating an hppc
list of byte arrays. That is a very specific type. Yet the algorithm for
deduplicating a sorted list is very simple and does not need to be
specially typed.

This commit removes the sort portion of the method, as that is already
easily available (and timsort in Java should be just fine for this
purpose, we don't need introsort), renames to unique, and makes the
method take a generic List along with a Comparator.

relates elastic#84735
@rjernst rjernst added :Core/Infra/Core Core issues without another label >refactoring v8.2.0 labels Mar 25, 2022
@rjernst rjernst requested a review from jpountz March 25, 2022 04:35
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Mar 25, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much cleaner.

T prevValue = list.get(0);
for (int i = 1; i < list.size(); ++i) {
T nextValue = list.get(i);
if (cmp.compare(nextValue, prevValue) != 0 && prevNdx++ != i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pre-increment rather than post-increment? Otherwise it looks to me like a list where all elements are unique would still overwrite all the time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it should be pre-increment.

@@ -47,6 +44,27 @@ public static boolean isEmpty(Object[] array) {
return array == null || array.length == 0;
}

public static <T> void unique(List<T> list, Comparator<T> cmp) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add javadocs that the list must be sorted according to the given comparator?
Also a bit of a nit-pick, but since this modifies the list in-place, I feel like naming the method after a verb would be more appropriate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to uniquify

if (list.size() <= 1) {
return;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we verify that the list implements random access?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The algorithm only requires a forward iterator. I've rewritten to use ListIterator instead of indices. The only caveat is that for LinkedList Java does not provide an efficient means to remove the rest of a list from a given point.

@rjernst
Copy link
Member Author

rjernst commented Mar 28, 2022

@jpountz I think I addressed all your comments, can you take another look?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this looks good to me. One thing that makes me a bit nervous is that I'm unsure of the performance impact that this change would have for users of binary fields, but it would only possibly affect binary fields that are effectively multi-valued, which is a rare combination I suspect, so probably nothing to worry about?

@rjernst
Copy link
Member Author

rjernst commented Mar 28, 2022

but it would only possibly affect binary fields that are effectively multi-valued, which is a rare combination I suspect, so probably nothing to worry about?

Yes that is my thinking.

rjernst and others added 7 commits March 28, 2022 12:51
When a plugin specifies an elasticsearch version with an empty string
in their plugin-descriptor.properties file, we don't enforce the
version check. This PR ensures that we check the version string for
empty as well as null value.
Since Java 9, the JDK has provided a means of parsing Java versions and
getting the current Java version. That class obviates the need for the
JavaVersion class of Elasticsearch. This commit removes the JavaVersion
class in favor of Runtime.Version.

Note that most of the changes here simply removed logic around
versioning because this change is intended only for the master branch,
where Java 17 is required.
This commit removes leftover sort methods in CollectionUtils which were
only used in tests.
@rjernst rjernst merged commit 31918fb into elastic:master Mar 28, 2022
@rjernst rjernst deleted the hppc/binary_dedup branch March 28, 2022 20:57
@rjernst rjernst mentioned this pull request Mar 28, 2022
43 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label >refactoring Team:Core/Infra Meta label for core/infra team v8.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants