Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sort on nanosecond date fields with missing values #74760

Merged
merged 9 commits into from
Jul 7, 2021

Conversation

cbuescher
Copy link
Member

For missing values on date fields we use Long.MIN_VALUE by default. This is okay
when the resolution of the field is milliseconds. For nanoseconds though,
negative values can lead to IllegalArgumentExpections when we try to format them
internally.
This change fixes this by explicitely setting the minimum value to 0L (which
corresponds to 1970-01-01T00:00:00.000 for nanosecond resolution) when no other
explicit missing value is defined and the target numeric type is a nanosecond
type (this is true for nanosecond fields and when "numeric_type" is explicitely
set). This way we correct the behaviour for single typed indices and cases where
we are sorting across multiple indices with mixed "date" and "date_nanos" type
where "numeric_type" is set in the sort definition.

Closes #73763

For missing values on date fields we use Long.MIN_VALUE by default. This is okay
when the resolution of the field is milliseconds. For nanoseconds though,
negative values can lead to IllegalArgumentExpections when we try to format them
internally.
This change fixes this by explicitely setting the minimum value to 0L (which
corresponds to 1970-01-01T00:00:00.000 for nanosecond resolution) when no other
explicit missing value is defined and the target numeric type is a nanosecond
type (this is true for nanosecond fields and when "numeric_type" is explicitely
set). This way we correct the behaviour for single typed indices and cases where
we are sorting across multiple indices with mixed "date" and "date_nanos" type
where "numeric_type" is set in the sort definition.

Closes elastic#73763
@cbuescher cbuescher added >bug :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.14.0 labels Jun 30, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jun 30, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@cbuescher
Copy link
Member Author

@elasticmachine update branch

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments.

final boolean min = sortMissingFirst(missingValue) ^ reverse;
missingValue = min ? 0L : Long.MAX_VALUE;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic could move to IndexNumericFieldData#missingObject ? IndexFieldData#missingObject doesn't need to be final, we need to pick the value depending on the targetNumericType.

Copy link
Member Author

@cbuescher cbuescher Jul 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this a bit and one problem I couldn't solve with this is that "missingObject" is part of XFieldComparatorSource, not IndexFieldData. That probably woudl mean overwriting it in special versions of LongValuesComparatorSource, and for the things I tried it wasn't always clear how to ther the "targetNumericType" from the sort (e.g. reduceType() returns SortField.Type that doesn't even differentiate between Long, Data and Datenanos like IndexNumericFieldData.NumericType does). Maybe I'm missing something but from what I tried this current location seems less complex to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The targetNumericType could be added to the LongValuesSource and checked in missingObject ? It's not really about complexity here, missingObject is called in several contexts so it needs a fix.

SearchResponse searchResponse = client().prepareSearch(index)
.addSort(SortBuilders.fieldSort("mydate").order(SortOrder.ASC).setFormat(format).setNumericType("date_nanos"))
.get();
assertHitsInOrder(searchResponse, new String[] { "1", "2", "4", "5", "3", "6" });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A tiebreaker should be used to ensure that 3 is always sorted before 6 (same value otherwise).

.setNumericType("date_nanos")
)
.get();
assertHitsInOrder(searchResponse, new String[] { "3", "6", "1", "2", "4", "5" });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

@cbuescher
Copy link
Member Author

@jimczi thanks for the review, I improved the tests and left a comment regarding moving the implementation logic elsewhere and the problems I encountered. If you have other suggestions please let me know, otherwise I hope its also okay if we leave it where it is.

@cbuescher
Copy link
Member Author

@jimczi I pushed another commit that adresses your latest comment, should be ready for another look I think

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor comments, LGTM otherwise

@Nullable Object missingValue,
MultiValueMode sortMode,
Nested nested,
NumericType targetNumericType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The targetNumericType is implicit here ?

@Nullable Object missingValue,
MultiValueMode sortMode,
Nested nested,
NumericType targetNumericType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The targetNumericType is implicit here, no need to add it.

// special case to prevent negative values that would cause invalid nanosecond ranges
if (sortMissingFirst(missingValue) || sortMissingLast(missingValue)) {
final boolean min = sortMissingFirst(missingValue) ^ reversed;
return min ? 0L : Long.MAX_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use DateUtils#MAX_NANOSECOND_INSTANT as the max

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the converted nanoseconds value of that Instant though I think, which should be the same as Long.MAX_VALUE? I introduced another constant (8c6df48#diff-09ba5f62fc8de8923f01213874ef48657920b2cc1428ff4f27f5f06169ee8f1aR207) in my last commit to not have to do that coversion every time, just if you want to check.

@cbuescher
Copy link
Member Author

@elasticmachine update branch

@cbuescher cbuescher merged commit bf952a4 into elastic:master Jul 7, 2021
@cbuescher cbuescher added backport pending auto-backport Automatically create backport pull requests when merged labels Jul 7, 2021
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 7, 2021
For missing values on date fields we use Long.MIN_VALUE by default. This is okay
when the resolution of the field is milliseconds. For nanoseconds though,
negative values can lead to IllegalArgumentExpections when we try to format them
internally.
This change fixes this by explicitely setting the minimum value to 0L (which
corresponds to 1970-01-01T00:00:00.000 for nanosecond resolution) when no other
explicit missing value is defined and the target numeric type is a nanosecond
type (this is true for nanosecond fields and when "numeric_type" is explicitely
set). This way we correct the behaviour for single typed indices and cases where
we are sorting across multiple indices with mixed "date" and "date_nanos" type
where "numeric_type" is set in the sort definition.

Closes elastic#73763
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 7, 2021
For missing values on date fields we use Long.MIN_VALUE by default. This is okay
when the resolution of the field is milliseconds. For nanoseconds though,
negative values can lead to IllegalArgumentExpections when we try to format them
internally.
This change fixes this by explicitely setting the minimum value to 0L (which
corresponds to 1970-01-01T00:00:00.000 for nanosecond resolution) when no other
explicit missing value is defined and the target numeric type is a nanosecond
type (this is true for nanosecond fields and when "numeric_type" is explicitely
set). This way we correct the behaviour for single typed indices and cases where
we are sorting across multiple indices with mixed "date" and "date_nanos" type
where "numeric_type" is set in the sort definition.

Closes elastic#73763
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
7.14
7.x

cbuescher pushed a commit that referenced this pull request Jul 7, 2021
#75064)

For missing values on date fields we use Long.MIN_VALUE by default. This is okay
when the resolution of the field is milliseconds. For nanoseconds though,
negative values can lead to IllegalArgumentExpections when we try to format them
internally.
This change fixes this by explicitely setting the minimum value to 0L (which
corresponds to 1970-01-01T00:00:00.000 for nanosecond resolution) when no other
explicit missing value is defined and the target numeric type is a nanosecond
type (this is true for nanosecond fields and when "numeric_type" is explicitely
set). This way we correct the behaviour for single typed indices and cases where
we are sorting across multiple indices with mixed "date" and "date_nanos" type
where "numeric_type" is set in the sort definition.

Closes #73763
@cbuescher cbuescher added v7.14.0 and removed v7.14.1 labels Jul 7, 2021
cbuescher pushed a commit that referenced this pull request Jul 7, 2021
…#75065)

For missing values on date fields we use Long.MIN_VALUE by default. This is okay
when the resolution of the field is milliseconds. For nanoseconds though,
negative values can lead to IllegalArgumentExpections when we try to format them
internally.
This change fixes this by explicitely setting the minimum value to 0L (which
corresponds to 1970-01-01T00:00:00.000 for nanosecond resolution) when no other
explicit missing value is defined and the target numeric type is a nanosecond
type (this is true for nanosecond fields and when "numeric_type" is explicitely
set). This way we correct the behaviour for single typed indices and cases where
we are sorting across multiple indices with mixed "date" and "date_nanos" type
where "numeric_type" is set in the sort definition.

Closes #73763
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.14.0 v7.15.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sorting descending on datefield with nanosecond-precision fails for missing values
7 participants