Enable index-time sorting #24055

Merged
merged 6 commits into from
Apr 19, 2017

Conversation

jimczi
Contributor

@jimczi jimczi commented Apr 11, 2017

This change adds an index setting to define how the documents should be sorted inside each segment.
It allows any numeric, date, boolean or keyword field inside a mapping to be used to sort the index on disk.
Nested fields are not allowed in an index that defines an index sort, since nested fields rely on the original document order of the index.
This change does not add early termination capabilities to the search layer. These will be added in a follow-up.

Relates #6720
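
To make the intended on-disk order concrete, here is a minimal sketch of the ordering an index sort produces, using the `username` ascending / `date` descending example from the docs hunk quoted later in this review. The `Doc` class and `sortSegment` helper are hypothetical stand-ins and not part of this PR; in practice Lucene applies the sort when segments are written and merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class IndexSortOrderDemo {
    // Hypothetical, simplified document carrying the two sort fields.
    static final class Doc {
        final String username;
        final long date;

        Doc(String username, long date) {
            this.username = username;
            this.date = date;
        }

        @Override
        public String toString() {
            return username + "@" + date;
        }
    }

    // Ascending on username, then descending on date.
    static final Comparator<Doc> INDEX_SORT =
        Comparator.comparing((Doc d) -> d.username)
            .thenComparing(Comparator.comparingLong((Doc d) -> d.date).reversed());

    // Returns a sorted copy, mimicking the document order inside one segment.
    static List<Doc> sortSegment(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(INDEX_SORT);
        return sorted;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(new Doc("bob", 1), new Doc("alice", 1), new Doc("alice", 3));
        System.out.println(sortSegment(docs)); // prints [alice@3, alice@1, bob@1]
    }
}
```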

@jimczi jimczi added the :Core/Infra/Core, >feature, review, and v6.0.0-alpha1 labels Apr 11, 2017
@jimczi jimczi requested a review from jpountz April 11, 2017 22:17
@@ -164,6 +171,23 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
        return builder;
    }

    static void toXContent(XContentBuilder builder, Sort sort) throws IOException {
        builder.startArray(Fields.SORT);
Member

I believe we've been moving away from these Fields objects in general and just naming the constants or even using "sort", depending on the context.

        return missing;
    }

    final String[] fields;
Member

Why package private instead of private?

Member

I think it is also worth leaving a comment explaining that this is stored this way for easy reading from the settings. It looks funny to my Java-accustomed eye.

            fields = new String[0];
        }
        if (fields.length > 0 && indexSettings.getIndexVersionCreated().before(Version.V_6_0_0_alpha1_UNRELEASED)) {
            throw new IllegalArgumentException("unsupported index.version.created:" + indexSettings.getIndexVersionCreated() +
Member

How would we have gotten here? Would they need to use the test plugin to set the version? I'm not sure this is worth checking.

Contributor Author

Not sure either, but this is how we would handle a mixed cluster if we allowed rolling upgrades across major releases. I know it's not possible to have a mixed cluster with 5.x and 6.x nodes, so maybe this is just a paranoid check.

            fields = INDEX_SORT_FIELD_SETTING.get(settings)
                .toArray(new String[0]);
        } else {
            fields = new String[0];
Member

Strings.EMPTY_ARRAY might be worth using here.

                throw new IllegalArgumentException("unknown index sort field:[" + fields[i] + "]");
            }
            boolean reverse = orders[i] == null ? false : (orders[i] == SortOrder.DESC);
            MultiValueMode mode =
Member

This might be easier to read as

MultiValueMode mode = modes[i];
if (mode == null) {
  mode = reverse ? MultiValueMode.MAX : MultiValueMode.MIN;
}
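
For reference, the defaulting rule in this hunk can be sketched in isolation as below. The nested `SortOrder` and `MultiValueMode` enums are simplified stand-ins for the Elasticsearch types, and `isReverse` / `resolveMode` are hypothetical helper names that do not appear in the PR.

```java
final class SortModeDefaults {
    // Simplified stand-ins for the Elasticsearch enums of the same names.
    enum SortOrder { ASC, DESC }
    enum MultiValueMode { MIN, MAX }

    // A missing (null) order is treated as ascending.
    static boolean isReverse(SortOrder order) {
        return order == SortOrder.DESC;
    }

    // A missing (null) mode defaults to MAX for descending sorts and MIN otherwise.
    static MultiValueMode resolveMode(MultiValueMode mode, boolean reverse) {
        if (mode == null) {
            mode = reverse ? MultiValueMode.MAX : MultiValueMode.MIN;
        }
        return mode;
    }

    public static void main(String[] args) {
        System.out.println(resolveMode(null, isReverse(SortOrder.DESC)));               // MAX
        System.out.println(resolveMode(null, isReverse(null)));                         // MIN
        System.out.println(resolveMode(MultiValueMode.MIN, isReverse(SortOrder.DESC))); // MIN
    }
}
```

Note that `orders[i] == null ? false : (orders[i] == SortOrder.DESC)` is equivalent to `orders[i] == SortOrder.DESC`, since a null reference never equals the enum constant.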

MergePolicy mergePolicy,
@Nullable IndexWriterFactory indexWriterFactory,
@Nullable Supplier<SequenceNumbersService> sequenceNumbersServiceSupplier,
@Nullable Sort indexSort) throws IOException {
Member

Can we use the old method and put null all the places that don't use sorting?

Contributor Author

I don't get it. Are you suggesting changing all the calls to createEngine to pass an explicit null value? What would that change?

Member

Yeah, I mean add @Nullable Sort indexSort to one of the old ctors and change all the call sites that don't need a sort to provide null. Or maybe a random one? I'm not sure about that.

The `index.sort.*` settings define which fields should be used to sort the documents inside each Segment.

[WARNING]
`nested` fields uses the original sort of the Segment to work which is why they
Member

nested fields are not compatible with index sorting because they rely on the default doc_id sorting. An error will be thrown if index sorting is activated on an index that contains nested fields.

{
    "settings" : {
        "index" : {
            "sort.field" : ["_type", "date"], <1>
Member

If type is going away maybe we don't want to advertise it here?

  - do:
      indices.create:
        index: test
        wait_for_active_shards: 1
Member

We usually don't have this setting in these tests. If it isn't needed I'd drop it.

          settings:
            number_of_shards: 1
            number_of_replicas: 1
            index.sort.field: _type
Member

Maybe it'd be nicer to do it on a field just so we don't rely on type.

Member

Can you sort on _id? That'd make the example pretty simple.

Contributor

@jpountz jpountz left a comment

I did a first quick pass to understand how things work. I'm wondering whether you considered configuring the index sort in the mappings rather than the settings?

            builder.field("mode", ((SortedSetSortField) field).getSelector().toString());
        }
        builder.field("missing", field.getMissingValue());
        builder.field("missing", field.getReverse());
Contributor

s/missing/reverse/

        // The sort order is validated right after the merge of the mapping later in the process.
        this.indexSortSupplier = () -> indexSettings.getIndexSortConfig().buildIndexSort(
            (name) -> mapperService.fullName(name),
            (ft) -> indexFieldData.getForField(ft)
Contributor

let's use method references instead?
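
As a rough illustration of this suggestion, the two lambdas would become `mapperService::fullName` and `indexFieldData::getForField`. The sketch below shows the equivalence with a hypothetical `Mapper` class standing in for `MapperService`; its `fullName` body and the `resolve` helper are invented for the demo.

```java
import java.util.function.Function;

final class MethodReferenceDemo {
    // Hypothetical stand-in for MapperService; the returned value is invented.
    static final class Mapper {
        String fullName(String name) {
            return "resolved:" + name;
        }
    }

    // Accepts any String-to-String lookup, whether a lambda or a method reference.
    static String resolve(Function<String, String> lookup, String name) {
        return lookup.apply(name);
    }

    public static void main(String[] args) {
        Mapper mapper = new Mapper();
        String viaLambda = resolve((name) -> mapper.fullName(name), "date");
        String viaMethodRef = resolve(mapper::fullName, "date");
        System.out.println(viaLambda.equals(viaMethodRef)); // true
    }
}
```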

                .toArray(FieldSortSpec[]::new);
        } else {
            sortSpecs = new FieldSortSpec[0];
        }
Contributor

I think the if/else is not needed as the code in the if block would work in all cases?
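
A small sketch of why the branch may be redundant, assuming the `if` block builds the array from a stream over the configured field names (the full block is not visible in this hunk): a stream over an empty list already yields an empty array. The `toSpecs` helper and its string mapping are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class EmptyStreamDemo {
    // Hypothetical mapping step: one spec string per configured field name.
    static String[] toSpecs(List<String> fields) {
        return fields.stream()
            .map(field -> "spec(" + field + ")")
            .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // No special case is needed: an empty input produces a zero-length array.
        System.out.println(toSpecs(Collections.<String>emptyList()).length);   // 0
        System.out.println(Arrays.toString(toSpecs(Arrays.asList("date"))));   // [spec(date)]
    }
}
```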

            builder.field("mode", ((SortedNumericSortField) field).getSelector().toString());
        } else if (field instanceof SortedSetSortField) {
            builder.field("mode", ((SortedSetSortField) field).getSelector().toString());
        }
Contributor

should we lowercase the modes?

            IndexSortConfig::validateMissingValue, Setting.Property.IndexScope, Setting.Property.Final);

    private static String validateMissingValue(String missing) {
        if ("_last".equals(missing) == false && "_first".equals(missing) == false) {
Contributor

not specific to that PR, but we should create constants for _first and _last
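
One possible shape for that follow-up, as a sketch only: the constant names `MISSING_FIRST` and `MISSING_LAST` are hypothetical, and the exception message is an assumption because the rest of the method is not visible in this hunk.

```java
final class MissingValueDemo {
    // Hypothetical constant names; the PR compares against string literals.
    static final String MISSING_FIRST = "_first";
    static final String MISSING_LAST = "_last";

    // Accepts only the two sentinel values and returns the input unchanged.
    static String validateMissingValue(String missing) {
        if (MISSING_LAST.equals(missing) == false && MISSING_FIRST.equals(missing) == false) {
            // Assumed message; the real text is not shown in the review.
            throw new IllegalArgumentException("Illegal missing value:[" + missing + "], "
                + "must be one of [" + MISSING_LAST + ", " + MISSING_FIRST + "]");
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(validateMissingValue("_last"));
        try {
            validateMissingValue("last");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```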

@jimczi
Contributor Author

jimczi commented Apr 13, 2017

Thanks @jpountz and @nik9000 for reviewing.

I'm wondering whether you considered configuring the index sort in the mappings rather than the settings?

I did, but currently the mapping is per type and I did not find an easy way to define something at the mapping level rather than the type level. I am not saying we should not do it, but it would require some non-trivial changes in how we treat mappings. Maybe we could revisit this when we remove _type entirely? Defining the index sort in the settings felt natural to me, so I followed that path. It requires some validation between the mapping and the settings, but I think the change is not that big. WDYT?

Contributor

@jpountz jpountz left a comment

LGTM.

My previous comment about configuring the index sort in the mappings rather than in the settings is not practical. We might want to reconsider when types are gone, but for now I think settings are the way to go.

Can you please add experimental tags to this feature in the docs saying that we might change the way that the index sort is configured?


When creating a new index in elasticsearch it is possible to configure how the Segments
inside each Shard will be sorted. By default Lucene does not apply any sort and uses the
internal _doc_id_ to do the ordering.
Contributor

I think saying that segments are ordered by doc id is a bit confusing. It rather works the other way: the ordering of documents inside a segment defines the doc ids. Maybe just keep it to a minimum, e.g. "By default Lucene does not apply any sort."

The `index.sort.*` settings define which fields should be used to sort the documents inside each Segment.

[WARNING]
nested fields are not compatible with index sorting because they rely on the default doc_id sorting.
Contributor

s/nested/Nested/ and maybe s/on the default doc_id sorting/on the assumption that nested documents are stored in contiguous doc ids, which can be broken by index sorting/?
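
To illustrate the contiguity assumption mentioned here, the sketch below models each nested block as a run of adjacent documents and shows that sorting by an arbitrary field can interleave the blocks. The `Doc` class, its fields, and the sample values are hypothetical; this is a toy model, not how Lucene stores block joins internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NestedBlockDemo {
    // Hypothetical document: the block it was indexed in and a sortable value.
    static final class Doc {
        final int block;       // non-negative block id
        final boolean parent;  // children precede their parent within a block
        final long sortValue;

        Doc(int block, boolean parent, long sortValue) {
            this.block = block;
            this.parent = parent;
            this.sortValue = sortValue;
        }
    }

    // Two blocks in indexing order: children first, then the parent.
    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc(0, false, 30), new Doc(0, false, 5), new Doc(0, true, 20),
            new Doc(1, false, 10), new Doc(1, true, 25));
    }

    // Mimics an index sort on the sortValue field.
    static List<Doc> sortedByValue(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingLong((Doc d) -> d.sortValue));
        return sorted;
    }

    // True if each block occupies a single contiguous run of positions.
    static boolean blocksContiguous(List<Doc> docs) {
        Set<Integer> seen = new HashSet<>();
        int previous = -1; // block ids are non-negative, so -1 means "none yet"
        for (Doc d : docs) {
            if (d.block != previous) {
                if (!seen.add(d.block)) {
                    return false; // the block reappears after another block
                }
                previous = d.block;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Doc> docs = sampleDocs();
        System.out.println(blocksContiguous(docs));                // true
        System.out.println(blocksContiguous(sortedByValue(docs))); // false
    }
}
```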

<2> ... in ascending order for the `username` field and in descending order for the `date` field.


Index sorting supports the following setting:
Contributor

s/setting/settings/

jimczi added 6 commits April 19, 2017 13:24
This change adds an index setting to define how the documents should be sorted inside each segment.
It allows any numeric, date, boolean or keyword field inside a mapping to be used to sort the index on disk.
`nested` fields are not allowed in an index that defines an index sort, since `nested` fields rely on the original document order of the index.
This change does not add early termination capabilities to the search layer. These will be added in a follow-up.

Relates #6720
@jimczi jimczi merged commit f05af0a into elastic:master Apr 19, 2017
@jimczi jimczi deleted the feature/index_sorting branch April 19, 2017 12:36
@jimczi
Contributor Author

jimczi commented Apr 19, 2017

Thanks @jpountz !

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 19, 2017
* master:
  Add BucketMetricValue interface (elastic#24188)
  Enable index-time sorting (elastic#24055)
  Clarify elasticsearch user uid:gid mapping in Docker docs
  Update field-names-field.asciidoc (elastic#24178)
  ElectMasterService.hasEnoughMasterNodes should return false if no masters were found
  Remove Ubuntu 12.04 (elastic#24161)
  [Test] Add unit tests for InternalHDRPercentilesTests (elastic#24157)
  Replicate write failures (elastic#23314)
  Rename variable in translog simple commit test
  Strengthen translog commit with open view test
  Stronger check in translog prepare and commit test
  Fix translog prepare commit and commit test
  ingest-node.asciidoc - Clarify json processor (elastic#21876)
  Painless: more testing for script_stack (elastic#24168)
4 participants