Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move index analyzer management to FieldMapper/MapperService #63937

Merged
merged 31 commits into from
Nov 4, 2020

Conversation

romseygeek
Copy link
Contributor

Index-time analyzers are currently specified on the MappedFieldType. This
has a number of unfortunate consequences; for example, field mappers that
index data into implementation sub-fields, such as prefix or phrase
accelerators on text fields, need to expose these sub-fields as MappedFieldTypes,
which means that they then appear in field caps, are externally searchable,
etc. It also adds index-time logic to a class that should only be concerned
with search-time behaviour.

This commit removes references to the index analyzer from MappedFieldType,
and instead adds a 'registerIndexAnalyzer' method to FieldMapper; all
index-time analysis is mediated through the delegating analyzer wrapper on
MapperService. In a follow-up, this will make it possible to register
multiple field analyzers from a single FieldMapper, removing the need
for 'hidden' mapper implementations on text field, parent joins, and
elsewhere.

@romseygeek romseygeek added :Search Foundations/Mapping Index mappings, including merging and defining field types >refactoring v8.0.0 v7.11.0 labels Oct 20, 2020
@romseygeek romseygeek self-assigned this Oct 20, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Oct 20, 2020
}

public static class Builder extends ParametrizedFieldMapper.Builder {

private final Parameter<Boolean> index = Parameter.indexParam(m -> toType(m).index, true);
private final Parameter<Boolean> store = Parameter.storeParam(m -> toType(m).store, false);
private final Parameter<Boolean> index = Parameter.indexParam(m -> builder(m).index.get(), true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are due to position increment handling moving directly into TextParams.Analyzers, which simplifies a lot of the palaver around building analyzers for text fields. SearchAsYouType doesn't actually expose a position increment field so doesn't get the added benefits here, but it still requires a small refactor.

@@ -220,10 +220,8 @@ private static Analyzer buildCustomAnalyzer(AnalyzeAction.Request request, Analy
List<AnalyzeAction.AnalyzeToken> tokens = new ArrayList<>();
int lastPosition = -1;
int lastOffset = 0;
// Note that we always pass "" as the field to the various Analyzer methods, because
// the analyzers we use here are all field-specific and so ignore this parameter
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upshot of this change is to make the analyze action work in precisely the way this comment says it doesn't... if the analyzer hasn't been built specifically for this request, then we're examining the analyzer for a specific field, and so we use the general index analyzer and pass the field name so that it delegates to the correct field analyzer.

@@ -48,4 +48,20 @@ protected Analyzer getWrappedAnalyzer(String fieldName) {
// Fields need to be explicitly added
throw new IllegalArgumentException("Field [" + fieldName + "] has no associated analyzer");
}

public boolean containsBrokenAnalysis(String field) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved from the fastvector highlighter - it would be nice to actually fix this in Lucene rather than have this annoying check.

@@ -402,7 +376,7 @@ public void parse(ParseContext context) throws IOException {
}
// truncate input
if (input.length() > maxInputLength) {
int len = Math.min(maxInputLength, input.length());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an IDE-recommended change, we've just asserted that maxInputLength is smaller than input.length(), so Math.min will always return it.

@@ -400,17 +402,6 @@ private void mergeSharedOptions(FieldMapper mergeWith, List<String> conflicts) {
if (fieldType.storeTermVectorPayloads() != other.storeTermVectorPayloads()) {
conflicts.add("mapper [" + name() + "] has different [store_term_vector_payloads] values");
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing calls this anymore (all mappers that use analyzers have been parametrized) so it is safe to remove.

@@ -460,35 +451,6 @@ protected void doXContentBody(XContentBuilder builder, boolean includeDefaults,
}
}

protected final void doXContentAnalyzers(XContentBuilder builder, boolean includeDefaults) throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not called from anywhere so can be removed.

}

public static MappingLookup fromMapping(Mapping mapping, Analyzer defaultIndex) {
public static MappingLookup fromMapping(Mapping mapping) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need the default analyzer any more because text-based field mappers now always provide an analyzer even if they're using a default.

@@ -569,8 +550,11 @@ public Query existsQuery(QueryShardContext context) {

private static final class PhraseFieldMapper extends FieldMapper {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will be able to remove these entirely in a follow-up

@@ -80,7 +78,10 @@ protected AggregationBuilder createAggBuilderForTypeTest(MappedFieldType fieldTy
protected List<ValuesSourceType> getSupportedValuesSourceTypes() {
// TODO it is likely accidental that SigText supports anything other than Bytes, and then only text fields
return List.of(CoreValuesSourceType.NUMERIC,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are a consequence of changing to checking for TextSearchInfo.NONE rather than an index analyzer. Again, it is sort of accidental that they work, and it may not make any real sense, but they do...

@romseygeek
Copy link
Contributor Author

This will need #63945 to go in first, otherwise we get random failures from SignificantText aggs on IP addresses.

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of questions/comments, nothing major though, thanks for working on this!

@@ -506,6 +468,8 @@ protected static String indexOptionToString(IndexOptions indexOption) {

protected abstract String contentType();

public abstract void registerIndexAnalyzer(BiConsumer<String, Analyzer> analyzerRegistry);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be package private.
Couple more thoughts: do we need a biconsumer? it seems like every mapper could expose its own analyzers through a Map<String, Analyzer>. This would though be quite a complicated API for the regular case, where a field either registers nothing or registers one analyzer with the same name as the field. Could we simplify things for those cases? I do see that you may want the method as abstract so you don't forget to implement it, but it requires so many empty implementations that I wonder if we should simplify this.

To summarize, how about something like this:

//override this if you need multiple analyzers for the same field
Map<String, Analyzer> getAnalyzers() {
    return Collections.singletonMap(name(), getFieldAnalyzer());
}

//most field mappers implement this one, possibly we could return a placeholder for the case where no analyzer is needed, not sure if we want to not make it abstract, I am undecided on that.
abstract Analyzer getFieldAnalyzer();

Two methods would have the advantage that it would be immediately traceable which mappers register multiple fields, if that matters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to create a FieldMapper subclass called AnalyzedFieldMapper or similar that makes this abstract, and then have a default impl on FieldMapper itself that does nothing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could also work, but I would still find it weird that we need a biconsumer as an argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this to return a Map instead.

@romseygeek
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/1
@elasticmachine run elasticsearch-ci/2

@romseygeek
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample-windows

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a small comment, LGTM otherwise

@@ -161,4 +161,9 @@ protected String contentType() {
return CONTENT_TYPE;
}

@Override
public Map<String, NamedAnalyzer> indexAnalyzers() {
return Collections.singletonMap(name(), Lucene.KEYWORD_ANALYZER);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have an easier path for this common scenario where there is only one analyzer and it is registered with the same name as the field? the mapper in this case only needs to expose which analyzer it is? Having two paths is not optimal, but I think it would help highlighting which mappers are different in that they provide multiple analyzers.

@romseygeek
Copy link
Contributor Author

@javanna I've updated things to simplify implementations. You can now pass either a map of field names to analyzers, or a single analyzer, to the FieldMapper super constructor. Mappers that don't use the terms index don't pass anything and keep their existing implementations. Mappers that have a single field analyzer just pass the analyzer, and we automatically wrap it with the field name. Mappers that have subfields (currently only text and search_as_you_type but I can see others eg parent-join using this as well) collect all their analyzers into a Map and pass that.

@romseygeek romseygeek requested a review from javanna November 2, 2020 17:02
@romseygeek
Copy link
Contributor Author

@elasticmachine update branch

@romseygeek
Copy link
Contributor Author

@elasticmachine update branch

@@ -72,6 +72,7 @@ public HighlightField highlight(FieldHighlightContext fieldContext) throws IOExc
FetchSubPhase.HitContext hitContext = fieldContext.hitContext;
MappedFieldType fieldType = fieldContext.fieldType;
boolean forceSource = fieldContext.forceSource;
boolean fixBrokenAnalysis = fieldContext.context.mapperService().containsBrokenAnalysis(fieldContext.fieldName);
Copy link
Member

@javanna javanna Nov 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is introducing another usage of FetchContext#mapperService() that we'd like to remove :( could we find a way around it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a delegator method, FetchContext#containsBrokenAnalysis()

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one comment about a new usage of FetchContext#mapperService() , LGTM otherwise

* backwards offsets in term vectors
*/
public boolean containsBrokenAnalysis(String field) {
return mapperService().containsBrokenAnalysis(field);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also calls mapperService() :) can you add the method to QueryShardContext and call it from there?

@romseygeek romseygeek merged commit f010269 into elastic:master Nov 4, 2020
@romseygeek romseygeek deleted the mapper/indexanalyzer branch November 4, 2020 13:53
romseygeek added a commit to romseygeek/elasticsearch that referenced this pull request Nov 4, 2020
…63937)

Index-time analyzers are currently specified on the MappedFieldType. This
has a number of unfortunate consequences; for example, field mappers that
index data into implementation sub-fields, such as prefix or phrase
accelerators on text fields, need to expose these sub-fields as MappedFieldTypes,
which means that they then appear in field caps, are externally searchable,
etc. It also adds index-time logic to a class that should only be concerned
with search-time behaviour.

This commit removes references to the index analyzer from MappedFieldType.
Instead, FieldMappers that use the terms index can pass either a single analyzer
or a Map of fields to analyzers to their super constructor, which are then
exposed via a new FieldMapper#indexAnalyzers() method; all index-time analysis
is mediated through the delegating analyzer wrapper on MapperService.
In a follow-up, this will make it possible to register multiple field analyzers from
a single FieldMapper, removing the need for 'hidden' mapper implementations
on text field, parent joins, and elsewhere.
romseygeek added a commit that referenced this pull request Nov 4, 2020
…64592)

Index-time analyzers are currently specified on the MappedFieldType. This
has a number of unfortunate consequences; for example, field mappers that
index data into implementation sub-fields, such as prefix or phrase
accelerators on text fields, need to expose these sub-fields as MappedFieldTypes,
which means that they then appear in field caps, are externally searchable,
etc. It also adds index-time logic to a class that should only be concerned
with search-time behaviour.

This commit removes references to the index analyzer from MappedFieldType.
Instead, FieldMappers that use the terms index can pass either a single analyzer
or a Map of fields to analyzers to their super constructor, which are then
exposed via a new FieldMapper#indexAnalyzers() method; all index-time analysis
is mediated through the delegating analyzer wrapper on MapperService.
In a follow-up, this will make it possible to register multiple field analyzers from
a single FieldMapper, removing the need for 'hidden' mapper implementations
on text field, parent joins, and elsewhere.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>refactoring :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v7.11.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants