-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move index analyzer management to FieldMapper/MapperService #63937
Conversation
Pinging @elastic/es-search (:Search/Mapping) |
} | ||
|
||
public static class Builder extends ParametrizedFieldMapper.Builder { | ||
|
||
private final Parameter<Boolean> index = Parameter.indexParam(m -> toType(m).index, true); | ||
private final Parameter<Boolean> store = Parameter.storeParam(m -> toType(m).store, false); | ||
private final Parameter<Boolean> index = Parameter.indexParam(m -> builder(m).index.get(), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes are due to position increment handling moving directly into TextParams.Analyzers
, which simplifies a lot of the palaver around building analyzers for text fields. SearchAsYouType doesn't actually expose a position increment field so doesn't get the added benefits here, but it still requires a small refactor.
@@ -220,10 +220,8 @@ private static Analyzer buildCustomAnalyzer(AnalyzeAction.Request request, Analy | |||
List<AnalyzeAction.AnalyzeToken> tokens = new ArrayList<>(); | |||
int lastPosition = -1; | |||
int lastOffset = 0; | |||
// Note that we always pass "" as the field to the various Analyzer methods, because | |||
// the analyzers we use here are all field-specific and so ignore this parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The upshot of this change is to make the analyze action work in precisely the way this comment says it doesn't... if the analyzer hasn't been built specifically for this request, then we're examining the analyzer for a specific field, and so we use the general index analyzer and pass the field name so that it delegates to the correct field analyzer.
@@ -48,4 +48,20 @@ protected Analyzer getWrappedAnalyzer(String fieldName) { | |||
// Fields need to be explicitly added | |||
throw new IllegalArgumentException("Field [" + fieldName + "] has no associated analyzer"); | |||
} | |||
|
|||
public boolean containsBrokenAnalysis(String field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moved from the fastvector highlighter - it would be nice to actually fix this in Lucene rather than have this annoying check.
@@ -402,7 +376,7 @@ public void parse(ParseContext context) throws IOException { | |||
} | |||
// truncate input | |||
if (input.length() > maxInputLength) { | |||
int len = Math.min(maxInputLength, input.length()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an IDE-recommended change, we've just asserted that maxInputLength is smaller than input.length(), so Math.min will always return it.
@@ -400,17 +402,6 @@ private void mergeSharedOptions(FieldMapper mergeWith, List<String> conflicts) { | |||
if (fieldType.storeTermVectorPayloads() != other.storeTermVectorPayloads()) { | |||
conflicts.add("mapper [" + name() + "] has different [store_term_vector_payloads] values"); | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing calls this anymore (all mappers that use analyzers have been parametrized) so it is safe to remove.
@@ -460,35 +451,6 @@ protected void doXContentBody(XContentBuilder builder, boolean includeDefaults, | |||
} | |||
} | |||
|
|||
protected final void doXContentAnalyzers(XContentBuilder builder, boolean includeDefaults) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not called from anywhere so can be removed.
} | ||
|
||
public static MappingLookup fromMapping(Mapping mapping, Analyzer defaultIndex) { | ||
public static MappingLookup fromMapping(Mapping mapping) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need the default analyzer any more because text-based field mappers now always provide an analyzer even if they're using a default.
@@ -569,8 +550,11 @@ public Query existsQuery(QueryShardContext context) { | |||
|
|||
private static final class PhraseFieldMapper extends FieldMapper { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will be able to remove these entirely in a follow-up
@@ -80,7 +78,10 @@ protected AggregationBuilder createAggBuilderForTypeTest(MappedFieldType fieldTy | |||
protected List<ValuesSourceType> getSupportedValuesSourceTypes() { | |||
// TODO it is likely accidental that SigText supports anything other than Bytes, and then only text fields | |||
return List.of(CoreValuesSourceType.NUMERIC, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are a consequence of changing to checking for TextSearchInfo.NONE rather than an index analyzer. Again, it is sort of accidental that they work, and it may not make any real sense, but they do...
This will need #63945 to go in first, otherwise we get random failures from SignificantText aggs on IP addresses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of questions/comments, nothing major though, thanks for working on this!
@@ -506,6 +468,8 @@ protected static String indexOptionToString(IndexOptions indexOption) { | |||
|
|||
protected abstract String contentType(); | |||
|
|||
public abstract void registerIndexAnalyzer(BiConsumer<String, Analyzer> analyzerRegistry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be package private.
Couple more thoughts: do we need a biconsumer? it seems like every mapper could expose its own analyzers through a Map<String, Analyzer>. This would though be quite a complicated API for the regular case, where a field either registers nothing or registers one analyzer with the same name as the field. Could we simplify things for those cases? I do see that you may want the method as abstract so you don't forget to implement it, but it requires so many empty implementations that I wonder if we should simplify this.
To summarize, how about something like this:
//override this if you need multiple analyzers for the same field
Map<String, Analyzer> getAnalyzers() {
return Collections.singletonMap(name(), getFieldAnalyzer());
}
//most field mappers implement this one, possibly we could return a placeholder for the case where no analyzer is needed, not sure if we want to not make it abstract, I am undecided on that.
abstract Analyzer getFieldAnalyzer();
Two methods would have the advantage that it would be immediately traceable which mappers register multiple fields, if that matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to create a FieldMapper subclass called AnalyzedFieldMapper or similar that makes this abstract, and then have a default impl on FieldMapper itself that does nothing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could also work, but I would still find it weird that we need a biconsumer as an argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this to return a Map instead.
@elasticmachine run elasticsearch-ci/1 |
@elasticmachine run elasticsearch-ci/packaging-sample-windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a small comment, LGTM otherwise
@@ -161,4 +161,9 @@ protected String contentType() { | |||
return CONTENT_TYPE; | |||
} | |||
|
|||
@Override | |||
public Map<String, NamedAnalyzer> indexAnalyzers() { | |||
return Collections.singletonMap(name(), Lucene.KEYWORD_ANALYZER); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have an easier path for this common scenario where there is only one analyzer and it is registered with the same name as the field? the mapper in this case only needs to expose which analyzer it is? Having two paths is not optimal, but I think it would help highlighting which mappers are different in that they provide multiple analyzers.
@javanna I've updated things to simplify implementations. You can now pass either a map of field names to analyzers, or a single analyzer, to the FieldMapper super constructor. Mappers that don't use the terms index don't pass anything and keep their existing implementations. Mappers that have a single field analyzer just pass the analyzer, and we automatically wrap it with the field name. Mappers that have subfields (currently only |
@elasticmachine update branch |
@elasticmachine update branch |
@@ -72,6 +72,7 @@ public HighlightField highlight(FieldHighlightContext fieldContext) throws IOExc | |||
FetchSubPhase.HitContext hitContext = fieldContext.hitContext; | |||
MappedFieldType fieldType = fieldContext.fieldType; | |||
boolean forceSource = fieldContext.forceSource; | |||
boolean fixBrokenAnalysis = fieldContext.context.mapperService().containsBrokenAnalysis(fieldContext.fieldName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is introducing another usage of FetchContext#mapperService() that we'd like to remove :( could we find a way around it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a delegator method, FetchContext#containsBrokenAnalysis()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left one comment about a new usage of FetchContext#mapperService() , LGTM otherwise
* backwards offsets in term vectors | ||
*/ | ||
public boolean containsBrokenAnalysis(String field) { | ||
return mapperService().containsBrokenAnalysis(field); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also calls mapperService() :) can you add the method to QueryShardContext and call it from there?
…63937) Index-time analyzers are currently specified on the MappedFieldType. This has a number of unfortunate consequences; for example, field mappers that index data into implementation sub-fields, such as prefix or phrase accelerators on text fields, need to expose these sub-fields as MappedFieldTypes, which means that they then appear in field caps, are externally searchable, etc. It also adds index-time logic to a class that should only be concerned with search-time behaviour. This commit removes references to the index analyzer from MappedFieldType. Instead, FieldMappers that use the terms index can pass either a single analyzer or a Map of fields to analyzers to their super constructor, which are then exposed via a new FieldMapper#indexAnalyzers() method; all index-time analysis is mediated through the delegating analyzer wrapper on MapperService. In a follow-up, this will make it possible to register multiple field analyzers from a single FieldMapper, removing the need for 'hidden' mapper implementations on text field, parent joins, and elsewhere.
…64592) Index-time analyzers are currently specified on the MappedFieldType. This has a number of unfortunate consequences; for example, field mappers that index data into implementation sub-fields, such as prefix or phrase accelerators on text fields, need to expose these sub-fields as MappedFieldTypes, which means that they then appear in field caps, are externally searchable, etc. It also adds index-time logic to a class that should only be concerned with search-time behaviour. This commit removes references to the index analyzer from MappedFieldType. Instead, FieldMappers that use the terms index can pass either a single analyzer or a Map of fields to analyzers to their super constructor, which are then exposed via a new FieldMapper#indexAnalyzers() method; all index-time analysis is mediated through the delegating analyzer wrapper on MapperService. In a follow-up, this will make it possible to register multiple field analyzers from a single FieldMapper, removing the need for 'hidden' mapper implementations on text field, parent joins, and elsewhere.
Index-time analyzers are currently specified on the MappedFieldType. This
has a number of unfortunate consequences; for example, field mappers that
index data into implementation sub-fields, such as prefix or phrase
accelerators on text fields, need to expose these sub-fields as MappedFieldTypes,
which means that they then appear in field caps, are externally searchable,
etc. It also adds index-time logic to a class that should only be concerned
with search-time behaviour.
This commit removes references to the index analyzer from MappedFieldType,
and instead adds a 'registerIndexAnalyzer' method to FieldMapper; all
index-time analysis is mediated through the delegating analyzer wrapper on
MapperService. In a follow-up, this will make it possible to register
multiple field analyzers from a single FieldMapper, removing the need
for 'hidden' mapper implementations on text field, parent joins, and
elsewhere.