Move index analyzer management to FieldMapper/MapperService #63937

romseygeek · 2020-10-20T13:12:04Z

Index-time analyzers are currently specified on the MappedFieldType. This
has a number of unfortunate consequences; for example, field mappers that
index data into implementation sub-fields, such as prefix or phrase
accelerators on text fields, need to expose these sub-fields as MappedFieldTypes,
which means that they then appear in field caps, are externally searchable,
etc. It also adds index-time logic to a class that should only be concerned
with search-time behaviour.

This commit removes references to the index analyzer from MappedFieldType,
and instead adds a 'registerIndexAnalyzer' method to FieldMapper; all
index-time analysis is mediated through the delegating analyzer wrapper on
MapperService. In a follow-up, this will make it possible to register
multiple field analyzers from a single FieldMapper, removing the need
for 'hidden' mapper implementations on text field, parent joins, and
elsewhere.
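
To illustrate the per-field delegation described above, here is a minimal sketch in plain Java. It is not the Elasticsearch implementation: `SketchAnalyzer`, `DelegatingIndexAnalyzer`, `DelegationDemo`, and the `title._prefix` sub-field name are hypothetical stand-ins, and the real code works with Lucene `Analyzer` instances. The only point shown is that a mapper can register analyzers for internal sub-fields by name, without those sub-fields having to exist as searchable field types.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a Lucene Analyzer: turns text into a list of tokens.
interface SketchAnalyzer {
    List<String> analyze(String text);
}

// Hypothetical index-wide wrapper that delegates to the analyzer registered for a field.
final class DelegatingIndexAnalyzer {
    private final Map<String, SketchAnalyzer> analyzers = new HashMap<>();

    // A mapper may register several entries: its own field plus internal sub-fields.
    void register(String field, SketchAnalyzer analyzer) {
        if (analyzers.putIfAbsent(field, analyzer) != null) {
            throw new IllegalArgumentException("Field [" + field + "] already has an analyzer");
        }
    }

    List<String> analyze(String field, String text) {
        SketchAnalyzer analyzer = analyzers.get(field);
        if (analyzer == null) {
            // Fields need to be explicitly added
            throw new IllegalArgumentException("Field [" + field + "] has no associated analyzer");
        }
        return analyzer.analyze(text);
    }
}

final class DelegationDemo {
    static DelegatingIndexAnalyzer build() {
        DelegatingIndexAnalyzer index = new DelegatingIndexAnalyzer();
        // Main field: lowercase and split on whitespace (assumed behaviour, for illustration).
        index.register("title", text -> Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
        // Internal sub-field: keeps the whole input as a single token.
        index.register("title._prefix", text -> List.of(text));
        return index;
    }

    public static void main(String[] args) {
        DelegatingIndexAnalyzer index = build();
        System.out.println(index.analyze("title", "Quick Fox"));
        System.out.println(index.analyze("title._prefix", "Quick Fox"));
    }
}
```

Callers always go through the single wrapper and pass the field name, which is the same shape the analyze action ends up using when it examines the analyzer for a specific field.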

elasticmachine · 2020-10-20T13:12:06Z

Pinging @elastic/es-search (:Search/Mapping)

romseygeek · 2020-10-20T13:14:23Z

...s/mapper-extras/src/main/java/org/elasticsearch/index/mapper/SearchAsYouTypeFieldMapper.java

    }

    public static class Builder extends ParametrizedFieldMapper.Builder {

-        private final Parameter<Boolean> index = Parameter.indexParam(m -> toType(m).index, true);
-        private final Parameter<Boolean> store = Parameter.storeParam(m -> toType(m).store, false);
+        private final Parameter<Boolean> index = Parameter.indexParam(m -> builder(m).index.get(), true);


These changes are due to position increment handling moving directly into TextParams.Analyzers, which simplifies a lot of the palaver around building analyzers for text fields. SearchAsYouType doesn't actually expose a position increment field so doesn't get the added benefits here, but it still requires a small refactor.

romseygeek · 2020-10-20T13:19:40Z

server/src/main/java/org/elasticsearch/action/admin/indices/analyze/TransportAnalyzeAction.java

@@ -220,10 +220,8 @@ private static Analyzer buildCustomAnalyzer(AnalyzeAction.Request request, Analy
        List<AnalyzeAction.AnalyzeToken> tokens = new ArrayList<>();
        int lastPosition = -1;
        int lastOffset = 0;
-        // Note that we always pass "" as the field to the various Analyzer methods, because
-        // the analyzers we use here are all field-specific and so ignore this parameter


The upshot of this change is to make the analyze action work in precisely the way this comment says it doesn't... if the analyzer hasn't been built specifically for this request, then we're examining the analyzer for a specific field, and so we use the general index analyzer and pass the field name so that it delegates to the correct field analyzer.

romseygeek · 2020-10-20T13:20:47Z

server/src/main/java/org/elasticsearch/index/analysis/FieldNameAnalyzer.java

@@ -48,4 +48,20 @@ protected Analyzer getWrappedAnalyzer(String fieldName) {
        // Fields need to be explicitly added
        throw new IllegalArgumentException("Field [" + fieldName + "] has no associated analyzer");
    }
+
+    public boolean containsBrokenAnalysis(String field) {


This is moved from the fastvector highlighter - it would be nice to actually fix this in Lucene rather than have this annoying check.

romseygeek · 2020-10-20T13:22:29Z

server/src/main/java/org/elasticsearch/index/mapper/CompletionFieldMapper.java

@@ -402,7 +376,7 @@ public void parse(ParseContext context) throws IOException {
            }
            // truncate input
            if (input.length() > maxInputLength) {
-                int len = Math.min(maxInputLength, input.length());


This was an IDE-recommended change: the enclosing if has just established that maxInputLength is smaller than input.length(), so Math.min will always return maxInputLength.
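
A small sketch of why the call is redundant. `TruncateSketch.truncate` is a hypothetical helper that only mirrors the shape of the check in the hunk above; it ignores anything else the real parse method does with the input.

```java
final class TruncateSketch {
    // Hypothetical helper, not the CompletionFieldMapper code itself.
    static String truncate(String input, int maxInputLength) {
        if (input.length() > maxInputLength) {
            // Inside this branch maxInputLength < input.length() holds, so
            // Math.min(maxInputLength, input.length()) is always maxInputLength.
            return input.substring(0, maxInputLength);
        }
        return input;
    }

    public static void main(String[] args) {
        System.out.println(truncate("elasticsearch", 7));
        System.out.println(truncate("abc", 7));
    }
}
```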

romseygeek · 2020-10-20T13:23:17Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

@@ -400,17 +402,6 @@ private void mergeSharedOptions(FieldMapper mergeWith, List<String> conflicts) {
        if (fieldType.storeTermVectorPayloads() != other.storeTermVectorPayloads()) {
            conflicts.add("mapper [" + name() + "] has different [store_term_vector_payloads] values");
        }
-


Nothing calls this anymore (all mappers that use analyzers have been parametrized) so it is safe to remove.

romseygeek · 2020-10-20T13:23:32Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

@@ -460,35 +451,6 @@ protected void doXContentBody(XContentBuilder builder, boolean includeDefaults,
        }
    }

-    protected final void doXContentAnalyzers(XContentBuilder builder, boolean includeDefaults) throws IOException {


This is not called from anywhere so can be removed.

romseygeek · 2020-10-20T13:25:23Z

server/src/main/java/org/elasticsearch/index/mapper/MappingLookup.java

-    }
-
-    public static MappingLookup fromMapping(Mapping mapping, Analyzer defaultIndex) {
+    public static MappingLookup fromMapping(Mapping mapping) {


We don't need the default analyzer any more because text-based field mappers now always provide an analyzer even if they're using a default.

romseygeek · 2020-10-20T13:26:44Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -569,8 +550,11 @@ public Query existsQuery(QueryShardContext context) {

    private static final class PhraseFieldMapper extends FieldMapper {



We will be able to remove these entirely in a follow-up.

romseygeek · 2020-10-20T13:30:04Z

.../java/org/elasticsearch/search/aggregations/bucket/terms/SignificantTextAggregatorTests.java

@@ -80,7 +78,10 @@ protected AggregationBuilder createAggBuilderForTypeTest(MappedFieldType fieldTy
    protected List<ValuesSourceType> getSupportedValuesSourceTypes() {
        // TODO it is likely accidental that SigText supports anything other than Bytes, and then only text fields
        return List.of(CoreValuesSourceType.NUMERIC,


These are a consequence of changing to checking for TextSearchInfo.NONE rather than an index analyzer. Again, it is sort of accidental that they work, and it may not make any real sense, but they do...

romseygeek · 2020-10-21T08:43:36Z

This will need #63945 to go in first, otherwise we get random failures from SignificantText aggs on IP addresses.

javanna

I left a couple of questions/comments, nothing major though, thanks for working on this!

server/src/main/java/org/elasticsearch/index/mapper/MapperService.java

javanna · 2020-10-21T10:58:18Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

@@ -506,6 +468,8 @@ protected static String indexOptionToString(IndexOptions indexOption) {

    protected abstract String contentType();

+    public abstract void registerIndexAnalyzer(BiConsumer<String, Analyzer> analyzerRegistry);


I think this could be package private.
A couple more thoughts: do we need a BiConsumer? It seems like every mapper could expose its own analyzers through a Map<String, Analyzer>. That would, though, be quite a complicated API for the regular case, where a field either registers nothing or registers one analyzer with the same name as the field. Could we simplify things for those cases? I do see that you may want the method to be abstract so you don't forget to implement it, but it requires so many empty implementations that I wonder if we should simplify this.

To summarize, how about something like this:

    // override this if you need multiple analyzers for the same field
    Map<String, Analyzer> getAnalyzers() {
        return Collections.singletonMap(name(), getFieldAnalyzer());
    }

    // most field mappers implement this one, possibly we could return a placeholder for the case
    // where no analyzer is needed, not sure if we want to not make it abstract, I am undecided on that.
    abstract Analyzer getFieldAnalyzer();

Two methods would have the advantage that it would be immediately traceable which mappers register multiple fields, if that matters.

Another option would be to create a FieldMapper subclass called AnalyzedFieldMapper or similar that makes this abstract, and then have a default impl on FieldMapper itself that does nothing?

That could also work, but I would still find it weird that we need a BiConsumer as an argument.

I've changed this to return a Map instead.

server/src/main/java/org/elasticsearch/index/mapper/MapperService.java

romseygeek · 2020-10-29T14:22:31Z

@elasticmachine run elasticsearch-ci/1
@elasticmachine run elasticsearch-ci/2

romseygeek · 2020-10-29T16:12:33Z

@elasticmachine run elasticsearch-ci/packaging-sample-windows

javanna

I left a small comment, LGTM otherwise

javanna · 2020-10-30T08:35:52Z

modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/RankFeaturesFieldMapper.java

@@ -161,4 +161,9 @@ protected String contentType() {
        return CONTENT_TYPE;
    }

+    @Override
+    public Map<String, NamedAnalyzer> indexAnalyzers() {
+        return Collections.singletonMap(name(), Lucene.KEYWORD_ANALYZER);


Should we have an easier path for this common scenario, where there is only one analyzer and it is registered with the same name as the field? The mapper in this case only needs to expose which analyzer it is. Having two paths is not optimal, but I think it would help highlight which mappers are different in that they provide multiple analyzers.

server/src/main/java/org/elasticsearch/index/mapper/MapperService.java

romseygeek · 2020-11-02T17:01:52Z

@javanna I've updated things to simplify implementations. You can now pass either a map of field names to analyzers, or a single analyzer, to the FieldMapper super constructor. Mappers that don't use the terms index don't pass anything and keep their existing implementations. Mappers that have a single field analyzer just pass the analyzer, and we automatically wrap it with the field name. Mappers that have subfields (currently only text and search_as_you_type, but I can see others, e.g. parent-join, using this as well) collect all their analyzers into a Map and pass that.
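
The three constructor paths described in this comment could look roughly like the sketch below. Everything here is a hypothetical stand-in written for illustration (`SketchNamedAnalyzer`, `SketchFieldMapper`, the three concrete mappers, and the `._prefix` sub-field name); the actual FieldMapper constructors take more arguments and are not reproduced here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for NamedAnalyzer.
final class SketchNamedAnalyzer {
    final String name;
    SketchNamedAnalyzer(String name) { this.name = name; }
}

// Hypothetical, heavily simplified FieldMapper.
abstract class SketchFieldMapper {
    private final String name;
    private final Map<String, SketchNamedAnalyzer> indexAnalyzers;

    // Mappers that don't use the terms index pass nothing.
    protected SketchFieldMapper(String name) {
        this(name, Collections.<String, SketchNamedAnalyzer>emptyMap());
    }

    // Mappers with a single analyzer pass it directly; it is keyed by the field name.
    protected SketchFieldMapper(String name, SketchNamedAnalyzer analyzer) {
        this(name, Collections.singletonMap(name, analyzer));
    }

    // Mappers with sub-fields collect all their analyzers into a map.
    protected SketchFieldMapper(String name, Map<String, SketchNamedAnalyzer> indexAnalyzers) {
        this.name = name;
        this.indexAnalyzers = Collections.unmodifiableMap(new HashMap<>(indexAnalyzers));
    }

    final String name() { return name; }

    final Map<String, SketchNamedAnalyzer> indexAnalyzers() { return indexAnalyzers; }
}

final class SketchNumericMapper extends SketchFieldMapper {
    SketchNumericMapper(String name) { super(name); }
}

final class SketchKeywordMapper extends SketchFieldMapper {
    SketchKeywordMapper(String name) { super(name, new SketchNamedAnalyzer("keyword")); }
}

final class SketchTextMapper extends SketchFieldMapper {
    SketchTextMapper(String name) { super(name, analyzersFor(name)); }

    private static Map<String, SketchNamedAnalyzer> analyzersFor(String name) {
        Map<String, SketchNamedAnalyzer> analyzers = new HashMap<>();
        analyzers.put(name, new SketchNamedAnalyzer("standard"));
        // Illustrative sub-field name only.
        analyzers.put(name + "._prefix", new SketchNamedAnalyzer("prefix"));
        return analyzers;
    }
}
```

With this shape, a mapper that registers more than one analyzer is easy to spot because it is the only kind that builds a map, which is the traceability concern raised earlier in the review.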

romseygeek · 2020-11-03T18:10:14Z

@elasticmachine update branch

romseygeek · 2020-11-04T09:14:44Z

@elasticmachine update branch

javanna · 2020-11-04T10:32:59Z

...r/src/main/java/org/elasticsearch/search/fetch/subphase/highlight/FastVectorHighlighter.java

@@ -72,6 +72,7 @@ public HighlightField highlight(FieldHighlightContext fieldContext) throws IOExc
        FetchSubPhase.HitContext hitContext = fieldContext.hitContext;
        MappedFieldType fieldType = fieldContext.fieldType;
        boolean forceSource = fieldContext.forceSource;
+        boolean fixBrokenAnalysis = fieldContext.context.mapperService().containsBrokenAnalysis(fieldContext.fieldName);


this is introducing another usage of FetchContext#mapperService() that we'd like to remove :( could we find a way around it?

I added a delegator method, FetchContext#containsBrokenAnalysis()

javanna

I left one comment about a new usage of FetchContext#mapperService() , LGTM otherwise


javanna · 2020-11-04T12:01:44Z

server/src/main/java/org/elasticsearch/search/fetch/FetchContext.java

+     * backwards offsets in term vectors
+     */
+    public boolean containsBrokenAnalysis(String field) {
+        return mapperService().containsBrokenAnalysis(field);


this also calls mapperService() :) can you add the method to QueryShardContext and call it from there?

…63937) Index-time analyzers are currently specified on the MappedFieldType. This has a number of unfortunate consequences; for example, field mappers that index data into implementation sub-fields, such as prefix or phrase accelerators on text fields, need to expose these sub-fields as MappedFieldTypes, which means that they then appear in field caps, are externally searchable, etc. It also adds index-time logic to a class that should only be concerned with search-time behaviour. This commit removes references to the index analyzer from MappedFieldType. Instead, FieldMappers that use the terms index can pass either a single analyzer or a Map of fields to analyzers to their super constructor, which are then exposed via a new FieldMapper#indexAnalyzers() method; all index-time analysis is mediated through the delegating analyzer wrapper on MapperService. In a follow-up, this will make it possible to register multiple field analyzers from a single FieldMapper, removing the need for 'hidden' mapper implementations on text field, parent joins, and elsewhere.

…64592) Index-time analyzers are currently specified on the MappedFieldType. This has a number of unfortunate consequences; for example, field mappers that index data into implementation sub-fields, such as prefix or phrase accelerators on text fields, need to expose these sub-fields as MappedFieldTypes, which means that they then appear in field caps, are externally searchable, etc. It also adds index-time logic to a class that should only be concerned with search-time behaviour. This commit removes references to the index analyzer from MappedFieldType. Instead, FieldMappers that use the terms index can pass either a single analyzer or a Map of fields to analyzers to their super constructor, which are then exposed via a new FieldMapper#indexAnalyzers() method; all index-time analysis is mediated through the delegating analyzer wrapper on MapperService. In a follow-up, this will make it possible to register multiple field analyzers from a single FieldMapper, removing the need for 'hidden' mapper implementations on text field, parent joins, and elsewhere.

romseygeek added 2 commits October 20, 2020 14:00

Centralize index analyzer management

34977e4

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

3d8dfc7

romseygeek added :Search Foundations/Mapping Index mappings, including merging and defining field types >refactoring v8.0.0 v7.11.0 labels Oct 20, 2020

romseygeek requested review from javanna and jtibshirani October 20, 2020 13:12

romseygeek self-assigned this Oct 20, 2020

elasticmachine added the Team:Search Meta label for search team label Oct 20, 2020

romseygeek mentioned this pull request Oct 20, 2020

Move indexAnalyzer to FieldTypeLookup #63932

Closed

romseygeek commented Oct 20, 2020


romseygeek added 5 commits October 20, 2020 14:39

precommit

d098bcd

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

3a8807a

delegating analyzers

4cccf09

precommit

c9437df

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

1885058

position increments on annotations

32382ad

javanna requested changes Oct 21, 2020


romseygeek added 8 commits October 26, 2020 16:21

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

0b963be

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

d6adf80

Move to returning maps

f1b8095

imports

bb81e25

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

2585186

Make everything a NamedAnalyzer

4f3da1d

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

be55995

warnings

73044b5

romseygeek added 2 commits October 29, 2020 16:30

Collapse ParametrizedFieldMapper into FieldMapper

c04c7d1

SAYT shouldn't try to add fields if index=false and store=false

639126e

javanna reviewed Oct 30, 2020


romseygeek added 5 commits November 2, 2020 12:13

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

feea031

Merge branch 'mapper/fieldmapper' into mapper/indexanalyzer

9ee7dda

Make indexAnalyzers passed via constructor

88ad7e5

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

040ec02

SearchAsYouType tests

0d2a081

romseygeek requested a review from javanna November 2, 2020 17:02

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

7b28b2c

Merge branch 'master' into mapper/indexanalyzer

22357b7

Merge branch 'master' into mapper/indexanalyzer

4793097

javanna reviewed Nov 4, 2020


romseygeek added 3 commits November 4, 2020 11:26

Merge remote-tracking branch 'origin/master' into mapper/indexanalyzer

a9051a4

Merge remote-tracking branch 'romseygeek/mapper/indexanalyzer' into m…

342525c

…apper/indexanalyzer

Don't call FetchContext#mapperService

5099a60

javanna reviewed Nov 4, 2020


No really, don't use mapperservice

e4f7882

javanna approved these changes Nov 4, 2020


romseygeek merged commit f010269 into elastic:master Nov 4, 2020

romseygeek deleted the mapper/indexanalyzer branch November 4, 2020 13:53

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
