Enable reloading of search time analyzers #40782

cbuescher · 2019-04-03T13:46:38Z

A WIP PR that adds the ability to reload search-time analyzers upon user request. Currently this PR still piggybacks on a refresh action but on a final version this should happen through a dedicated new endpoint. The current approach re-builds part of the IndexAnalyzers (the analyzers that are marked as search-time analyzers). This rebuild is currently done in MapperService since it needs access to the current IndexAnalyzers, the rebuild itself is delegated to AnalysisRegistry because access to the required analysis component providers is available there.
A single-node integration test shows how this is working in the case of an updateable synonym filter when the underlying synonyms-file is changed.

Things that I assume are still needed:

adding dedicated unit test for the newly introduced methods (skipping this as long as the current approach isn't final)
adding rest/multi node IT
adding a new "reloadSearchAnalysis" endpoint instead of piggybacking on refresh

Opening as WIP regardless to get some feedback.

Note: some of the changes here are part of an open refactoring PR (#40609) which I hope to merge soon

elasticmachine · 2019-04-03T13:46:40Z

Pinging @elastic/es-search

cbuescher · 2019-04-03T13:47:45Z

server/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

-            return getTokenizerProvider(tokenizer);
-        }
+        return getProvider(Component.TOKENIZER, tokenizer, indexSettings, "index.analysis.tokenizer", tokenizers,
+                this::getTokenizerProvider);


These are just refactoring that can eventually be pulled out of this PR

cbuescher · 2019-04-03T13:48:52Z

server/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

+     * Create an new IndexAnalyzer instance based on the existing one. Analyzers that are in {@link AnalysisMode#SEARCH_TIME} are tried to
+     * be reloaded. All other analyzers are reused from the old {@link IndexAnalyzers} instance.
+     */
+    public IndexAnalyzers rebuildIndexAnalyzers(IndexAnalyzers indexAnalyzers, IndexSettings indexSettings) throws IOException {


This and the following method are the main places where reloading happens

I wonder if we should collect the list of filters, char_filters and tokenizers that need a rebuild in all search analyzer first. The current solution seems to rebuild the filters once per analyzer instead of only once.

Ok I think I understand why I got confused. Synonym token filters are special because they delay the build of the synonym map until all other filters are known. This is the reason why your code is working even though you're not rebuilding the token and char filter factories, the synonym map is rebuild when the custom analyzer is finally built.
However if we want to make the update work on other type of filters we'll need to rebuild the factories first and then use these new factories to build the custom analyzers.

cbuescher · 2019-04-03T13:52:02Z

server/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

+                    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
+            final Map<String, Settings> charFiltersSettings = indexSettings.getSettings().getGroups(INDEX_ANALYSIS_CHAR_FILTER);
+            final Map<String, CharFilterFactory> charFilters = buildMapping(Component.CHAR_FILTER, indexSettings,
+                    charFiltersSettings, charFilterProvider, Collections.emptyMap());


I hope the above can still be reduced, I was e.g. toying with the ides of adding references to the Map<String, XYZFactory> for tokenizers and charfilters to IndexAnalyzers to re-use them when rebuilding. we create them when initially creating IndexAnalyzers in AnalysisRegistry#build but then throw them away. Since they shouldn't change over time (at least the tokenizer and char filters, maybe also part of the token filters) we could reuse them if they were available via IndexAnalyzers. I didn't go down this road yet though.

jimczi

I left some comments and questions but it looks good @cbuescher . Regarding the API to reload the search analyzers, we discussed with Adrien and agreed that we could start with an experimental endpoint (_reload_search_analyzer or less verbose ;) ). The idea is to expose this feature through this API for the moment and revise this methodology later when we refactor updates on analyzer settings through a regular mapping update.

server/src/main/java/org/elasticsearch/index/mapper/MapperService.java

jimczi · 2019-04-09T07:58:15Z

server/src/main/java/org/elasticsearch/index/analysis/IndexAnalyzers.java

-        this.whitespaceNormalizers = unmodifiableMap(whitespaceNormalizers);
+        this.analyzers =  unmodifiableMap(new HashMap<>(analyzers));
+        this.normalizers = unmodifiableMap(new HashMap<>(normalizers));
+        this.whitespaceNormalizers = unmodifiableMap(new HashMap<>(whitespaceNormalizers));


Why is it needed to copy the map (e.g. new HashMap) since you never change the inner maps ?

I added this to make sure whatever class is passing in the analyzers/normalizers/whitespaceNormalizers map cannot hold on to a rerence to the map and modify it after creation. As far as I can see we currently don't do this (e.g. in AnalysisRegistry#build the input maps are all discarded after use), but this is a defensive copy to prevent any accidental references to the underlying maps leaking. I think this makes reasoning about the behaviour of this class much sager and shouldn't add too much overhead.

server/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

jimczi · 2019-04-09T08:02:58Z

server/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

+     * Create an new IndexAnalyzer instance based on the existing one. Analyzers that are in {@link AnalysisMode#SEARCH_TIME} are tried to
+     * be reloaded. All other analyzers are reused from the old {@link IndexAnalyzers} instance.
+     */
+    public IndexAnalyzers rebuildIndexAnalyzers(IndexAnalyzers indexAnalyzers, IndexSettings indexSettings) throws IOException {


I wonder if we should collect the list of filters, char_filters and tokenizers that need a rebuild in all search analyzer first. The current solution seems to rebuild the filters once per analyzer instead of only once.

server/src/main/java/org/elasticsearch/index/mapper/MapperService.java

server/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

cbuescher · 2019-04-15T14:04:47Z

@jimczi I adressed some comments of yours. Wrt rebuilding the AnalysisProviders needed to re-create the analyzer I opted for another idea I mentioned earlier which keeps the map of AnalysisProviders used to initially create the IndexAnalyzers around as part of it. This means we should only have to re-build the searchtime analyzers and the filters etc... they contain.
I will work on the tests and the custom endpoint next if the general approach looks okay to you.

cbuescher · 2019-05-22T13:35:49Z

server/src/main/java/org/elasticsearch/index/analysis/CustomAnalyzer.java

+    /**
+     * TODO: We should not expose functions that return objects from the <code>current</code>,
+     * only the full {@link AnalyzerComponents} should be returned
+     */


@jimczi I left this comment of yours because I'm not sure what would change if we returned only the full AnalyzerComponents. Clients that now usse the following getters would still need to access indovidual components on it with the potential of changing their state, or am I missing something? Also I think it would be good to keep the API of CustomAnalyzer stable for now since it might be one of the more popular APIs for use in custom plugins etc..?

We change the entire AnalyzerComponents on update so the idea exposed here is that users should not access the underlying object individually. Instead they should retrieve the AnalyzerComponents once (it is a volatile object) and work with the underlying objects since they are final:

final AnalyzerComponents components = analyzer.getComponents(); // we need charFilters and tokenFilters to do something so we need to ensure that they come // from the same component mixFilters(components.charFilters, components.tokenFilters);

cbuescher · 2019-05-22T13:51:40Z

@jimczi I pushed a few commits removing code we don't need with your solution, also subclassing CustomAnalyzer with a ReloadableCustomAnalyzer. I toyed with the idea of simply keeping everything in CustomAnalyzer and not use the alternative ReloadStrategy and the ThreadLocal if the AnalysisMode != SEARCH. That would have the advantage of being able to keep CustomAnalyzer final and also hide the additional logic better I think. Let me know what you think, I can iterate on this again.

Some internal refactorings to the AnalysisRegistry, spin-off from elastic#40782.

Some internal refactorings to the AnalysisRegistry, spin-off from #40782.

Some internal refactorings to the AnalysisRegistry, spin-off from elastic#40782.

…xtend it

jimczi

I left some comments but I think it's ready for a more in-depth review by @jpountz and @romseygeek focusing on the custom reuse strategy that we use to circumvent the issue with the recycling of analyzers in Lucene.

jimczi · 2019-05-29T12:00:55Z

...va/org/elasticsearch/action/admin/indices/reloadanalyzer/TransportReloadAnalyzersAction.java

+                }
+            }
+        }
+        logger.info("Determined shards for reloading: " + shards);


jimczi · 2019-05-29T12:02:39Z

server/src/main/java/org/elasticsearch/index/analysis/CustomAnalyzerProvider.java

+            charFiltersList.toArray(new CharFilterFactory[charFiltersList.size()]),
+            tokenFilterList.toArray(new TokenFilterFactory[tokenFilterList.size()]),
+            positionIncrementGap,
+            offsetGap


positionIncrementGap and offsetGap should be pulled off from this function ?

jimczi · 2019-05-29T12:02:56Z

server/src/main/java/org/elasticsearch/index/analysis/CustomAnalyzerProvider.java

+
+        AnalyzerComponents(String tokenizerName, TokenizerFactory tokenizerFactory, CharFilterFactory[] charFilters,
+                           TokenFilterFactory[] tokenFilters,
+                           int positionIncrementGap, int offsetGap) {


Remove positionIncrementGap and offsetGap?

jimczi · 2019-05-29T12:05:24Z

server/src/main/java/org/elasticsearch/index/analysis/ReloadableCustomAnalyzer.java

+
+    private final int offsetGap;
+
+    private static final ReuseStrategy UPDATE_STRATEGY = new ReuseStrategy() {


Can you add a comment regarding why we need this strategy ?

jimczi · 2019-05-29T12:07:25Z

server/src/main/java/org/elasticsearch/index/analysis/ReloadableCustomAnalyzer.java

+
+    @Override
+    public TokenFilterFactory[] tokenFilters() {
+        return this.components.getTokenFilters();


This is dangerous since we shouldn't use these filters if we also need to access other members of the AnalyzerComponents. Why do you need this TokenFilterComposite? It seems that it is used only in one place so I wonder if we could remove this function and adapt the code to check for CustomAnalyzer or ReloadableCustomAnalyzer ?

I introduced the TokenFilterComposite in three places (NamedAnalyzer.checkAllowedInMode, FragmentBuilderHelper.containsBrokenAnalysis, PhraseSuggestionBuilder.getShingleFilterFactory) and one test where we otherwise would have to do perform two instanceof checks and have different access patterns for the CustomAnalyzer or ReloadableCustomAnalyzer case. In both cases what we really only need read access to the filters to check some of their properties.

This is dangerous since we shouldn't use these filters if we also need to access other members of the AnalyzerComponents

I understand, but in these cases we are only accessing the filters, no other components (and only on read), so I don't think its dangerous and simplifies the code. I'd like to keep it to keep the code simple, wdyt?

jimczi · 2019-05-29T12:08:15Z

server/src/main/java/org/elasticsearch/index/analysis/ReloadableCustomAnalyzer.java

+import java.io.Reader;
+import java.util.Map;
+
+public class ReloadableCustomAnalyzer extends Analyzer implements TokenFilterComposite {


let's make it final ?

cbuescher · 2019-05-29T12:52:23Z

@jimczi great, thanks. I pushed an update adressing most of your comments and one reply, will open a feature branch like we discussed and squash this branch down to a new PR agains that for further reviews to clean up the history of this a bit.

cbuescher · 2019-05-29T13:17:34Z

fyi I opened #42669 agains a new feature branch to be able to work in documentation in parallel to the general CI/Review cycle. If its okay lets move further discussion over there and I will close this WIP PR in a bit.

cbuescher · 2019-06-04T12:17:19Z

Continued in #42669 so closing here.

cbuescher added >enhancement WIP :Search Relevance/Analysis How text is split into tokens v8.0.0 labels Apr 3, 2019

cbuescher commented Apr 3, 2019

View reviewed changes

cbuescher force-pushed the updateable-synonyms branch from 0536a95 to de2e750 Compare April 4, 2019 22:27

WIP

750f676

cbuescher force-pushed the updateable-synonyms branch from de2e750 to 750f676 Compare April 8, 2019 18:56

jimczi reviewed Apr 9, 2019

View reviewed changes

Christoph Büscher added 3 commits April 15, 2019 12:20

Merge branch 'master' into updateable-synonyms

99a5e4a

adressing comments

2a7951b

Access analysis providers via IndexAnalyzers

07b8156

Christoph Büscher added 2 commits April 15, 2019 20:45

Merge branch 'master' into updateable-synonyms

6a5afce

Merge branch 'master' into updateable-synonyms

095f140

cbuescher force-pushed the updateable-synonyms branch from 647a145 to 67fa5d0 Compare April 17, 2019 17:10

Improve reloading token filters

3de3ec5

cbuescher force-pushed the updateable-synonyms branch from 67fa5d0 to 3de3ec5 Compare April 17, 2019 20:53

Christoph Büscher added 5 commits April 18, 2019 19:16

Adding AnalysisRegistry unit tests

5daca94

Merge branch 'master' into updateable-synonyms

1ba86f0

Fix small compile issue

97f1c58

Add unit test for AnalysisRegistry#reloadIndexAnalyzers

9aa1f28

More tests

abb6523

cbuescher force-pushed the updateable-synonyms branch from 1defe8d to abb6523 Compare April 24, 2019 11:00

Christoph Büscher added 2 commits April 24, 2019 16:35

Merge branch 'master' into updateable-synonyms

3f02fa5

Adding ReloadAnalyzerAction

66c7bfa

cbuescher force-pushed the updateable-synonyms branch from 521d605 to 66c7bfa Compare April 26, 2019 13:40

cbuescher commented May 22, 2019

View reviewed changes

Christoph Büscher added 3 commits May 22, 2019 15:39

Removing more stuff from IndexAnalyzers

d486308

Merge branch 'master' into updateable-synonyms

8084dc3

Some more cleanups

66091a2

Christoph Büscher added 5 commits May 22, 2019 18:35

Move getters into AnalyzerComponents

b4fb376

Merge branch 'master' into updateable-synonyms

e6d7fd8

Fix SynonymAnalyzerIT so it runs in gradle

ac82f93

Merge branch 'master' into updateable-synonyms

833b3a2

iter

5efac3d

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request May 24, 2019

Small internal AnalysisRegistry changes

ca46907

Some internal refactorings to the AnalysisRegistry, spin-off from elastic#40782.

cbuescher mentioned this pull request May 24, 2019

Small internal AnalysisRegistry changes #42500

Merged

cbuescher pushed a commit that referenced this pull request May 24, 2019

Small internal AnalysisRegistry changes (#42500)

cbf1150

Some internal refactorings to the AnalysisRegistry, spin-off from #40782.

cbuescher pushed a commit that referenced this pull request May 24, 2019

Small internal AnalysisRegistry changes (#42500)

12d5642

Some internal refactorings to the AnalysisRegistry, spin-off from #40782.

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019

Small internal AnalysisRegistry changes (elastic#42500)

997352a

Some internal refactorings to the AnalysisRegistry, spin-off from elastic#40782.

Christoph Büscher added 2 commits May 28, 2019 17:16

Merge branch 'master' into updateable-synonyms

0ec5ac9

Make ReloadableCustomAnalyzer sibling of CustomAnalyzer rather than e…

03fa799

…xtend it

cbuescher force-pushed the updateable-synonyms branch from 20ffcc8 to 03fa799 Compare May 28, 2019 19:59

Pull positionIncrement and offsetGap out of AnalyzerComponents

6d12084

jimczi reviewed May 29, 2019

View reviewed changes

Christoph Büscher added 2 commits May 29, 2019 14:42

Merge branch 'master' into updateable-synonyms

9a35f6a

iter

4affb0b

cbuescher closed this Jun 4, 2019

This was referenced Jun 4, 2019

Allow reloading of search time analyzers #42669

Merged

Allow reloading of search time analyzers #42888

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable reloading of search time analyzers #40782

Enable reloading of search time analyzers #40782

cbuescher commented Apr 3, 2019 •

edited

Loading

elasticmachine commented Apr 3, 2019

cbuescher Apr 3, 2019

cbuescher Apr 3, 2019

jimczi Apr 9, 2019

jimczi Apr 17, 2019

cbuescher Apr 3, 2019

jimczi left a comment

jimczi Apr 9, 2019

cbuescher Apr 15, 2019

jimczi Apr 9, 2019

cbuescher commented Apr 15, 2019

cbuescher May 22, 2019

jimczi May 22, 2019

cbuescher commented May 22, 2019

jimczi left a comment

jimczi May 29, 2019

jimczi May 29, 2019

jimczi May 29, 2019

jimczi May 29, 2019

jimczi May 29, 2019

cbuescher May 29, 2019

jimczi May 29, 2019

cbuescher commented May 29, 2019 •

edited

Loading

cbuescher commented May 29, 2019

cbuescher commented Jun 4, 2019


		private final int offsetGap;

		private static final ReuseStrategy UPDATE_STRATEGY = new ReuseStrategy() {

Enable reloading of search time analyzers #40782

Enable reloading of search time analyzers #40782

Conversation

cbuescher commented Apr 3, 2019 • edited Loading

elasticmachine commented Apr 3, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jimczi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbuescher commented Apr 15, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbuescher commented May 22, 2019

jimczi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbuescher commented May 29, 2019 • edited Loading

cbuescher commented May 29, 2019

cbuescher commented Jun 4, 2019

cbuescher commented Apr 3, 2019 •

edited

Loading

cbuescher commented May 29, 2019 •

edited

Loading