Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable reloading of search time analyzers #40782

Closed
wants to merge 33 commits into from

Conversation

cbuescher
Copy link
Member

@cbuescher cbuescher commented Apr 3, 2019

A WIP PR that adds the ability to reload search-time analyzers upon user request. Currently this PR still piggybacks on a refresh action but on a final version this should happen through a dedicated new endpoint. The current approach re-builds part of the IndexAnalyzers (the analyzers that are marked as search-time analyzers). This rebuild is currently done in MapperService since it needs access to the current IndexAnalyzers, the rebuild itself is delegated to AnalysisRegistry because access to the required analysis component providers is available there.
A single-node integration test shows how this is working in the case of an updateable synonym filter when the underlying synonyms-file is changed.

Things that I assume are still needed:

  • adding dedicated unit test for the newly introduced methods (skipping this as long as the current approach isn't final)
  • adding rest/multi node IT
  • adding a new "reloadSearchAnalysis" endpoint instead of piggybacking on refresh

Opening as WIP regardless to get some feedback.

Note: some of the changes here are part of an open refactoring PR (#40609) which I hope to merge soon

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

return getTokenizerProvider(tokenizer);
}
return getProvider(Component.TOKENIZER, tokenizer, indexSettings, "index.analysis.tokenizer", tokenizers,
this::getTokenizerProvider);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just refactoring that can eventually be pulled out of this PR

* Create an new IndexAnalyzer instance based on the existing one. Analyzers that are in {@link AnalysisMode#SEARCH_TIME} are tried to
* be reloaded. All other analyzers are reused from the old {@link IndexAnalyzers} instance.
*/
public IndexAnalyzers rebuildIndexAnalyzers(IndexAnalyzers indexAnalyzers, IndexSettings indexSettings) throws IOException {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the following method are the main places where reloading happens

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should collect the list of filters, char_filters and tokenizers that need a rebuild in all search analyzer first. The current solution seems to rebuild the filters once per analyzer instead of only once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think I understand why I got confused. Synonym token filters are special because they delay the build of the synonym map until all other filters are known. This is the reason why your code is working even though you're not rebuilding the token and char filter factories, the synonym map is rebuild when the custom analyzer is finally built.
However if we want to make the update work on other type of filters we'll need to rebuild the factories first and then use these new factories to build the custom analyzers.

.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
final Map<String, Settings> charFiltersSettings = indexSettings.getSettings().getGroups(INDEX_ANALYSIS_CHAR_FILTER);
final Map<String, CharFilterFactory> charFilters = buildMapping(Component.CHAR_FILTER, indexSettings,
charFiltersSettings, charFilterProvider, Collections.emptyMap());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope the above can still be reduced, I was e.g. toying with the ides of adding references to the Map<String, XYZFactory> for tokenizers and charfilters to IndexAnalyzers to re-use them when rebuilding. we create them when initially creating IndexAnalyzers in AnalysisRegistry#build but then throw them away. Since they shouldn't change over time (at least the tokenizer and char filters, maybe also part of the token filters) we could reuse them if they were available via IndexAnalyzers. I didn't go down this road yet though.

@cbuescher cbuescher force-pushed the updateable-synonyms branch from 0536a95 to de2e750 Compare April 4, 2019 22:27
@cbuescher cbuescher force-pushed the updateable-synonyms branch from de2e750 to 750f676 Compare April 8, 2019 18:56
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments and questions but it looks good @cbuescher . Regarding the API to reload the search analyzers, we discussed with Adrien and agreed that we could start with an experimental endpoint (_reload_search_analyzer or less verbose ;) ). The idea is to expose this feature through this API for the moment and revise this methodology later when we refactor updates on analyzer settings through a regular mapping update.

this.whitespaceNormalizers = unmodifiableMap(whitespaceNormalizers);
this.analyzers = unmodifiableMap(new HashMap<>(analyzers));
this.normalizers = unmodifiableMap(new HashMap<>(normalizers));
this.whitespaceNormalizers = unmodifiableMap(new HashMap<>(whitespaceNormalizers));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it needed to copy the map (e.g. new HashMap) since you never change the inner maps ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to make sure whatever class is passing in the analyzers/normalizers/whitespaceNormalizers map cannot hold on to a rerence to the map and modify it after creation. As far as I can see we currently don't do this (e.g. in AnalysisRegistry#build the input maps are all discarded after use), but this is a defensive copy to prevent any accidental references to the underlying maps leaking. I think this makes reasoning about the behaviour of this class much sager and shouldn't add too much overhead.

* Create an new IndexAnalyzer instance based on the existing one. Analyzers that are in {@link AnalysisMode#SEARCH_TIME} are tried to
* be reloaded. All other analyzers are reused from the old {@link IndexAnalyzers} instance.
*/
public IndexAnalyzers rebuildIndexAnalyzers(IndexAnalyzers indexAnalyzers, IndexSettings indexSettings) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should collect the list of filters, char_filters and tokenizers that need a rebuild in all search analyzer first. The current solution seems to rebuild the filters once per analyzer instead of only once.

@cbuescher
Copy link
Member Author

@jimczi I adressed some comments of yours. Wrt rebuilding the AnalysisProviders needed to re-create the analyzer I opted for another idea I mentioned earlier which keeps the map of AnalysisProviders used to initially create the IndexAnalyzers around as part of it. This means we should only have to re-build the searchtime analyzers and the filters etc... they contain.
I will work on the tests and the custom endpoint next if the general approach looks okay to you.

@cbuescher cbuescher force-pushed the updateable-synonyms branch from 647a145 to 67fa5d0 Compare April 17, 2019 17:10
@cbuescher cbuescher force-pushed the updateable-synonyms branch from 67fa5d0 to 3de3ec5 Compare April 17, 2019 20:53
@cbuescher cbuescher force-pushed the updateable-synonyms branch from 1defe8d to abb6523 Compare April 24, 2019 11:00
@cbuescher cbuescher force-pushed the updateable-synonyms branch from 521d605 to 66c7bfa Compare April 26, 2019 13:40
/**
* TODO: We should not expose functions that return objects from the <code>current</code>,
* only the full {@link AnalyzerComponents} should be returned
*/
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi I left this comment of yours because I'm not sure what would change if we returned only the full AnalyzerComponents. Clients that now usse the following getters would still need to access indovidual components on it with the potential of changing their state, or am I missing something? Also I think it would be good to keep the API of CustomAnalyzer stable for now since it might be one of the more popular APIs for use in custom plugins etc..?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We change the entire AnalyzerComponents on update so the idea exposed here is that users should not access the underlying object individually. Instead they should retrieve the AnalyzerComponents once (it is a volatile object) and work with the underlying objects since they are final:

  final AnalyzerComponents components = analyzer.getComponents();
  // we need charFilters and tokenFilters to do something so we need to ensure that they come 
  // from the same component
  mixFilters(components.charFilters, components.tokenFilters);

@cbuescher
Copy link
Member Author

@jimczi I pushed a few commits removing code we don't need with your solution, also subclassing CustomAnalyzer with a ReloadableCustomAnalyzer. I toyed with the idea of simply keeping everything in CustomAnalyzer and not use the alternative ReloadStrategy and the ThreadLocal if the AnalysisMode != SEARCH. That would have the advantage of being able to keep CustomAnalyzer final and also hide the additional logic better I think. Let me know what you think, I can iterate on this again.

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request May 24, 2019
Some internal refactorings to the AnalysisRegistry, spin-off from elastic#40782.
cbuescher pushed a commit that referenced this pull request May 24, 2019
Some internal refactorings to the AnalysisRegistry, spin-off from #40782.
cbuescher pushed a commit that referenced this pull request May 24, 2019
Some internal refactorings to the AnalysisRegistry, spin-off from #40782.
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
Some internal refactorings to the AnalysisRegistry, spin-off from elastic#40782.
@cbuescher cbuescher force-pushed the updateable-synonyms branch from 20ffcc8 to 03fa799 Compare May 28, 2019 19:59
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments but I think it's ready for a more in-depth review by @jpountz and @romseygeek focusing on the custom reuse strategy that we use to circumvent the issue with the recycling of analyzers in Lucene.

}
}
}
logger.info("Determined shards for reloading: " + shards);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover ?

charFiltersList.toArray(new CharFilterFactory[charFiltersList.size()]),
tokenFilterList.toArray(new TokenFilterFactory[tokenFilterList.size()]),
positionIncrementGap,
offsetGap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

positionIncrementGap and offsetGap should be pulled off from this function ?


AnalyzerComponents(String tokenizerName, TokenizerFactory tokenizerFactory, CharFilterFactory[] charFilters,
TokenFilterFactory[] tokenFilters,
int positionIncrementGap, int offsetGap) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove positionIncrementGap and offsetGap?


private final int offsetGap;

private static final ReuseStrategy UPDATE_STRATEGY = new ReuseStrategy() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment regarding why we need this strategy ?


@Override
public TokenFilterFactory[] tokenFilters() {
return this.components.getTokenFilters();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dangerous since we shouldn't use these filters if we also need to access other members of the AnalyzerComponents. Why do you need this TokenFilterComposite? It seems that it is used only in one place so I wonder if we could remove this function and adapt the code to check for CustomAnalyzer or ReloadableCustomAnalyzer ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced the TokenFilterComposite in three places (NamedAnalyzer.checkAllowedInMode, FragmentBuilderHelper.containsBrokenAnalysis, PhraseSuggestionBuilder.getShingleFilterFactory) and one test where we otherwise would have to do perform two instanceof checks and have different access patterns for the CustomAnalyzer or ReloadableCustomAnalyzer case. In both cases what we really only need read access to the filters to check some of their properties.

This is dangerous since we shouldn't use these filters if we also need to access other members of the AnalyzerComponents

I understand, but in these cases we are only accessing the filters, no other components (and only on read), so I don't think its dangerous and simplifies the code. I'd like to keep it to keep the code simple, wdyt?

import java.io.Reader;
import java.util.Map;

public class ReloadableCustomAnalyzer extends Analyzer implements TokenFilterComposite {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make it final ?

@cbuescher
Copy link
Member Author

cbuescher commented May 29, 2019

@jimczi great, thanks. I pushed an update adressing most of your comments and one reply, will open a feature branch like we discussed and squash this branch down to a new PR agains that for further reviews to clean up the history of this a bit.

@cbuescher
Copy link
Member Author

fyi I opened #42669 agains a new feature branch to be able to work in documentation in parallel to the general CI/Review cycle. If its okay lets move further discussion over there and I will close this WIP PR in a bit.

@cbuescher
Copy link
Member Author

Continued in #42669 so closing here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants