Enable reloading of search time analyzers #40782
Conversation
Pinging @elastic/es-search
return getTokenizerProvider(tokenizer);
}
return getProvider(Component.TOKENIZER, tokenizer, indexSettings, "index.analysis.tokenizer", tokenizers,
        this::getTokenizerProvider);
These are just refactorings that can eventually be pulled out of this PR.
 * Create a new IndexAnalyzers instance based on the existing one. Analyzers that are in {@link AnalysisMode#SEARCH_TIME} are
 * reloaded where possible. All other analyzers are reused from the old {@link IndexAnalyzers} instance.
 */
public IndexAnalyzers rebuildIndexAnalyzers(IndexAnalyzers indexAnalyzers, IndexSettings indexSettings) throws IOException {
This and the following method are the main places where reloading happens
I wonder if we should collect the list of filters, char_filters and tokenizers that need a rebuild across all search analyzers first. The current solution seems to rebuild the filters once per analyzer instead of only once.
OK, I think I understand why I got confused. Synonym token filters are special because they delay building the synonym map until all other filters are known. This is why your code works even though you're not rebuilding the token and char filter factories: the synonym map is rebuilt when the custom analyzer is finally built.
However, if we want to make the update work on other types of filters, we'll need to rebuild the factories first and then use these new factories to build the custom analyzers.
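To illustrate the order of operations being suggested here, a rough sketch (the rebuild* and buildCustomAnalyzer helpers are hypothetical, this is not the PR's code): rebuild the shared factories once from the current settings, then rebuild each search-time analyzer from those new factories, so that non-synonym filters pick up changes as well.

static Map<String, NamedAnalyzer> rebuildSearchTimeAnalyzers(IndexSettings indexSettings,
        Collection<String> searchTimeAnalyzerNames) throws IOException {
    // rebuild the factories once, not once per analyzer
    Map<String, TokenizerFactory> tokenizers = rebuildTokenizerFactories(indexSettings);        // hypothetical helper
    Map<String, CharFilterFactory> charFilters = rebuildCharFilterFactories(indexSettings);     // hypothetical helper
    Map<String, TokenFilterFactory> tokenFilters = rebuildTokenFilterFactories(indexSettings);  // hypothetical helper

    Map<String, NamedAnalyzer> rebuilt = new HashMap<>();
    for (String name : searchTimeAnalyzerNames) {
        // only analyzers in AnalysisMode.SEARCH_TIME are rebuilt, everything else is reused
        rebuilt.put(name, buildCustomAnalyzer(name, indexSettings, tokenizers, charFilters, tokenFilters)); // hypothetical helper
    }
    return rebuilt;
}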
.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
final Map<String, Settings> charFiltersSettings = indexSettings.getSettings().getGroups(INDEX_ANALYSIS_CHAR_FILTER);
final Map<String, CharFilterFactory> charFilters = buildMapping(Component.CHAR_FILTER, indexSettings,
        charFiltersSettings, charFilterProvider, Collections.emptyMap());
I hope the above can still be reduced. I was, for example, toying with the idea of adding references to the Map<String, XYZFactory> maps for tokenizers and char filters to IndexAnalyzers to re-use them when rebuilding. We create them when initially creating IndexAnalyzers in AnalysisRegistry#build but then throw them away. Since they shouldn't change over time (at least the tokenizers and char filters, maybe also part of the token filters) we could reuse them if they were available via IndexAnalyzers. I didn't go down this road yet though.
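A sketch of that idea (the extra fields and accessors are assumptions, not the PR's code): IndexAnalyzers would simply keep the factory maps that AnalysisRegistry#build already creates, so a later rebuild can reuse them instead of recreating components that cannot change at search time.

public final class IndexAnalyzers {
    private final Map<String, NamedAnalyzer> analyzers;
    // assumed new fields: keep the factories created in AnalysisRegistry#build instead of throwing them away
    private final Map<String, TokenizerFactory> tokenizerFactories;
    private final Map<String, CharFilterFactory> charFilterFactories;

    IndexAnalyzers(Map<String, NamedAnalyzer> analyzers, Map<String, TokenizerFactory> tokenizerFactories,
            Map<String, CharFilterFactory> charFilterFactories) {
        this.analyzers = unmodifiableMap(new HashMap<>(analyzers));
        this.tokenizerFactories = unmodifiableMap(new HashMap<>(tokenizerFactories));
        this.charFilterFactories = unmodifiableMap(new HashMap<>(charFilterFactories));
    }

    // reused when rebuilding the search-time analyzers
    Map<String, TokenizerFactory> getTokenizerFactories() {
        return tokenizerFactories;
    }

    Map<String, CharFilterFactory> getCharFilterFactories() {
        return charFilterFactories;
    }
}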
I left some comments and questions but it looks good @cbuescher. Regarding the API to reload the search analyzers, we discussed this with Adrien and agreed that we could start with an experimental endpoint (_reload_search_analyzer, or something less verbose ;) ). The idea is to expose this feature through this API for the moment and revisit the approach later when we refactor updates to analyzer settings through a regular mapping update.
this.analyzers = unmodifiableMap(new HashMap<>(analyzers));
this.normalizers = unmodifiableMap(new HashMap<>(normalizers));
this.whitespaceNormalizers = unmodifiableMap(new HashMap<>(whitespaceNormalizers));
Why is it needed to copy the map (e.g. new HashMap<>) since you never change the inner maps?
I added this to make sure whatever class passes in the analyzers/normalizers/whitespaceNormalizers maps cannot hold on to a reference to a map and modify it after creation. As far as I can see we currently don't do this (e.g. in AnalysisRegistry#build the input maps are all discarded after use), but this is a defensive copy to prevent any accidental references to the underlying maps from leaking. I think this makes reasoning about the behaviour of this class much safer and shouldn't add too much overhead.
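For illustration, the kind of leak the copy guards against (purely illustrative; it assumes the three-map constructor quoted above, and defaultAnalyzer, normalizers and whitespaceNormalizers stand for whatever the caller already has):

Map<String, NamedAnalyzer> analyzers = new HashMap<>();
analyzers.put("default", defaultAnalyzer);
IndexAnalyzers indexAnalyzers = new IndexAnalyzers(analyzers, normalizers, whitespaceNormalizers);
// with unmodifiableMap(analyzers) alone, the next line would also remove the entry from indexAnalyzers;
// the defensive new HashMap<>(analyzers) copy makes the caller's mutation invisible to it
analyzers.remove("default");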
@jimczi I addressed some of your comments. Regarding rebuilding the AnalysisProviders needed to re-create the analyzers, I opted for another idea I mentioned earlier which keeps the map of AnalysisProviders used to initially create the IndexAnalyzers around as part of it. This means we should only have to re-build the search-time analyzers and the filters etc. they contain.
/**
 * TODO: We should not expose functions that return objects from the <code>current</code>,
 * only the full {@link AnalyzerComponents} should be returned
 */
@jimczi I left this comment of yours in because I'm not sure what would change if we returned only the full AnalyzerComponents. Clients that now use the following getters would still need to access individual components on it, with the potential of changing their state, or am I missing something? Also I think it would be good to keep the API of CustomAnalyzer stable for now since it might be one of the more popular APIs for use in custom plugins etc.?
We change the entire AnalyzerComponents on update, so the idea exposed here is that users should not access the underlying objects individually. Instead they should retrieve the AnalyzerComponents once (it is a volatile reference) and work with the underlying objects since they are final:
final AnalyzerComponents components = analyzer.getComponents();
// we need charFilters and tokenFilters to do something so we need to ensure that they come
// from the same component
mixFilters(components.charFilters, components.tokenFilters);
@jimczi I pushed a few commits removing code we don't need with your solution, also subclassing CustomAnalyzer with a ReloadableCustomAnalyzer. I toyed with the idea of simply keeping everything in CustomAnalyzer and not using the alternative ReuseStrategy and the ThreadLocal if the AnalysisMode != SEARCH. That would have the advantage of being able to keep CustomAnalyzer final and also hide the additional logic better, I think. Let me know what you think, I can iterate on this again.
Some internal refactorings to the AnalysisRegistry, spin-off from #40782.
I left some comments but I think it's ready for a more in-depth review by @jpountz and @romseygeek focusing on the custom reuse strategy that we use to circumvent the issue with the recycling of analyzers in Lucene.
        }
    }
}
logger.info("Determined shards for reloading: " + shards);
leftover?
charFiltersList.toArray(new CharFilterFactory[charFiltersList.size()]),
tokenFilterList.toArray(new TokenFilterFactory[tokenFilterList.size()]),
positionIncrementGap,
offsetGap
positionIncrementGap and offsetGap should be pulled out of this function?
AnalyzerComponents(String tokenizerName, TokenizerFactory tokenizerFactory, CharFilterFactory[] charFilters,
        TokenFilterFactory[] tokenFilters,
        int positionIncrementGap, int offsetGap) {
Remove positionIncrementGap and offsetGap?
private final int offsetGap;
private static final ReuseStrategy UPDATE_STRATEGY = new ReuseStrategy() { |
Can you add a comment regarding why we need this strategy?
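For context, a minimal sketch of what such a strategy needs to do (not necessarily the PR's exact code; it assumes the analyzer exposes its current components via getComponents()): Lucene caches TokenStreamComponents per analyzer and thread through the ReuseStrategy, so after a reload the cache could still wrap the old filters. The sketch remembers which AnalyzerComponents the cached token stream was built from and forces a rebuild once they have been swapped.

private static final ReuseStrategy UPDATE_STRATEGY = new ReuseStrategy() {
    @Override
    public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
        AnalyzerComponents current = ((ReloadableCustomAnalyzer) analyzer).getComponents();
        Object[] cached = (Object[]) getStoredValue(analyzer); // {componentsBuiltFrom, tokenStreamComponents}
        if (cached == null || cached[0] != current) {
            return null; // the components were reloaded, discard the stale cached token stream
        }
        return (TokenStreamComponents) cached[1];
    }

    @Override
    public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
        AnalyzerComponents current = ((ReloadableCustomAnalyzer) analyzer).getComponents();
        setStoredValue(analyzer, new Object[] { current, components });
    }
};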
@Override
public TokenFilterFactory[] tokenFilters() {
    return this.components.getTokenFilters();
This is dangerous since we shouldn't use these filters if we also need to access other members of the AnalyzerComponents. Why do you need this TokenFilterComposite? It seems that it is used only in one place, so I wonder if we could remove this function and adapt the code to check for CustomAnalyzer or ReloadableCustomAnalyzer?
I introduced the TokenFilterComposite in three places (NamedAnalyzer.checkAllowedInMode, FragmentBuilderHelper.containsBrokenAnalysis, PhraseSuggestionBuilder.getShingleFilterFactory) and one test where we otherwise would have to perform two instanceof checks and have different access patterns for the CustomAnalyzer and ReloadableCustomAnalyzer cases. In all of these places we really only need read access to the filters to check some of their properties.
Regarding "This is dangerous since we shouldn't use these filters if we also need to access other members of the AnalyzerComponents": I understand, but in these cases we are only accessing the filters, no other components (and only for reading), so I don't think it's dangerous, and it simplifies the code. I'd like to keep it to keep the code simple, wdyt?
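For reference, a sketch of the shape being discussed (the interface name and the tokenFilters() accessor come from this PR, the caller is made up): both analyzer classes implement the interface, so read-only callers can inspect the token filters without two instanceof checks.

public interface TokenFilterComposite {
    TokenFilterFactory[] tokenFilters();
}

// caller side, e.g. checking a property of the filters regardless of the concrete analyzer class
static boolean anyFilterMatches(Analyzer analyzer, Predicate<TokenFilterFactory> property) {
    if (analyzer instanceof TokenFilterComposite) {
        for (TokenFilterFactory factory : ((TokenFilterComposite) analyzer).tokenFilters()) {
            if (property.test(factory)) {
                return true;
            }
        }
    }
    return false;
}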
import java.io.Reader;
import java.util.Map;

public class ReloadableCustomAnalyzer extends Analyzer implements TokenFilterComposite { |
Let's make it final?
@jimczi great, thanks. I pushed an update addressing most of your comments and one reply. I will open a feature branch like we discussed and squash this branch down into a new PR against that for further reviews, to clean up the history of this a bit.
FYI I opened #42669 against a new feature branch to be able to work on documentation in parallel to the general CI/review cycle. If it's okay, let's move further discussion over there and I will close this WIP PR in a bit.
Continued in #42669 so closing here. |
A WIP PR that adds the ability to reload search-time analyzers upon user request. Currently this PR still piggybacks on a refresh action, but in a final version this should happen through a dedicated new endpoint. The current approach re-builds part of the IndexAnalyzers (the analyzers that are marked as search-time analyzers). This rebuild is currently done in MapperService since it needs access to the current IndexAnalyzers; the rebuild itself is delegated to AnalysisRegistry because access to the required analysis component providers is available there.
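In code, the delegation roughly looks like this (a sketch only; the method name and its exact placement in MapperService are approximations, while rebuildIndexAnalyzers is the AnalysisRegistry method shown above):

// in MapperService: swap the current IndexAnalyzers for a rebuilt instance;
// the actual rebuilding of the search-time analyzers happens in AnalysisRegistry
public synchronized void reloadSearchAnalyzers(AnalysisRegistry registry, IndexSettings indexSettings) throws IOException {
    this.indexAnalyzers = registry.rebuildIndexAnalyzers(this.indexAnalyzers, indexSettings);
}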
A single-node integration test shows how this works in the case of an updateable synonym filter when the underlying synonyms file is changed.
Things that I assume are still needed:
Opening as WIP regardless to get some feedback.
Note: some of the changes here are part of an open refactoring PR (#40609) which I hope to merge soon