Added default stopwords and link to Lucene #41173
Conversation
Pinging @elastic/es-search
list of stop words. Defaults to `_english_`.

`stopwords_path`::

The path to a file containing stop words. This path is relative to the
Elasticsearch `config` directory.

The default English stopwords used in Elasticsearch are:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
Is this still the case?
I know Lucene has a file of EN stop words here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt
Might be worth checking with an engineer in @elastic/es-search.
It may also make more sense to move this section under the stopwords
param, or link to this section from there.
The list looks good. The file you linked is for the snowball filters and analyzers, but we don't use them in the stop
filter. The default list can be found here: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L46
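To illustrate the two parameters quoted above, here is a minimal sketch of index settings that uses both the predefined `_english_` list and a file-based list. The index name, analyzer names, and file path are placeholders, not part of this PR:

[source,console]
----
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_english_stop": {
          "type": "stop",
          "stopwords": "_english_"                          <1>
        },
        "file_based_stop": {
          "type": "stop",
          "stopwords_path": "analysis/custom_stopwords.txt" <2>
        }
      }
    }
  }
}
----
<1> Uses the predefined `_english_` list described in this section.
<2> Resolved relative to the Elasticsearch `config` directory; the file name is only an example.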
for that language.

Stop words in other supported languages can be accessed in Lucene:
https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis[Stopwords]
It might be worth mentioning that some of these are under the snowball
path:
https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball
The list of stopwords for each language is different from snowball. You can find the entire list here:
https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/analysis/Analysis.java#L115
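For reference, those per-language lists are exposed in Elasticsearch as `_lang_`-style names that the `stop` token filter accepts. A sketch, with made-up index, filter, and analyzer names:

[source,console]
----
PUT /german-example
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"            <1>
        }
      },
      "analyzer": {
        "german_text": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "german_stop" ]
        }
      }
    }
  }
}
----
<1> `_german_` resolves to the German stopwords list bundled with Lucene.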
words, you can reduce the size of an index and increase performance, which is
important when disk space and memory are critical factors.

[float]
Probably want to add an anchor here like [[stop-analyzer-example-output]]
Stop words are common words set not to be indexed. By removing a list of common
words, you can reduce the size of an index and increase performance, which is
important when disk space and memory are critical factors.

This section is a bit misleading, since we added optimizations in Lucene 8 to speed up queries that contain stop words. Removing stop words is also challenging for other token filters like `synonym_graph` and `synonym`. Now that we can run queries with stop words efficiently, I am not sure we should give such advice in the docs. Maybe we could just mention the disk space savings and add a note that search is faster when the total hit count for the query is not requested (which makes removing stop words unnecessary)?
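As a rough sketch of the "total hits not requested" case mentioned in the comment above, a search can turn off exact hit counting with `track_total_hits`; the index and field names below are placeholders:

[source,console]
----
GET /my-index/_search
{
  "track_total_hits": false,           <1>
  "query": {
    "match": {
      "body": "to be or not to be"     <2>
    }
  }
}
----
<1> With hit counting disabled, the Lucene 8 optimizations referred to above can skip non-competitive documents, so removing stop words at index time is less necessary for query speed.
<2> `body` is just an example field; this query text consists almost entirely of default English stop words.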
they, this, to, was, will, with

==== Stop words in supported languages
Add an anchor here as well.
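To see the default `_english_` list above in action, the `_analyze` API can be run against the built-in `stop` analyzer. A sketch; the sample text is arbitrary:

[source,console]
----
POST /_analyze
{
  "analyzer": "stop",
  "text": "it is a quick brown fox"
}
----

The response should contain only the tokens `quick`, `brown`, and `fox`, since `it`, `is`, and `a` are all on the default English list.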
Each language analyzer defaults to using the appropriate stopwords list
for that language.

Stop words in other supported languages can be accessed in Lucene:
Passive voice
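The quoted sentence ("Each language analyzer defaults to using the appropriate stopwords list for that language") could be backed by a small example. In the sketch below (index and analyzer names are made up) the `french` analyzer uses its built-in French list by default, and stop word removal can be switched off with `_none_`:

[source,console]
----
PUT /french-example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "french_default": {
          "type": "french"              <1>
        },
        "french_keep_stopwords": {
          "type": "french",
          "stopwords": "_none_"         <2>
        }
      }
    }
  }
}
----
<1> Uses the built-in French stopwords list by default.
<2> `_none_` disables stop word removal for this analyzer.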
list of stop words. Defaults to `_english_`.

`stopwords_path`::

The path to a file containing stop words. This path is relative to the
Elasticsearch `config` directory.

The default English stopwords used in Elasticsearch are:
Passive voice
LGTM if you're confident the default English stop words are accurate.
Made some minor suggestions and comments, but those are not blockers.
Thanks @kat257, I left some additional comments
Closes #16561