Added default stopwords and link to Lucene #41173

Closed

@kat257 (Contributor) commented Apr 12, 2019

Closes #16561

@kat257 added the `>docs` and `:Search Relevance/Analysis` labels Apr 12, 2019
@kat257 requested a review from debadair April 12, 2019 23:41
@elasticmachine (Collaborator)

Pinging @elastic/es-search

@kat257 requested review from jrodewig and removed request for debadair May 21, 2019 15:23
@kat257 self-assigned this May 21, 2019
list of stop words. Defaults to `_english_`.

`stopwords_path`::

The path to a file containing stop words. This path is relative to the
Elasticsearch `config` directory.
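
As a rough illustration of the two parameters described above, the following sketch builds an index-settings body that defines a `stop` analyzer. The analyzer name `my_stop_analyzer`, the helper `stop_analyzer_settings`, and the file path are hypothetical placeholders, not names taken from this pull request.

```python
import json


def stop_analyzer_settings(name, stopwords=None, stopwords_path=None):
    """Build an index-settings body that defines a `stop` analyzer.

    `name` and the helper itself are illustrative; only the `type`,
    `stopwords`, and `stopwords_path` keys come from the documented
    analyzer configuration.
    """
    analyzer = {"type": "stop"}
    if stopwords is not None:
        # Either a predefined set such as "_english_" or a list of words.
        analyzer["stopwords"] = stopwords
    if stopwords_path is not None:
        # Resolved relative to the Elasticsearch `config` directory.
        analyzer["stopwords_path"] = stopwords_path
    return {"settings": {"analysis": {"analyzer": {name: analyzer}}}}


# Hypothetical analyzer name and stop word file.
body = stop_analyzer_settings(
    "my_stop_analyzer", stopwords_path="analysis/my_stopwords.txt"
)
print(json.dumps(body, indent=2))
```

When neither parameter is passed, the body contains only `"type": "stop"`, which leaves the analyzer on its `_english_` default.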

The default English stopwords used in Elasticsearch are:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,

Is this still the case?

I know Lucene has a file of EN stop words here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt

Might be worth checking with an engineer in @elastic/es-search.

It also may make more sense to move this section under the stopwords param. Or link to this section from there.

The list looks good. The file you linked is for the snowball filters and analyzers, but we don't use them in the stop filter. The default list can be found here: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L46
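
To make the effect of the default list concrete, here is a minimal sketch of stop word removal. It is not the Lucene implementation: it assumes tokens are lowercased before the lookup, and `PARTIAL_ENGLISH_STOP_WORDS` holds only the words quoted in this thread, so it is shorter than the full default set. The function and constant names are hypothetical.

```python
# Partial list: only the default English stop words quoted in this thread.
# The full default set in Lucene's EnglishAnalyzer contains more entries.
PARTIAL_ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "they", "this", "to", "was",
    "will", "with",
})


def remove_stop_words(tokens, stop_words=PARTIAL_ENGLISH_STOP_WORDS):
    """Lowercase each token and drop those found in stop_words."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in stop_words]


filtered = remove_stop_words(["This", "fox", "is", "in", "a", "box"])
print(filtered)  # ['fox', 'box']
```

Four of the six input tokens are dropped, which is the source of the index size savings discussed in this review.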

for that language.

Stop words in other supported languages can be accessed in Lucene:
https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis[Stopwords]

The list of stopwords for each language is different from snowball. You can find the entire list here:
https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/analysis/Analysis.java#L115

words, you can reduce the size of an index and increase performance, which is
important when disk space and memory are critical factors.



[float]

Probably want to add an anchor here like [[stop-analyzer-example-output]]

Stop words are common words set not to be indexed. By removing a list of common
words, you can reduce the size of an index and increase performance, which is
important when disk space and memory are critical factors.

This section is a bit misleading, since we added optimizations in Lucene 8 to speed up queries that contain stop words. Removing stop words is also challenging for other token filters such as `synonym_graph` and `synonym`, so now that we can run queries with stop words efficiently, I am not sure we should add such advice to the docs. Maybe we could just mention the disk space savings and add a note that search is faster when the total number of hits matching the query is not requested (which makes removing stop words unnecessary)?

they, this, to, was, will, with


==== Stop words in supported languages

Add an anchor here as well.

Each language analyzer defaults to using the appropriate stopwords list
for that language.

Stop words in other supported languages can be accessed in Lucene:

Passive voice

list of stop words. Defaults to `_english_`.

`stopwords_path`::

The path to a file containing stop words. This path is relative to the
Elasticsearch `config` directory.

The default English stopwords used in Elasticsearch are:

Passive voice

@jrodewig (Contributor) left a comment

LGTM if you're confident the default English stop words are accurate.

Made some minor suggestions and comments, but those are not blockers.

@jimczi (Contributor) left a comment

Thanks @kat257, I left some additional comments

@jimczi changed the title from "[DOCS] Added default stopwords and link to Lucene. Closes #16561" to "Added default stopwords and link to Lucene" May 27, 2019
@jrodewig self-requested a review August 1, 2019 14:06
@jrodewig closed this Aug 1, 2019
Labels: `>docs`, `:Search Relevance/Analysis`

Issue that merging this pull request may close: No documentation on what stopwords are in each set

4 participants