Added default stopwords and link to Lucene #41173
Conversation
Pinging @elastic/es-search
list of stop words. Defaults to `_english_`.

`stopwords_path`::

The path to a file containing stop words. This path is relative to the
Elasticsearch `config` directory.

The default English stopwords used in Elasticsearch are:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
Is this still the case?
I know Lucene has a file of EN stop words here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt
Might be worth checking with an engineer in @elastic/es-search.
It may also make more sense to move this section under the stopwords
param, or link to this section from there.
The list looks good. The file you linked is for the snowball filters and analyzers, but we don't use them in the stop
filter. The default list can be found here: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L46
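To illustrate the two parameters quoted above, here is a minimal sketch of index settings that uses both the predefined `_english_` list and a file-based list. The index name, analyzer names, and file path are placeholders, not part of this PR:

[source,console]
----
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_english_stop": {
          "type": "stop",
          "stopwords": "_english_"                          <1>
        },
        "file_based_stop": {
          "type": "stop",
          "stopwords_path": "analysis/custom_stopwords.txt" <2>
        }
      }
    }
  }
}
----
<1> Uses the predefined `_english_` list described in this section.
<2> Resolved relative to the Elasticsearch `config` directory; the file name is only an example.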
for that language.

Stop words in other supported languages can be accessed in Lucene:
https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis[Stopwords]
It might be worth mentioning that some of these are under the snowball
path:
https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball
The list of stopwords for each language is different from snowball. You can find the entire list here:
https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/analysis/Analysis.java#L115
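For reference, those per-language lists are exposed in Elasticsearch as `_lang_`-style names that the `stop` token filter accepts. A sketch, with made-up index, filter, and analyzer names:

[source,console]
----
PUT /german-example
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"            <1>
        }
      },
      "analyzer": {
        "german_text": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "german_stop" ]
        }
      }
    }
  }
}
----
<1> `_german_` resolves to the German stopwords list bundled with Lucene.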
words, you can reduce the size of an index and increase performance, which is
important when disk space and memory are critical factors.

[float]
Probably want to add an anchor here like [[stop-analyzer-example-output]]
Stop words are common words set not to be indexed. By removing a list of common
words, you can reduce the size of an index and increase performance, which is
important when disk space and memory are critical factors.

This section is a bit misleading, since we added optimizations in Lucene 8 to speed up queries that contain stop words. Removing stop words is also challenging for other token filters like `synonym_graph` and `synonym`. Now that we can run queries with stop words efficiently, I am not sure we should give such advice in the docs. Maybe we could just mention the disk space savings and add a note that search is faster when the total hit count for the query is not requested (which makes removing stop words unnecessary)?
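As a rough sketch of the "total hits not requested" case mentioned in the comment above, a search can turn off exact hit counting with `track_total_hits`; the index and field names below are placeholders:

[source,console]
----
GET /my-index/_search
{
  "track_total_hits": false,           <1>
  "query": {
    "match": {
      "body": "to be or not to be"     <2>
    }
  }
}
----
<1> With hit counting disabled, the Lucene 8 optimizations referred to above can skip non-competitive documents, so removing stop words at index time is less necessary for query speed.
<2> `body` is just an example field; this query text consists almost entirely of default English stop words.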
they, this, to, was, will, with

==== Stop words in supported languages
Add an anchor here as well.
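To see the default `_english_` list above in action, the `_analyze` API can be run against the built-in `stop` analyzer. A sketch; the sample text is arbitrary:

[source,console]
----
POST /_analyze
{
  "analyzer": "stop",
  "text": "it is a quick brown fox"
}
----

The response should contain only the tokens `quick`, `brown`, and `fox`, since `it`, `is`, and `a` are all on the default English list.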
Each language analyzer defaults to using the appropriate stopwords list
for that language.

Stop words in other supported languages can be accessed in Lucene:
Passive voice
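The quoted sentence ("Each language analyzer defaults to using the appropriate stopwords list for that language") could be backed by a small example. In the sketch below (index and analyzer names are made up) the `french` analyzer uses its built-in French list by default, and stop word removal can be switched off with `_none_`:

[source,console]
----
PUT /french-example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "french_default": {
          "type": "french"              <1>
        },
        "french_keep_stopwords": {
          "type": "french",
          "stopwords": "_none_"         <2>
        }
      }
    }
  }
}
----
<1> Uses the built-in French stopwords list by default.
<2> `_none_` disables stop word removal for this analyzer.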
list of stop words. Defaults to `_english_`.

`stopwords_path`::

The path to a file containing stop words. This path is relative to the
Elasticsearch `config` directory.

The default English stopwords used in Elasticsearch are:
Passive voice
LGTM if you're confident the default English stop words are accurate.
Made some minor suggestions and comments, but those are not blockers.
Thanks @kat257, I left some additional comments
Closes #16561