No documentation on what stopwords are in each set #16561

Closed
soaxelbrooke opened this issue Feb 9, 2016 · 9 comments · Fixed by #53059
Labels
>docs General docs changes good first issue low hanging fruit :Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@soaxelbrooke

For instance, there is no documentation of what words are in the _english_ stop words set.

@nik9000 nik9000 added the >docs General docs changes label Feb 9, 2016
@clintongormley
Contributor

Yeah, I've tried to put together a list before, but they're quite difficult to round up in Lucene.

@clintongormley clintongormley added the help wanted adoptme label Feb 13, 2016
@hjc

hjc commented Feb 15, 2016

I can certainly confirm that these are quite hard to round up. We had a client demand the list of all stop words we were using for every language configured. Here is a quick description of what it took for me to find, and prove, that the list of stop words that ES is using comes from org.apache.lucene.analysis.snowball, and where to find the files. As you can see, it wasn't fun.

Maybe this is something I can take if I have some free time? Seems like a great beginner ticket and I have experience doing this already!

@clintongormley
Contributor

@hjc1710 please do - would be great to have this info in the docs

@debadair debadair added the good first issue low hanging fruit label Mar 17, 2018
@debadair
Contributor

We should link to the Lucene docs.

@colings86 colings86 added the :Search Relevance/Analysis How text is split into tokens label Apr 24, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@jrodewig
Contributor

jrodewig commented Oct 4, 2019

[docs issue triage]

Leaving open as this is still relevant.

@ScottieL
Contributor

ScottieL commented Oct 8, 2019

Hi, I am a beginner who is looking to contribute. This issue seems fairly straightforward. From what I have seen so far, the stop words come from a combination of sources, such as Lucene's Snowball module and the IR Multilingual Resources at UniNE.

french

arabic

I can go ahead and look through all the languages and start compiling a list of words. I have uploaded an example of the Arabic stopwords. I cannot obtain a translation for them yet, but can work towards it. I figured getting a list of every language's stopwords would be a better first step. Hopefully I am on the right track. Please review the attached file and let me know if this format would satisfy documentation efforts.

arabic_stop.txt

@jrodewig
Contributor

jrodewig commented Oct 8, 2019

@ScottieL Thanks for offering to contribute. Please feel free to open a PR.

As @debadair previously mentioned, I believe that linking to the stop word source will be sufficient. I don't think we need to translate them or necessarily include them verbatim in the Elasticsearch docs.

This will let us point users to the stopwords without getting out of sync.

ScottieL added a commit to ScottieL/elasticsearch that referenced this issue Oct 14, 2019
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024