No documentation on what stopwords are in each set #16561
Comments
Yeah, I've tried to put together a list before, but they're quite difficult to round up in Lucene.
I can certainly confirm that these are quite hard to round up. We had a client demand the list of all stop words we were using for every language configured. Here is a quick description of what it took for me to find, and prove, that the list of stop words that ES is using come from Maybe this is something I can take if I have some free time? It seems like a great beginner ticket, and I have experience doing this already!
@hjc1710 please do - would be great to have this info in the docs
We should link to the Lucene docs.
Pinging @elastic/es-search-aggs
[docs issue triage] leaving open as this is still relevant.
Hi, I am a beginner looking to contribute. This issue seems fairly straightforward. From what I have looked at so far, the stop words seem to come from a combination of sources, such as Lucene Snowball and the IR Multilingual Resources at UniNE. I can go through all the languages and start compiling a list of words. I have uploaded an example of the Arabic stop words. I cannot obtain a translation for them, but can work towards it. I figured that getting a list of every language's stop words would be a better first step. Hopefully I am on the right track. Please review the attached file and let me know whether this format would satisfy the documentation effort.
@ScottieL Thanks for offering to contribute. Please feel free to open a PR. As @debadair previously mentioned, I believe that linking to the stop word source will be sufficient. I don't think we need to translate them or necessarily include them verbatim in the Elasticsearch docs. This will let us point users to the stopwords without getting out of sync.
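For anyone who wants to read the linked source files directly: the Snowball-derived lists are plain text in which `|` starts a comment and one line may hold several words. Below is a minimal Python sketch of reading that format, assuming those two rules are all that matter. The function name `parse_snowball_stopwords` and the sample text are hypothetical and made up for illustration; the sample is not the real contents of any stop word set.

```python
def parse_snowball_stopwords(text):
    """Return the stop words found in Snowball-format text, in file order."""
    words = []
    for line in text.splitlines():
        # Everything after the first '|' on a line is a comment.
        content = line.split("|", 1)[0]
        # A line may hold zero, one, or several whitespace-separated words.
        words.extend(content.split())
    return words


# Illustrative sample only (assumption): NOT the actual list used by any analyzer.
sample = """\
 | An illustrative sample in Snowball format
i              | first person
me
my  myself     | two words on one line

 | a comment-only line
"""

stopwords = parse_snowball_stopwords(sample)
print(stopwords)
```

Running the same function over a real source file would give the word list for that language without copying it into the docs, so it cannot drift out of sync with upstream.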
For instance, there is no documentation of what words are in the _english_ stop words set.