Query about stop words and the relationship to stemmer. #194
Comments
The list of stop words has not been stemmed. Lunr applies the stop word filter before stemming the tokens. This could mean that we admit tokens into the index that share a stem with a stop word. In practice, this isn't a problem; I don't think there is an issue with any common words sharing a stem with […]. In general, the idea is that stop words must match exactly, so I don't think this is an issue. If you can think of cases where it would cause problems, please let me know.

With regard to the stemmer, lunr's stemmer is also an implementation (or copy) of the PorterStemmer you reference. There are tests as well, which I assume I also got from the PorterStemmer project, though I can't remember. Do these tests not match what you have?

P.S. The Elm version of lunr is really cool!
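To make the ordering described above concrete, here is a minimal sketch of a token pipeline that filters stop words first and stems second. It does not use lunr's actual code: `toyStem`, `stopWordFilter`, `runPipeline`, and the three-word stop list are hypothetical names and data made up for illustration, and `toyStem` is a deliberately crude stand-in rather than the Porter algorithm.

```javascript
// Hypothetical stand-in for a stemmer; this is NOT the Porter algorithm.
// It only drops a trailing "e" from words longer than four letters.
function toyStem(token) {
  return token.length > 4 && token.endsWith("e") ? token.slice(0, -1) : token;
}

// Illustrative stop word list, not lunr's default list.
const stopWords = new Set(["because", "the", "this"]);

// Returns undefined for stop words so the pipeline can drop them.
function stopWordFilter(token) {
  return stopWords.has(token) ? undefined : token;
}

// Runs each token through the stages in order, dropping removed tokens.
function runPipeline(tokens, stages) {
  const out = [];
  for (const token of tokens) {
    let current = token;
    for (const stage of stages) {
      current = stage(current);
      if (current === undefined) break;
    }
    if (current !== undefined) out.push(current);
  }
  return out;
}

// Filter first, stem second: stop words are compared in unstemmed form.
const indexed = runPipeline(
  ["because", "the", "engine", "stalls"],
  [stopWordFilter, toyStem]
);
console.log(indexed); // [ 'engin', 'stalls' ]
```

Because the filter sees "because" before any stemming happens, the unstemmed stop word list matches it exactly.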
I mentioned those 3 words because I ran into hiccups with all 3 of them. Specifically, the Porter stem of "because" is "becaus". If the stemmer is not applied to the stop word list before the stop word filter is created, then any time the word "because" is found in an article being indexed it won't get blocked, because the word from the article is "becaus", which won't match the unstemmed stop word "because". However, from your comment it is likely that I just misread the lunr.js code and missed where you applied the stemmer to the stop words before creating the filter list. As to the stemmer differences, this was the only one; afaik it is in "Step1 c".
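The mismatch described here only arises when stemming runs before the stop word filter. The sketch below reproduces it under that assumption; `stemToken` is a hypothetical stand-in (not the Porter stemmer), and the stop word list and sample tokens are invented for the example.

```javascript
// Hypothetical stand-in stemmer (not Porter): drops a trailing "e"
// from words longer than four letters, so "because" becomes "becaus".
const stemToken = (t) => (t.length > 4 && t.endsWith("e") ? t.slice(0, -1) : t);

// Illustrative stop words, in raw and pre-stemmed form.
const rawStopWords = ["because", "the"];
const unstemmedSet = new Set(rawStopWords);
const stemmedSet = new Set(rawStopWords.map(stemToken));

// Stem first, then drop any token found in the given stop word set.
function stemThenFilter(tokens, stopSet) {
  return tokens.map(stemToken).filter((t) => !stopSet.has(t));
}

const tokens = ["because", "the", "index"];
const leaked = stemThenFilter(tokens, unstemmedSet);
const blocked = stemThenFilter(tokens, stemmedSet);

console.log(leaked);  // [ 'becaus', 'index' ]
console.log(blocked); // [ 'index' ]
```

With the unstemmed set, "becaus" slips into the output; with the pre-stemmed set it is removed. Neither set is needed if the filter simply runs before the stemmer.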
A brief follow-up: I just got a nice bug report that demonstrated a clear issue with running the stemmer before applying the stop words, so I have changed the behavior of my implementation to apply filters before the stemmer.
Firstly, I just want to say lunr.js is very cool.
I have been implementing lunr in the Elm language.
I have run across something I found a bit odd, and thought I would check with you, as it affects lunr.js as well as far as I know.
In a test I noticed that one of the stop words was not pre-stemmed. Given my reading of lunr.js, does this mean those stop words can never be applied?
3 examples of the 24 I found against my code base.
I can supply the word mismatches, but several won't apply to lunr because of the changes to my stemmer mentioned below.
I am pretty sure that lunr.js does not stem the stop word list before using it as the stop word filter.
If you agree this is likely an issue, maybe running the default stop words and any user-supplied ones through the stemmer to make the filter would address the problem?
I don't believe we can pre-stem the words, in case users add additional transforms to the token processing.
I wrote a quick test to check the default stop words against their stemmed versions and found a bunch that don't match. Note that my stemmer has drifted away from the lunr one, because I decided to make it pass the tests available on the Porter stemmer page at http://tartarus.org/martin/PorterStemmer/, namely the voc.txt and output.txt files, which I turned into a big, slow test to run occasionally.
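A check along these lines might look like the sketch below. It is not the test from my code base: `findStemMismatches` and `dropFinalE` are hypothetical names, the four-word list is a made-up sample rather than lunr's default stop words, and `dropFinalE` is a crude stand-in for a real Porter stemmer.

```javascript
// Returns the words whose stemmed form differs from the original word.
function findStemMismatches(words, stem) {
  return words
    .map((word) => ({ word, stemmed: stem(word) }))
    .filter(({ word, stemmed }) => word !== stemmed);
}

// Hypothetical stand-in stemmer (not Porter): drops a trailing "e"
// from words longer than four letters.
const dropFinalE = (w) => (w.length > 4 && w.endsWith("e") ? w.slice(0, -1) : w);

// Made-up sample list for illustration only.
const sampleStopWords = ["because", "and", "the", "were"];

const mismatches = findStemMismatches(sampleStopWords, dropFinalE);
console.log(mismatches); // [ { word: 'because', stemmed: 'becaus' } ]
```

Passing a real stemmer and the full stop word list as the two arguments would list every stop word that a stem-first pipeline would fail to block.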
Cheers, Robin.