Query about stop words and the relationship to stemmer. #194

Closed
rluiten opened this issue Dec 25, 2015 · 3 comments

rluiten commented Dec 25, 2015

Firstly, I just want to say lunr.js is very cool.

I have been implementing lunr in the Elm language.
I have run across something I found a bit odd, and thought I would check with you, as it affects lunr.js as well, as far as I know.

In a test I noticed that one of the stop words was not pre-stemmed. Given my reading of lunr.js, does this mean those stop words can never be applied?

Three examples of the 24 I found against my code base (stop word, then its stemmed form):

  • "because"; got: "becaus"
  • "however"; got: "howev"
  • "likely"; got: "like"

I can supply the word mismatches, but several won't apply to lunr because of the changes to my stemmer mentioned below.

I am pretty sure that lunr.js does not stem the stop word list before using it as the stop word filter.
If you agree this is likely an issue, maybe running the default stop words and any user-supplied ones through the stemmer to build the filter would address the problem?

I don't believe we can pre-stem the words, in case users add additional transforms to the token processing.

I wrote a quick test to check the default stop words against their stemmed versions and found a bunch that don't match. Note that my stemmer has drifted away from the lunr one, because I decided to make it pass the tests available on the Porter stemmer page at http://tartarus.org/martin/PorterStemmer/, namely the voc.txt and output.txt files, which I turned into a big slow test to run occasionally.
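
A check along these lines could be sketched as follows. This is only an illustration: `findStemMismatches`, `toyStem`, and the sample word list are hypothetical names, and the toy stemmer is a lookup table that knows just the three stems quoted above, standing in for a real Porter stemmer.

```javascript
// Hypothetical helper: list the stop words whose stemmed form differs
// from the word itself. `stem` is any function from string to string.
function findStemMismatches(stopWords, stem) {
  return stopWords
    .map((word) => ({ word, stemmed: stem(word) }))
    .filter(({ word, stemmed }) => word !== stemmed);
}

// Toy stand-in for a real stemmer (assumption): it only knows the three
// stems quoted in this issue and returns every other word unchanged.
const knownStems = new Map([
  ["because", "becaus"],
  ["however", "howev"],
  ["likely", "like"],
]);
const toyStem = (word) => (knownStems.has(word) ? knownStems.get(word) : word);

// Small sample list, not the full default stop word list.
const sampleStopWords = ["and", "because", "however", "likely", "the"];

const mismatches = findStemMismatches(sampleStopWords, toyStem);
for (const { word, stemmed } of mismatches) {
  console.log(`"${word}"; got: "${stemmed}"`);
}
```

Passing the stemmer in as a parameter keeps the check independent of which stemmer implementation is under test.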

Cheers Robin.

@olivernn
Owner

The list of stop words has not been stemmed. Lunr applies the stop word filter before stemming the tokens. This could mean that we admit tokens into the index that share a stem with a stop word. In practice this isn't a problem: I don't think any common words share a stem with "because" or "however". "likely" seems more, ahem, likely to share a stem, but "like" is also in the stop word list.

I think, in general, the idea is that the stop words must match exactly, so I don't think this is an issue. If you can think of cases where it would cause problems, please let me know.
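
The ordering described here (exact-match stop word filter first, stemmer second) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `runPipeline`, `stopWordFilter`, `stemmer`, and the lookup-table stemmer are hypothetical and are not lunr's actual API, and the stop word set is a small sample rather than the real list.

```javascript
// Toy stand-in for a Porter stemmer (assumption): it only knows the
// stems listed here and returns every other token unchanged.
const toyStems = new Map([
  ["because", "becaus"],
  ["likely", "like"],
  ["likes", "like"],
]);
const stemToken = (token) => (toyStems.has(token) ? toyStems.get(token) : token);

// Small sample of stop words, matched exactly against unstemmed tokens.
const stopWords = new Set(["because", "however", "like", "likely"]);

// Each pipeline step takes an array of tokens and returns a new array.
const stopWordFilter = (tokens) => tokens.filter((t) => !stopWords.has(t));
const stemmer = (tokens) => tokens.map(stemToken);

// Run the steps in the order given.
function runPipeline(tokens, steps) {
  return steps.reduce((current, step) => step(current), tokens);
}

// Filter first, then stem.
const indexedTerms = runPipeline(
  ["because", "likely", "likes", "search"],
  [stopWordFilter, stemmer]
);
console.log(indexedTerms);
```

With this ordering "because" and "likely" are removed before they are ever stemmed, while "likes", which is not in the sample stop list, passes the filter and is indexed under the stem "like". That is the "shares a stem with a stop word" case mentioned above.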

With regard to the stemmer, lunr's stemmer is also an implementation (or copy) of the PorterStemmer you reference. There are also tests, which I assume I got from the PorterStemmer project as well, though I can't remember. Do these tests not match what you have?

p.s. The elm version of lunr is really cool!

@rluiten
Author

rluiten commented Jan 20, 2016

I mentioned those 3 words as I ran into hiccups with all 3 of them.

Specifically, the Porter stem of "because" is "becaus". If the stemmer is not applied to the stop word list before the stop word filter is created, then any time the word "because" is found in an article being indexed it won't get blocked, because the word from the article is "becaus", which won't match the unstemmed stop word "because".
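
The failure described above only arises when the stemmer runs before the stop word filter. A minimal sketch of that situation, using a hypothetical lookup-table stemmer (`toyStem`) in place of a real Porter stemmer and a two-word sample stop list:

```javascript
// Toy stand-in for a Porter stemmer (assumption): it only knows the
// stems quoted in this thread and returns other words unchanged.
const stemTable = new Map([
  ["because", "becaus"],
  ["however", "howev"],
]);
const toyStem = (word) => (stemTable.has(word) ? stemTable.get(word) : word);

const stopWordList = ["because", "however"];

// Filter built from the raw list versus one built from the stemmed list.
const rawFilter = new Set(stopWordList);
const stemmedFilter = new Set(stopWordList.map(toyStem));

// The stemmer has already run by the time the filter sees the tokens.
const stemmedTokens = ["because", "search"].map(toyStem);

// "becaus" is not in the raw list, so it slips through.
const leaked = stemmedTokens.filter((t) => !rawFilter.has(t));

// Against the stemmed list, "becaus" is blocked as intended.
const blocked = stemmedTokens.filter((t) => !stemmedFilter.has(t));

console.log(leaked, blocked);
```

Stemming the stop list makes the filter match again, at the cost of also blocking any other word that happens to share a stem with a stop word.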

However, from your comment it is likely I just misread the lunr.js code and missed where you applied the stemmer to the stop words before creating the filter list.

As to the stemmer differences, this was the only one afaik; it is in "Step 1c":

  • Difference in lunr.js to porter stemmer
    • lunr stem "lay" == "lay"
    • lunr stem "try" == "tri"
  • Porter stemmer test fixture voc.txt and output.txt contain the following.
    • stem "lay" == "lai"
    • stem "try" == "try"
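
For illustration, here are two step 1c rules in isolation that would produce the two sets of outputs listed above. The first follows the rule in the original Porter algorithm, (*v*) Y -> I. The second replaces a trailing "y" with "i" only when it follows a non-vowel that is not the first letter, which is the rule used by the later Snowball English ("Porter2") stemmer. Whether lunr.js implements step 1c exactly this way is an assumption; the function names are hypothetical, and neither function is a complete stemmer.

```javascript
const VOWELS = "aeiou";

// Porter's definition: a, e, i, o, u are vowels, and so is "y" when it
// directly follows a consonant.
function isVowelAt(word, i) {
  const ch = word[i];
  if (VOWELS.includes(ch)) return true;
  return ch === "y" && i > 0 && !isVowelAt(word, i - 1);
}

function containsVowel(stem) {
  for (let i = 0; i < stem.length; i++) {
    if (isVowelAt(stem, i)) return true;
  }
  return false;
}

// Original Porter rule: replace a final "y" with "i" if the part of the
// word before it contains a vowel.
function step1cOriginal(word) {
  if (!word.endsWith("y")) return word;
  const stem = word.slice(0, -1);
  return containsVowel(stem) ? stem + "i" : word;
}

// Variant rule (assumed to match the lunr outputs above): replace a final
// "y" with "i" if it follows a non-vowel that is not the first letter.
function step1cVariant(word) {
  if (word.length < 3 || !word.endsWith("y")) return word;
  const previous = word[word.length - 2];
  return "aeiouy".includes(previous) ? word : word.slice(0, -1) + "i";
}

for (const word of ["lay", "try"]) {
  console.log(word, step1cOriginal(word), step1cVariant(word));
}
```

For "lay" the original rule gives "lai" and the variant gives "lay"; for "try" the original rule gives "try" and the variant gives "tri".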

@rluiten
Author

rluiten commented Apr 30, 2017

A brief follow-up: I just got a nice bug report that demonstrated a clear issue with running the stemmer before applying the stop words. So I have changed my behavior to apply the filters before the stemmer.

@rluiten rluiten closed this as completed Apr 30, 2017

2 participants