
Can't index or search for "one" (or: maybe filter stopwords before stemming?) #10

Closed
tgecho opened this issue Apr 29, 2017 · 4 comments


tgecho commented Apr 29, 2017

In my initial testing, I got an error when I searched for the text "one". As far as I can tell, it looks like processTokens transforms the tokens before filtering them. In this case, "one" becomes "on" and is filtered out as a stopword.

I don't know much about this stuff, but it would seem to make sense to filter out stopwords first? This seems to be what lunr.js does, according to this comment.

Example: https://ellie-app.com/346jy8VmsMXa1/0
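The ordering problem described above can be sketched outside the library. The Python below is only an illustration under stated assumptions: `toy_stem`, `STOP_WORDS`, and both function names are hypothetical, and the stemmer is a stand-in that merely drops a trailing "e", not the stemmer the library uses.

```python
# Tiny illustrative stop word list (assumption, not the library's list).
STOP_WORDS = {"on", "the", "a"}


def toy_stem(token: str) -> str:
    """Stand-in for a real stemmer: drops a trailing 'e' from longer tokens."""
    return token[:-1] if len(token) > 2 and token.endswith("e") else token


def transform_then_filter(tokens):
    """Order reported in this issue: stem first, then drop stop words."""
    stemmed = [toy_stem(t) for t in tokens]
    return [t for t in stemmed if t not in STOP_WORDS]


def filter_then_transform(tokens):
    """Order suggested in this issue: drop stop words first, then stem."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in kept]


print(transform_then_filter(["one", "ring"]))  # "one" is stemmed to "on" and lost
print(filter_then_transform(["one", "ring"]))  # "one" survives the filter
```

With the first ordering the token "one" never reaches the index, which matches the symptom reported here.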

Owner

rluiten commented Apr 29, 2017

I believe that is a good example of the problem, thank you.

I believe I will need to change the behavior so that the stop word lists are applied to the original (untransformed) tokens, as in your example. I was looking for some clarification at the time, as you found in that comment, but I never got beyond that point.

I am not too sure at the moment how much work the change will require, but I hope to explore it more in the next few days.

tgecho changed the title from "Can't index or search for "one", or filter stopwords before stemming?" to "Can't index or search for "one" (or: maybe filter stopwords before stemming?)" on Apr 30, 2017
Author

tgecho commented Apr 30, 2017

One other thing to keep in mind is that the word "one" in the doc isn't even indexed. If you change the doc text to simply "one", an `Err "Error after tokenisation there are no terms to index."` is returned.

I haven't tried to follow all of the ramifications, so I apologize for speaking from a position of relative ignorance. Might it make sense to combine the TransformFunc and FilterFunc types into a single String -> Maybe String?

This could allow a fully custom single list of transform functions which could filter and/or transform as the user needs. From poking around, it seems that there may be some valid use cases for doing it both ways.
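As a rough sketch of that idea, here is how a single list of `String -> Maybe String` style functions might behave, written in Python with `Optional[str]` standing in for `Maybe String`. Every name here (`TokenFunc`, `run_pipeline`, `process`, `drop_stop_words`, `toy_stem`) is hypothetical and is not part of the library; the stemmer is a toy that only drops a trailing "e".

```python
from typing import Callable, List, Optional, Sequence

# One function type covers both roles: return a string to transform,
# or None to filter the token out.
TokenFunc = Callable[[str], Optional[str]]

STOP_WORDS = {"on", "the", "a"}  # illustrative list only


def drop_stop_words(token: str) -> Optional[str]:
    """Filter expressed as a TokenFunc: None means 'remove this token'."""
    return None if token in STOP_WORDS else token


def toy_stem(token: str) -> Optional[str]:
    """Transform expressed as a TokenFunc (toy stemmer, drops a trailing 'e')."""
    return token[:-1] if len(token) > 2 and token.endswith("e") else token


def run_pipeline(funcs: Sequence[TokenFunc], token: str) -> Optional[str]:
    """Apply each function in order, stopping as soon as one returns None."""
    current: Optional[str] = token
    for func in funcs:
        if current is None:
            return None
        current = func(current)
    return current


def process(funcs: Sequence[TokenFunc], tokens: Sequence[str]) -> List[str]:
    """Run every token through the pipeline and keep the survivors."""
    results = (run_pipeline(funcs, t) for t in tokens)
    return [r for r in results if r is not None]


# The caller picks the order, so both behaviours are available.
print(process([drop_stop_words, toy_stem], ["one", "on"]))
print(process([toy_stem, drop_stop_words], ["one", "on"]))
```

The appeal of this shape is that ordering becomes a user decision rather than something fixed by the library, which fits the observation that both orders may have valid uses.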

I'm happy to attempt a PR if this sounds interesting/feasible and you don't have much time in the short term.

Thanks for the great library!

rluiten added a commit that referenced this issue Apr 30, 2017
See #10
"Can't index or search for "one" (or: maybe filter stopwords before stemming?)"

Bumped `indexVersion` to `1.1.0` as filter and transform changes could introduce surprising behaviour.

Index now applies Initial Transforms, then Filters (stop word filters), then Transforms.
The only Initial Transform that is currently used is `TokenProcessors.trimmer`, which removes non-word character prefixes and suffixes.
This initial transform is useful to better match stop word filters. The word "one" will now properly index and be found even though when transformed it matches a default stop word of "on".

In addition, some type changes were introduced to make it possible in future to implement loading, and possibly saving, of older versions of the index.
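The three-stage order described in the commit message above can be sketched as follows. This is a Python illustration under assumptions, not the library's implementation: `trimmer` here is a hypothetical regex-based stand-in for `TokenProcessors.trimmer`, `toy_stem` is a toy stemmer that only drops a trailing "e", the stop word list is made up, and dropping tokens that are empty after trimming is an extra assumption of this sketch.

```python
import re

STOP_WORDS = {"on", "the", "a"}  # illustrative list only


def trimmer(token: str) -> str:
    """Strip non-word characters from both ends of a token (stand-in)."""
    return re.sub(r"^\W+|\W+$", "", token)


def toy_stem(token: str) -> str:
    """Toy stemmer: drops a trailing 'e' from longer tokens."""
    return token[:-1] if len(token) > 2 and token.endswith("e") else token


def process_tokens(tokens):
    # 1. Initial transforms: trim so that "on," can match the stop word "on".
    trimmed = [trimmer(t) for t in tokens]
    # 2. Filters: stop words are checked against the untransformed word.
    #    (Dropping empty tokens is an assumption of this sketch.)
    kept = [t for t in trimmed if t and t not in STOP_WORDS]
    # 3. Transforms: stemming happens last, so "one" is kept and becomes "on".
    return [toy_stem(t) for t in kept]


print(process_tokens(["(one)", "on,", "ring"]))
```

Trimming before filtering matters because a token such as "on," would otherwise slip past a stop word list that contains only "on".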
Owner

rluiten commented Apr 30, 2017

I have published version 4.0.0, which includes a fix for the filter behavior.
I briefly tried to see if Ellie had picked up the change already but it had not.
I assume it takes a while for Ellie to get new package versions.

Author

tgecho commented May 4, 2017

That did the trick! Looks like Ellie pins the library versions. Here's one with an update to 4.0.0 that confirms the fix: https://ellie-app.com/346jy8VmsMXa1/1

Thank you so much!

tgecho closed this as completed on May 4, 2017