Can't index or search for "one" (or: maybe filter stopwords before stemming?) #10
Comments
That is a good example of the problem, I believe, thank you. I do believe I will need to modify the behavior to filter against the original stop word lists, as per your example. I was looking for some clarification at the time, as you found in that comment, but I never got beyond that point. I am not too sure at the moment how much work the required change will be, but I hope to explore it more in the next few days.
One other thing to keep in mind is that the word "one" in the doc isn't even indexed. If you change the doc text to simply "one" an I haven't tried to follow all of the ramifications, so I apologize for speaking from a position of relative ignorance. Might it make sense to combine the This could allow a fully custom single list of transform functions which could filter and/or transform as the user needs. From poking around, it seems that there may be some valid use cases for doing it both ways. I'm happy to attempt a PR if this sounds interesting/feasible and you don't have much time in the short term. Thanks for the great library!
See #10 "Can't index or search for "one" (or: maybe filter stopwords before stemming?)". Bumped `indexVersion` to `1.1.0` as the filter and transform changes could introduce surprising behaviour. The index now applies Initial Transforms, then Filters (stop word filters), then Transforms. The only Initial Transform currently used is `TokenProcessors.trimmer`, which removes non-word characters from the start and end of a token. This initial transform helps tokens match the stop word filters more reliably. The word "one" will now be indexed and found properly, even though its transformed form matches the default stop word "on". In addition, some type changes were introduced to make it possible in future to load, and possibly save, older versions of the index.
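The ordering described in this change (Initial Transforms, then Filters, then Transforms) can be sketched roughly as follows. This is a Python illustration only, not the library's Elm code: `process_tokens`, `toy_stem`, `not_stop_word`, and the three-word stop list are hypothetical stand-ins, and `trimmer` merely imitates what the commit message says `TokenProcessors.trimmer` does.

```python
import re

# Hypothetical stop word list; the library's real defaults are much longer.
STOP_WORDS = {"on", "and", "in"}


def trimmer(token):
    # Strip non-word characters from the start and end of the token.
    return re.sub(r"^\W+|\W+$", "", token)


def toy_stem(token):
    # Toy stand-in for a real stemmer: drops a trailing "e" ("one" -> "on").
    return token[:-1] if len(token) > 2 and token.endswith("e") else token


def not_stop_word(token):
    # Filter predicate: True means the token is kept.
    return token not in STOP_WORDS


def process_tokens(tokens, initial_transforms, filters, transforms):
    result = []
    for token in tokens:
        # 1. Initial transforms run first so filters see a cleaned token.
        for transform in initial_transforms:
            token = transform(token)
        # 2. Filters are checked against the un-stemmed token.
        if not token or not all(keep(token) for keep in filters):
            continue
        # 3. Remaining transforms (e.g. stemming) run only on kept tokens.
        for transform in transforms:
            token = transform(token)
        if token:
            result.append(token)
    return result


indexed = process_tokens(['"one"', "on,", "search"], [trimmer], [not_stop_word], [toy_stem])
query = process_tokens(["one"], [trimmer], [not_stop_word], [toy_stem])
```

In this sketch the quoted token `"one"` is trimmed, passes the stop word filter, and is only then stemmed to "on", so a query for "one" processed the same way produces a matching term. The literal "on," is trimmed to "on" and dropped as a stop word.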
I have published version 4.0.0 with a fix for the filter behavior.
That did the trick! Looks like Ellie pins the library versions. Here's one with an update to 4.0.0 that confirms the fix: https://ellie-app.com/346jy8VmsMXa1/1 Thank you so much!
In my initial testing, I got an error when I searched for the text "one". As far as I can tell, it looks like `processTokens` transforms the tokens before filtering them. In this case, "one" becomes "on" and is filtered out as a stopword. I don't know much about this stuff, but it would seem to make sense to filter out stopwords first? This seems to be what lunr.js does according to this comment
Example: https://ellie-app.com/346jy8VmsMXa1/0
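To make the suspected ordering problem concrete, here is a small Python sketch (not the library's Elm code). The function names, the tiny stop word list, and the toy stemmer are all made up for illustration; the only behaviour taken from the report is that "one" stems to "on", which is a stop word.

```python
# Hypothetical stop word list for illustration only.
STOP_WORDS = {"on", "at", "in"}


def toy_stem(token):
    # Toy stand-in for a real stemmer: drops a trailing "e" ("one" -> "on").
    return token[:-1] if len(token) > 2 and token.endswith("e") else token


def transform_then_filter(tokens):
    # Order described in the report: stem first, then remove stop words.
    stemmed = [toy_stem(t) for t in tokens]
    return [t for t in stemmed if t not in STOP_WORDS]


def filter_then_transform(tokens):
    # Suggested order: remove stop words first, then stem what is left.
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in kept]


tokens = ["one", "on", "search"]
lost = transform_then_filter(tokens)
kept = filter_then_transform(tokens)
```

With these assumptions, `transform_then_filter` loses "one" entirely because its stem collides with the stop word "on", while `filter_then_transform` removes only the literal "on" and still produces a term for "one".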