Can't index or search for "one" (or: maybe filter stopwords before stemming?) #10

tgecho · 2017-04-29T20:01:44Z

In my initial testing, I got an error when I searched for the text "one". As far as I can tell, it looks like processTokens transforms the tokens before filtering them. In this case, "one" becomes "on" and is filtered out as a stopword.
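A minimal sketch of the ordering problem described above, written in Python for illustration rather than in the library's own language. The stop word set and `toy_stem` are hypothetical stand-ins, not the library's real stop word list or stemmer; `toy_stem` only mimics the single "one" to "on" case.

```python
# Illustrative stop word set (assumption, not the library's default list).
STOP_WORDS = {"on", "the", "a"}


def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: drops one trailing "e".
    return token[:-1] if len(token) > 2 and token.endswith("e") else token


def transform_then_filter(tokens):
    # Order reported in this issue: stem first, then drop stop words.
    stemmed = [toy_stem(t) for t in tokens]
    return [t for t in stemmed if t not in STOP_WORDS]


def filter_then_transform(tokens):
    # Proposed order: drop stop words first, then stem what is left.
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in kept]


print(transform_then_filter(["one", "on"]))  # [] : "one" is lost
print(filter_then_transform(["one", "on"]))  # ['on'] : "one" survives as its stem
```

With the first ordering, a document containing only "one" produces no terms at all, which matches the "no terms to index" symptom.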

I don't know much about this stuff, but it would seem to make sense to filter out stopwords first? This seems to be what lunr.js does according to this comment

Example: https://ellie-app.com/346jy8VmsMXa1/0

rluiten · 2017-04-29T23:51:55Z

I believe that is a good example of the problem, thank you.

I do believe I will need to modify the behavior to filter against the original stop word lists, as per your example. I was looking for some clarification at the time, as you found in that comment, but I never got beyond that point.

I am not too sure at the moment how much work the change will require, but I hope to explore it more in the next few days.

tgecho · 2017-04-30T00:24:17Z

One other thing to keep in mind is that the word "one" in the doc isn't even indexed. If you change the doc text to simply "one" an Err "Error after tokenisation there are no terms to index." is returned.

I haven't tried to follow all of the ramifications, so I apologize for speaking from a position of relative ignorance. Might it make sense to combine the TransformFunc and FilterFunc types into a single String -> Maybe String?

This could allow a fully custom single list of transform functions which could filter and/or transform as the user needs. From poking around, it seems that there may be some valid use cases for doing it both ways.
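To make the `String -> Maybe String` idea concrete, here is a rough Python sketch in which `Optional[str]` plays the role of `Maybe String`. The names `Step`, `drop_if_in`, `toy_stem`, and `run_steps` are hypothetical and are not part of the library; `toy_stem` is a toy stand-in for a real stemmer.

```python
from typing import Callable, Iterable, List, Optional

# A step returns the (possibly changed) token, or None to drop it.
Step = Callable[[str], Optional[str]]


def drop_if_in(stop_words: Iterable[str]) -> Step:
    # Builds a filtering step from a stop word collection.
    words = frozenset(stop_words)
    return lambda token: None if token in words else token


def toy_stem(token: str) -> Optional[str]:
    # Toy transform step: drops one trailing "e"; never drops the token.
    return token[:-1] if len(token) > 2 and token.endswith("e") else token


def run_steps(steps: List[Step], token: str) -> Optional[str]:
    # Applies steps in list order and stops as soon as one drops the token.
    current: Optional[str] = token
    for step in steps:
        if current is None:
            return None
        current = step(current)
    return current


filter_first = [drop_if_in({"on"}), toy_stem]
stem_first = [toy_stem, drop_if_in({"on"})]
print(run_steps(filter_first, "one"))  # on
print(run_steps(stem_first, "one"))    # None
```

Because filters and transforms share one type, the user decides the order simply by arranging the list, which covers both of the use cases mentioned above.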

I'm happy to attempt a PR if this sounds interesting/feasible and you don't have much time in the short term.

Thanks for the great library!

See #10 "Can't index or search for "one" (or: maybe filter stopwords before stemming?)". Bumped `indexVersion` to `1.1.0` as filter and transform changes could introduce surprising behaviour. Index now applies Initial Transforms, then Filters (stop word filters), then Transforms. The only Initial Transform that is currently used is `TokenProcessors.trimmer`, which removes non-word characters from the start and end of a token. This initial transform helps tokens match the stop word filters. The word "one" will now be properly indexed and found, even though its transformed form matches the default stop word "on". In addition, some type changes were introduced to make it possible in future to implement loading, and possibly saving, of older versions of the index.
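The three-stage order described in this commit can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the library's code: `trimmer` here only approximates what `TokenProcessors.trimmer` is described as doing, and `STOP_WORDS`, `toy_stem`, and `process_tokens` are hypothetical.

```python
import re

# Illustrative stop word set (assumption, not the library's default list).
STOP_WORDS = {"on", "the"}


def trimmer(token):
    # Initial transform: strip non-word characters from both ends.
    return re.sub(r"^\W+|\W+$", "", token)


def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: drops one trailing "e".
    return token[:-1] if len(token) > 2 and token.endswith("e") else token


def process_tokens(tokens):
    # 1. Initial transforms, 2. filters (stop words), 3. transforms.
    trimmed = [trimmer(t) for t in tokens]
    kept = [t for t in trimmed if t and t not in STOP_WORDS]
    return [toy_stem(t) for t in kept]


print(process_tokens(["one,", "(on)", "stone"]))  # ['on', 'ston']
```

Trimming first is what lets "(on)" match the stop word "on"; without it the punctuation would hide the stop word from the filter.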

rluiten · 2017-04-30T12:26:49Z

I have published version 4.0.0 with a fix for the filter behavior.
I briefly checked whether Ellie had picked up the change already, but it had not.
I assume it takes a while for Ellie to get new package versions.

tgecho · 2017-05-04T00:16:08Z

That did the trick! Looks like Ellie pins the library versions. Here's one with an update to 4.0.0 that confirms the fix: https://ellie-app.com/346jy8VmsMXa1/1

Thank you so much!

tgecho changed the title ~~Can't index or search for "one", or filter stopwords before stemming?~~ Can't index or search for "one" (or: maybe filter stopwords before stemming?) Apr 30, 2017

tgecho closed this as completed May 4, 2017
