Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert stop words filter from SortedSet/Array to Object #170

Closed
wants to merge 2 commits into from

Conversation

bvaughn
Copy link
Contributor

@bvaughn bvaughn commented Aug 2, 2015

The stop words filter is currently implemented using a SortedSet and indexOf method. A faster method would be to use an Object and its keys directly.

Here is a JS perf demonstrating the performance gains:
http://jsperf.com/lunr-stop-words-object

Existing stop words tests pass after this change.

… to Object. All stop words tests pass after this change.
@olivernn
Copy link
Owner

olivernn commented Aug 3, 2015

This looks like a good start, though there are a couple of things that need to be handled with care. JavaScript objects are not great at being hashes. In this case anything on Object.prototype would be considered a stop word.

The best way to fix this is by creating the stopWords object with Object.create(null), however, as far as I'm aware this is not possible to shim.

The second way, and this is the way the idfCache is implemented, is to prefix all keys with some character to avoid any clashes with builtins. I've updated the benchmarks with a change like this. Might be worth running them a few times as Chrome seems to be slower now?

I think it might also be worth having this available as a function that creates stopWordFilters, just to make it simple for end users to provide their own stop word lists easily, something like:

lunr.stopWordFilter = lunr.stopWordFilterBuilder(["long", "list", "of", "stop", "words"])

What do you think?

@bvaughn
Copy link
Contributor Author

bvaughn commented Aug 3, 2015

Ah, that's a fair point.

I wonder if the reason for the slow-down in the updated example is the string concatenation?

I've added a newer version of the benchmark that removes the string concatenation while (I think) also addressing your Object.prototype concerns. Chrome performance is back to being in the green.

@bvaughn
Copy link
Contributor Author

bvaughn commented Aug 3, 2015

PS. If you are satisfied with the updated approach (demonstrated in the JS Perf test) I'll make the corresponding changes to my PR. :)

@bvaughn
Copy link
Contributor Author

bvaughn commented Aug 4, 2015

By the way, I've updated my pull request and added an additional test to cover the Object.prototype concerns you mentioned.

@olivernn
Copy link
Owner

olivernn commented Aug 9, 2015

Your changes are in the latest release of lunr. I added a small change to quote all the keys of the object. Thanks again for your work on this one!

@olivernn olivernn closed this Aug 9, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants