-
Notifications
You must be signed in to change notification settings - Fork 548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
don't add empty tokens (due to trimming) to the index to avoid errors while searching #166
Conversation
Please ignore the second commit. I'm not sure why it was auto-added to this pull request |
Thanks for taking a look at this. Perhaps I'm missing something, but what is the issue here? I was unable to reproduce the error but maybe I'm not understanding what the issue is. I did the following: var idx = lunr(function () {
this.field('foo')
})
idx.add({id: 1, foo: 'foo ???'})
idx.search('foo') // returns the right document
idx.search('???') // returns no documents
idx.tokenStore.length //= 1
idx.corpusTokens.toArray() //= ["foo"] This is the behaviour I would expect. What behaviour did you expect and what are you seeing? |
The empty token I described lead to an error while searching (input irrelevant). I can't give you the exact error message as I'm not at work right now, but it was something like a null-pointer equivalent while iterating over the tokens. |
Ah yes, that's probably it. The stop word filter rejects empty tokens. And so removing it means that those empty tokens make it into the index and then you get the issue you see. It seems strange to have the stop word filter include the empty token, its not really a word! I think it is safe to have that check in the trimmer and then not have it as part of the stop word filter. Would you mind updating this pull request with the following:
The language extensions also include the empty token in the stop word filters, but from looking at the code they also remove the trimmer function. I think this is fine though because they cannot use the built in trimmer as it is really focused on English and doesn't work well with non ascii characters. |
Also, out of interest, what use case do you have where you need to remove the stemmer and stop word filter? Just trying to understand how you are using lunr and whether there is something lunr can do to make your implementation easier. |
Ah okay, that makes sense. You'll receive another patch within the next couple of days. As for the stemmer, I believe the issue was with words like 'companies' where you'd find the entry when typing up to 'compan', but would suddenly not find it anymore if you appended an 'i'. Of course typing in the whole word you'll find it again. Since the results are shown instantaneously the user experience was a bit weird when the entry you were searching for appears, disappears and appears again as you type. We have therefore removed it as well, as our users know what they are searching for anyway - it's just meant to speed up the process of finding something for them (there are a lot of headings in the index). Regarding the stemming issue: I don't exactly know how you could improve upon it, as you can't really know that 'compani' will be 'companies' when the whole word is typed and therefore can't stem it yet, can you? |
Yeah I think the stemmer has caused issues like this for other people, I don't really have a good suggest to fix this I'm afraid. |
This reverts commit 7db31ac.
Please have another look at the changes I have committed. |
I've pushed your changes in the latest version of lunr. Thanks again for you help! |
The text 'test ???' will lead to the following tokens:
['test', '???']
After trimming this becomes:
['test', '']
The empty string then causes problems while searching the index. This fix avoids retaining the empty string after trimming.