don't add empty tokens (due to trimming) to the index to avoid errors while searching #166

janeisklar · 2015-07-29T13:33:10Z

The text 'test ???' will lead to the following tokens:
['test', '???']

After trimming this becomes:
['test', '']

The empty string then causes problems while searching the index. This fix avoids retaining the empty string after trimming.
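
To make the failure mode concrete, here is a minimal standalone sketch. It assumes a trimmer that strips leading and trailing non-word characters; the names trimToken, trimTokenDroppingEmpty and runPipeline are hypothetical stand-ins for illustration, not lunr's actual source.

```javascript
// Hypothetical trimmer: strips leading and trailing non-word characters.
function trimToken(token) {
  return token.replace(/^\W+/, '').replace(/\W+$/, '')
}

// Variant that returns undefined when nothing is left after trimming,
// which the runner below treats as "drop this token".
function trimTokenDroppingEmpty(token) {
  var trimmed = trimToken(token)
  return trimmed === '' ? undefined : trimmed
}

// Minimal pipeline runner (an assumption for this sketch):
// a step that returns undefined removes the token from the output.
function runPipeline(tokens, steps) {
  return tokens.reduce(function (out, token) {
    for (var i = 0; i < steps.length; i++) {
      token = steps[i](token)
      if (token === undefined) return out
    }
    out.push(token)
    return out
  }, [])
}

var tokens = 'test ???'.split(/\s+/)                      // ['test', '???']
var before = runPipeline(tokens, [trimToken])              // ['test', '']
var after = runPipeline(tokens, [trimTokenDroppingEmpty])  // ['test']
```

With the plain trimmer the empty string survives and would be indexed; with the guarded variant it is dropped before it reaches the index.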


janeisklar · 2015-07-29T14:17:43Z

Please ignore the second commit. I'm not sure why it was auto-added to this pull request.

olivernn · 2015-08-01T10:06:31Z

Thanks for taking a look at this. Perhaps I'm missing something, but what is the issue here? I was unable to reproduce the error but maybe I'm not understanding what the issue is.

I did the following:

var idx = lunr(function () {
  this.field('foo')
})

idx.add({id: 1, foo: 'foo ???'})

idx.search('foo') // returns the right document
idx.search('???') // returns no documents

idx.tokenStore.length //= 1
idx.corpusTokens.toArray() //= ["foo"]

This is the behaviour I would expect. What behaviour did you expect and what are you seeing?

janeisklar · 2015-08-01T10:14:53Z

The empty token I described led to an error while searching (the input was irrelevant). I can't give you the exact error message as I'm not at work right now, but it was something like a null-pointer equivalent while iterating over the tokens.
It could be that we're only seeing this issue because we've removed both the stemmer and the stop word filter from the pipeline, giving us different behaviour from what you are seeing.

olivernn · 2015-08-01T11:38:22Z

Ah yes, that's probably it. The stop word filter rejects empty tokens. And so removing it means that those empty tokens make it into the index and then you get the issue you see.
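
A rough sketch of why removing the filter exposes the problem is below. The stop word list is abbreviated and hypothetical, and trimmer, stopWordFilter and runSteps are illustrative stand-ins rather than lunr's own functions; the only assumption that matters is that the list contains the empty string.

```javascript
// Hypothetical, abbreviated stop word list. The '' entry is what
// happens to catch tokens that were trimmed down to nothing.
var stopWords = ['', 'a', 'and', 'in', 'the']

// Returns undefined for stop words, which drops the token.
function stopWordFilter(token) {
  return stopWords.indexOf(token) === -1 ? token : undefined
}

// Trimmer without an empty-string check (the behaviour before the fix).
function trimmer(token) {
  return token.replace(/^\W+/, '').replace(/\W+$/, '')
}

// Runs each token through the steps in order; undefined drops the token.
function runSteps(tokens, steps) {
  var out = []
  tokens.forEach(function (token) {
    for (var i = 0; i < steps.length; i++) {
      token = steps[i](token)
      if (token === undefined) return
    }
    out.push(token)
  })
  return out
}

var withFilter = runSteps(['foo', '???'], [trimmer, stopWordFilter]) // ['foo']
var withoutFilter = runSteps(['foo', '???'], [trimmer])              // ['foo', '']
```

In this sketch the empty token is only removed as a side effect of the stop word list, so a pipeline without that filter lets it through.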

It seems strange to have the stop word filter include the empty token; it's not really a word! I think it is safe to have that check in the trimmer and then not have it as part of the stop word filter. Would you mind updating this pull request with the following:

Remove this commit
Remove the empty word from the stop word filter

The language extensions also include the empty token in their stop word filters, but from looking at the code they also remove the trimmer function. I think this is fine though, because they cannot use the built-in trimmer: it is really focused on English and doesn't work well with non-ASCII characters.

olivernn · 2015-08-01T11:39:59Z

Also, out of interest, what use case do you have where you need to remove the stemmer and stop word filter? Just trying to understand how you are using lunr and whether there is something lunr can do to make your implementation easier.

janeisklar · 2015-08-01T12:25:37Z

Ah okay, that makes sense. You'll receive another patch within the next couple of days.
We've removed the stop word filter because we had trouble finding some words that it removed. For simplicity, let's just say that the application allows users to assemble a book from pre-defined chapters that can be searched for using lunr. One section was named 'in-depth something' and lunr didn't find anything for 'in-', a lot for 'in-d', and I think you had to type in a lot more characters to actually limit the search results to 'in-depth'. For our use case I think we're better off without the stop word filter, as the search is not used for large documents, but rather for short section headings.

As for the stemmer, I believe the issue was with words like 'companies' where you'd find the entry when typing up to 'compan', but would suddenly not find it anymore if you appended an 'i'. Of course typing in the whole word you'll find it again.

Since the results are shown instantaneously the user experience was a bit weird when the entry you were searching for appears, disappears and appears again as you type. We have therefore removed it as well, as our users know what they are searching for anyway - it's just meant to speed up the process of finding something for them (there are a lot of headings in the index).

Regarding the stemming issue: I don't exactly know how you could improve upon it, as you can't really know that 'compani' will be 'companies' when the whole word is typed and therefore can't stem it yet, can you?
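
The appear/disappear/reappear effect can be reproduced in miniature. The sketch below uses a toy suffix stripper (it only removes a trailing 'ing'), not lunr's stemmer, and it assumes the search stems the query and then matches it as a prefix of the stemmed index terms. The names toyStem and matches are hypothetical, and the word 'searching' is used instead of 'companies' because the toy rule cannot model that case.

```javascript
// Toy stemmer for illustration only: strips a trailing 'ing'.
function toyStem(word) {
  return word.replace(/ing$/, '')
}

// Index terms are stored in stemmed form: ['search'].
var indexTerms = ['searching'].map(toyStem)

// Assumed search behaviour: stem the query, then treat it as a prefix.
function matches(query) {
  var stemmed = toyStem(query)
  return indexTerms.some(function (term) {
    return term.indexOf(stemmed) === 0
  })
}

// Simulate the user typing the word one character at a time.
var typed = ['searc', 'search', 'searchi', 'searchin', 'searching']
var results = typed.map(matches)
// results: [true, true, false, false, true]
```

The partial inputs 'searchi' and 'searchin' are not yet recognisable to the stemmer, so they are neither stemmed nor a prefix of the stored term, and the entry vanishes until the whole word has been typed.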

olivernn · 2015-08-01T15:37:05Z

Yeah, I think the stemmer has caused issues like this for other people. I don't really have a good suggestion for fixing this, I'm afraid.


janeisklar · 2015-08-04T07:49:48Z

Please have another look at the changes I have committed.

olivernn · 2015-08-09T10:26:46Z

I've pushed your changes in the latest version of lunr. Thanks again for your help!

janeisklar added 2 commits July 29, 2015 15:28

don't add empty tokens (due to trimming) to the index to avoid errors…

fccd109

… while searching

release build 0.5.11-mp1

7db31ac

janeisklar added 2 commits August 4, 2015 08:45

Revert "release build 0.5.11-mp1"

0461ca1

This reverts commit 7db31ac.

removed empty string from the stop word filter

103fb43

olivernn closed this Aug 9, 2015
