don't add empty tokens (due to trimming) to the index to avoid errors while searching #166

Closed
wants to merge 4 commits into from

janeisklar
Contributor

The text 'test ???' will lead to the following tokens:
['test', '???']

After trimming this becomes:
['test', '']

The empty string then causes problems while searching the index. This fix avoids retaining the empty string after trimming.
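A minimal sketch of the problem, assuming the trimmer strips non-word characters from both ends of each token. The `trim` helper here is a simplified stand-in for illustration, not lunr's actual source:

```javascript
// Simplified stand-in for a trimmer (assumption, not lunr's code):
// strip leading and trailing non-word characters from a token.
function trim(token) {
  return token.replace(/^\W+/, '').replace(/\W+$/, '')
}

var tokens = ['test', '???']

// Trimming alone leaves an empty string behind for '???'.
var trimmed = tokens.map(trim)

// Dropping tokens that trimmed down to nothing keeps them out of the index.
var cleaned = trimmed.filter(function (token) {
  return token !== ''
})

console.log(trimmed) // [ 'test', '' ]
console.log(cleaned) // [ 'test' ]
```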

@janeisklar
Contributor Author

Please ignore the second commit. I'm not sure why it was auto-added to this pull request

@olivernn
Owner

olivernn commented Aug 1, 2015

Thanks for taking a look at this. Perhaps I'm missing something, but what is the issue here? I was unable to reproduce the error but maybe I'm not understanding what the issue is.

I did the following:

var idx = lunr(function () {
  this.field('foo')
})

idx.add({id: 1, foo: 'foo ???'})

idx.search('foo') // returns the right document
idx.search('???') // returns no documents

idx.tokenStore.length //= 1
idx.corpusTokens.toArray() //= ["foo"]

This is the behaviour I would expect. What behaviour did you expect and what are you seeing?

@janeisklar
Contributor Author

The empty token I described led to an error while searching, regardless of the input. I can't give you the exact error message as I'm not at work right now, but it was something like a null-pointer equivalent while iterating over the tokens.
It could be that we're only seeing this issue because we've removed both the stemming and stop word filter from the pipeline, giving us a different behaviour than what you are seeing.

@olivernn
Owner

olivernn commented Aug 1, 2015

Ah yes, that's probably it. The stop word filter rejects empty tokens, so removing it lets those empty tokens into the index, and then you get the issue you see.

It seems strange to have the stop word filter include the empty token; it's not really a word! I think it is safe to have that check in the trimmer and then not have it as part of the stop word filter. Would you mind updating this pull request with the following:

The language extensions also include the empty token in their stop word filters, but from looking at the code they also remove the trimmer function. I think this is fine though, because they cannot use the built-in trimmer: it is really focused on English and doesn't work well with non-ASCII characters.
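To illustrate the explanation above, here is a rough sketch using a hypothetical mini pipeline in which each function returns a token, or `undefined` to drop it. None of these helpers are lunr's real implementation, and the stop word list is made up; the point is only to show why the empty token disappears when a stop word filter is present and how the same check could live in the trimmer instead:

```javascript
// Hypothetical mini pipeline: run each token through every function in order.
// A function returning undefined drops the token.
function runPipeline(fns, tokens) {
  var out = []
  tokens.forEach(function (token) {
    for (var i = 0; i < fns.length && token !== undefined; i++) {
      token = fns[i](token)
    }
    if (token !== undefined) out.push(token)
  })
  return out
}

// Stand-in trimmer without an empty-token check.
function plainTrimmer(token) {
  return token.replace(/^\W+/, '').replace(/\W+$/, '')
}

// Stand-in stop word filter whose (made-up) list includes the empty string.
var stopWords = ['', 'a', 'the']
function stopWordFilter(token) {
  return stopWords.indexOf(token) === -1 ? token : undefined
}

// Proposed variant: the trimmer itself drops tokens that trim to nothing.
function checkingTrimmer(token) {
  var result = plainTrimmer(token)
  return result === '' ? undefined : result
}

var withFilter = runPipeline([plainTrimmer, stopWordFilter], ['foo', '???'])
var withoutFilter = runPipeline([plainTrimmer], ['foo', '???'])
var fixed = runPipeline([checkingTrimmer], ['foo', '???'])

console.log(withFilter)    // [ 'foo' ]
console.log(withoutFilter) // [ 'foo', '' ]
console.log(fixed)         // [ 'foo' ]
```

With the check in the trimmer, a pipeline that omits the stop word filter no longer lets the empty token through.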

@olivernn
Owner

olivernn commented Aug 1, 2015

Also, out of interest, what use case do you have where you need to remove the stemmer and stop word filter? Just trying to understand how you are using lunr and whether there is something lunr can do to make your implementation easier.

@janeisklar
Contributor Author

Ah okay, that makes sense. You'll receive another patch within the next couple of days.
We've removed the stop-word filter because we had trouble finding some words that it removed. For simplicity, let's just say that the application lets you assemble a book from pre-defined chapters that can be searched for using lunr. One section was named 'in-depth something' and lunr didn't find anything for 'in-', a lot for 'in-d', and I think you had to type in a lot more characters to actually limit the search result to 'in-depth'. For our use case I think we're better off without the stop word filter, as the search is not used for large documents, but rather short section headings.
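One possible explanation of the 'in-' behaviour, sketched under two assumptions that are not confirmed in this thread: that queries are split on whitespace and hyphens, and that 'in' is on the stop word list. The `queryTokens` helper and the short stop word list are hypothetical:

```javascript
// Assumed stop word list (illustrative only).
var stopWords = ['in', 'the', 'of']

// Hypothetical query tokenizer: lower-case, split on whitespace and hyphens,
// drop empty pieces, and optionally drop stop words.
function queryTokens(query, useStopWords) {
  return query
    .toLowerCase()
    .split(/[\s-]+/)
    .filter(function (token) {
      if (token === '') return false
      return !(useStopWords && stopWords.indexOf(token) !== -1)
    })
}

console.log(queryTokens('in-', true))      // []
console.log(queryTokens('in-d', true))     // [ 'd' ]
console.log(queryTokens('in-depth', true)) // [ 'depth' ]
console.log(queryTokens('in-', false))     // [ 'in' ]
```

Under these assumptions, 'in-' leaves nothing to search for, while 'in-d' searches only for 'd', which would match many headings.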

As for the stemmer, I believe the issue was with words like 'companies' where you'd find the entry when typing up to 'compan', but would suddenly not find it anymore if you appended an 'i'. Of course typing in the whole word you'll find it again.

Since the results are shown instantaneously the user experience was a bit weird when the entry you were searching for appears, disappears and appears again as you type. We have therefore removed it as well, as our users know what they are searching for anyway - it's just meant to speed up the process of finding something for them (there are a lot of headings in the index).

Regarding the stemming issue: I don't exactly know how you could improve upon it, as you can't really know that 'compani' will be 'companies' when the whole word is typed and therefore can't stem it yet, can you?

@olivernn
Owner

olivernn commented Aug 1, 2015

Yeah, I think the stemmer has caused issues like this for other people. I don't really have a good suggestion to fix this, I'm afraid.

@janeisklar
Contributor Author

Please have another look at the changes I have committed.

@olivernn
Owner

olivernn commented Aug 9, 2015

I've pushed your changes in the latest version of lunr. Thanks again for your help!

@olivernn olivernn closed this Aug 9, 2015