Searching words with ending wildcards returns inconsistent results #256
Comments
This looks like a problem caused by stemming. When a search is made without a wildcard, the search terms are passed through the search pipeline, which by default includes a stemmer. When a wildcard is part of the search term, lunr does not pass the term through the pipeline. This is because a term with a wildcard might not be a full word, and therefore stemming would give incorrect results. The reason "module" shows this problem is that it is stemmed to "modul", while "pilot" is stemmed to "pilot". You can try this in the demo site console:
idx.pipeline.runString("pilot") //= ["pilot"]
idx.pipeline.runString("module") //= ["modul"]
So, when searching for "module*", lunr is looking in the index for anything beginning with "module", but nothing is found, since all the "module" terms in the documents have actually been indexed as "modul". Again, you can inspect the index to see this:
idx.invertedIndex["modul"]
And trying to find the token in the set of known tokens:
idx.tokenSet.intersect(lunr.TokenSet.fromString("module*")).toArray() //= []
idx.tokenSet.intersect(lunr.TokenSet.fromString("modul*")).toArray() //= ["modul"]
I'm not sure of the best way to handle this at the moment; I'll have to think about how wildcards and stemming interact before proposing a solution. |
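To make the mismatch concrete, here is a toy model of the two search paths described above. It is not lunr code: the stems are hard-coded from the values quoted in this comment, and `toyStem`, `indexedTokens`, and `toySearch` are made-up names used only for illustration.

```javascript
// Toy model only. The stems come from the values quoted in this thread and
// the "index" is a plain array, not lunr's real data structures.
const STEMS = new Map([
  ["module", "modul"],
  ["pilot", "pilot"],
]);

// Hypothetical stand-in for the stemmer step of the pipeline.
const toyStem = (word) => (STEMS.has(word) ? STEMS.get(word) : word);

// Index time: every word goes through the stemmer before it is stored.
const indexedTokens = ["module", "pilot"].map(toyStem);

// Search time: a term with a trailing wildcard skips the stemmer and is
// matched as a prefix; a plain term is stemmed and matched exactly.
function toySearch(term) {
  if (term.endsWith("*")) {
    const prefix = term.slice(0, -1);
    return indexedTokens.filter((token) => token.startsWith(prefix));
  }
  const stemmed = toyStem(term);
  return indexedTokens.filter((token) => token === stemmed);
}
```

In this model, `toySearch("module*")` comes back empty while `toySearch("modul*")` finds the stored stem, which mirrors what the token set inspection shows.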
@olivernn - thanks for looking into this. (and thanks for lunr!) I noticed this when we were trying to update https://developers.arcgis.com/javascript/latest/sample-code/ from lunr v0.7.2 to v2.0.0. At 0.7.2 there wasn't a need for a wildcard, and we like that behavior. I don't fully understand the stemming/index logic, but is there a way to get back the 0.7.2 behavior, where the user didn't need to add wildcards to get "starts with" functionality? |
What I noticed from people's use of lunr is that, for typeahead-style search, the automatic wildcard could give nice results, as shown on your site. However, it would frequently cause unexpected results; just take a look through some of the closed issues. I was thinking about what the best way to express a query for typeahead search might be. There obviously needs to be a component searching for the beginning of a string, but it should also look for exact matches. Perhaps also allow for some fuzzy matching too? All of the above are possible with lunr. I would advise looking into the
Below is an example of what I was thinking for typeahead search:
idx.query(function (q) {
  // look for an exact match and apply a large positive boost
  q.term(queryTerm, { usePipeline: true, boost: 100 })
  // look for terms that match the beginning of this queryTerm and apply a medium boost
  q.term(queryTerm + "*", { usePipeline: false, boost: 10 })
  // look for terms that match with an edit distance of 2 and apply a small boost
  q.term(queryTerm, { usePipeline: false, editDistance: 2, boost: 1 })
})
The only slight wrinkle is having to manually append a wildcard to the query term; perhaps this should be an option, e.g.
You could express this within a query string and the
idx.search("${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2") |
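As a small sketch of the query-string form, the helper below assembles the three clauses (exact match, prefix, fuzzy) into one string. `buildTypeaheadQuery` is a hypothetical name, not part of lunr, and it assumes `queryTerm` is a single word with no characters that the query syntax treats specially. It uses a template literal (backticks) so that `${queryTerm}` is actually interpolated.

```javascript
// Hypothetical helper, not a lunr API: builds the three-clause typeahead
// query string suggested above.
//   term^100  -> exact match, large boost
//   term*^10  -> prefix match, medium boost
//   term~2    -> fuzzy match with an edit distance of 2
// Assumes queryTerm is one word without query-syntax characters.
function buildTypeaheadQuery(queryTerm) {
  // Backticks make this a template literal, so ${queryTerm} is substituted.
  return `${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2`;
}
```

The result would then be passed to `idx.search(...)`, assuming `idx` is an index built with lunr 2.x.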
I'm also trying to search with wildcards. For example, I have a text that contains the word "notifications" and I would like the search to yield about the same number of results while the user is writing the term in the input field. At the moment, I tried the I also tried to add a wildcard before and after the value Am I missing something ? |
There is definitely a bug that is causing the I need to spend a bit of time trying to debug this and figure out what is going on. |
I've pushed a change that should fix the "duplicate index" error, please try version 2.0.3 and let me know if there are any issues. |
That fixed it, thanks a lot. |
@IYCI glad that it's working for you now, thanks for taking the time to report the problem and for testing the fix. @bsvensson does that search query I suggested work as expected for you? |
Thanks, it fixes the error. Here are the different queries followed by the number of results they return.
Isn't it odd that, in the middle of the word "notification", it returns no results? |
@et1421 it's difficult to say without access to the dataset you are searching. Can you share the index? You can manually check which tokens lunr is finding for those search terms:
idx.tokenSet.intersect(lunr.TokenSet.fromString("notification")) // for the exact match
idx.tokenSet.intersect(lunr.TokenSet.fromString("notification*")) // for the prefix search
idx.tokenSet.intersect(lunr.TokenSet.fromFuzzyString("notification", 2)) // for the fuzzy search
It might give you some clues as to why some of those are not matching while others are. If I had to guess, I'd say that the fuzzy search is finding something that the prefix search is not, but only after a certain length. You could try reducing the amount of fuzz (perhaps 1 is a better value) or removing it entirely. It might be easier to understand by actually looking at what the fuzzy string is expanded to, which also gives you an idea of the impact it has on search performance:
lunr.TokenSet.fromFuzzyString("notification", 2).toArray()
I see that this gets expanded into 1204 different strings; with an edit distance of 1 it is only 52:
|
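For readers unfamiliar with the term, the sketch below shows what an edit distance between a typed term and a stored token means, using the textbook Levenshtein definition. This is an illustration only: lunr builds token sets rather than comparing strings pairwise, it may count edits differently (for example, transpositions), and the stored token may be a stem rather than the full word. `editDistance` and `withinFuzz` are hypothetical names.

```javascript
// Textbook Levenshtein distance: the minimum number of single-character
// insertions, deletions, or substitutions needed to turn `a` into `b`.
// Illustrative only; this is not how lunr implements fuzzy matching.
function editDistance(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // turning i chars of a into "" takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // delete a[i-1]
        curr[j - 1] + 1, // insert b[j-1]
        prev[j - 1] + cost // substitute, or keep when the chars match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: would `token` fall within `maxEdits` of `query`?
const withinFuzz = (query, token, maxEdits) =>
  editDistance(query, token) <= maxEdits;
```

Under this definition a half-typed word sits many edits away from the full word, so a fuzzy clause with a small edit distance only starts to help once the typed term is nearly complete.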
You might find that the fuzzy search isn't required to get good typeahead search, it looks like it might cause some unexpected results. When you do settle on something that gives good results please do report back so others can benefit from your investigation. If there is a good general approach it is something that lunr could support more directly, i.e. a specific method on |
@et1421 I'm going to close this issue now, if you can provide more details on the results you were seeing (specifically being able to provide the index) then please re-open this issue and I'll take a further look. |
@bsvensson I've created a pull request with a proposal for easier support for adding wildcards to programmatic queries. I would be interested to hear if this would make the implementation of your use case any cleaner. |
I've been having a problem with wildcard searches on version 2.1.3, which seems to be related to this issue. Wildcard searches will return results up to a point, and then stop returning them. For example, I'm building an index using the following documents...
var documents = [
  {id: 1, text: 'critical stuff'},
  {id: 2, text: 'test'}
];
If I run a search for |
@larskendall without seeing how you set up your index it's difficult to say for sure, but that looks like the result of stemming. "critical" stems to "critic"; you can test this out with the following snippet:
idx.pipeline.runString("critical")
Further up in this thread there are a couple of suggestions for ways to express searches that will lead to the kind of results it seems you are expecting. An alternative is to disable the stemmer at build time and search time:
var idx = lunr(function() {
  //...snip...
  this.pipeline.remove(lunr.stemmer)
  this.searchPipeline.remove(lunr.stemmer)
  //...snip...
  // add documents here
}) |
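The "results up to a point, then nothing" pattern can be reproduced with a toy prefix check, sketched below. It is not lunr code: the token lists are assumptions ("critic" is the stem reported above, "critical" is what would be stored with the stemmer removed), and `matchingPrefixes` is a hypothetical helper.

```javascript
// Tokens as they might be stored with and without stemming. "critic" is the
// stem quoted above; the lists themselves are illustrative assumptions.
const stemmedTokens = ["critic"];
const unstemmedTokens = ["critical"];

// For each prefix of `word` (1 char up to the full word), record the
// trailing-wildcard query that would match at least one stored token.
function matchingPrefixes(word, tokens) {
  const hits = [];
  for (let len = 1; len <= word.length; len++) {
    const prefix = word.slice(0, len);
    if (tokens.some((token) => token.startsWith(prefix))) {
      hits.push(prefix + "*");
    }
  }
  return hits;
}
```

Against the stemmed list, the matches for "critical" stop at `critic*`, since `critica*` and `critical*` are longer than the stored stem. Against the unstemmed list, every prefix up to `critical*` matches, which is the effect of removing the stemmer from both pipelines.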
@olivernn Looks like that did the trick! Thanks so much for the quick response! |
Thanks! This worked well for me. However it should be ` instead of ", like this:
|
Note I believe ^ may cause issues if |
For example, on https://olivernn.github.io/moonwalkers/, both pilot and pilot* will return the results I expect. However, module* returns nothing (while module works as I expect).