Searching words with ending wildcards returns inconsistent results #256

Closed
bsvensson opened this issue Apr 14, 2017 · 20 comments
Comments

@bsvensson

For example, on https://olivernn.github.io/moonwalkers/, both pilot and pilot* will return the results I expect. However, module* returns nothing (while module works as I expect).

@olivernn
Owner

This looks like a problem caused by stemming.

When a search is made without a wildcard the search terms are passed through the search pipeline, which by default includes a stemmer. When a wildcard is part of the search term lunr does not pass the term through the pipeline. This is because the term with wildcard might not be a full word and therefore stemming would give incorrect results.

The reason "module" shows this problem is that it is stemmed to "modul", while "pilot" is stemmed to "pilot". You can try this in the console on the demo site:

idx.pipeline.runString("pilot") //= ["pilot"]
idx.pipeline.runString("module") //= ["modul"]

So, when searching for "module*" lunr is looking in the index for anything beginning with "module", but there is nothing found, since all the "module" terms in the documents have actually been indexed as "modul". Again, you can inspect the index to see this:

idx.invertedIndex["modul"]

And trying to find the token in the set of known tokens:

idx.tokenSet.intersect(lunr.TokenSet.fromString("module*")).toArray() //= []
idx.tokenSet.intersect(lunr.TokenSet.fromString("modul*")).toArray() //= ["modul"]
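
To make the mismatch concrete, here is a minimal sketch that does not use lunr at all. It assumes a toy vocabulary of already-stemmed terms and a hypothetical matchTerm helper that treats a trailing * as "starts with". This is only a rough stand-in for how the token set is queried, not lunr's actual code:

```javascript
// Toy stand-in for the index vocabulary: terms as stored after stemming.
// (Assumption for illustration; a real lunr index holds far more.)
const indexedTerms = ["modul", "pilot"];

// Hypothetical helper: a trailing "*" means "starts with",
// anything else must match a stored term exactly.
function matchTerm(pattern, terms) {
  if (pattern.endsWith("*")) {
    const prefix = pattern.slice(0, -1);
    return terms.filter((term) => term.startsWith(prefix));
  }
  return terms.filter((term) => term === pattern);
}

console.log(matchTerm("module*", indexedTerms)); // []
console.log(matchTerm("modul*", indexedTerms)); // ["modul"]
console.log(matchTerm("pilot*", indexedTerms)); // ["pilot"]
```

The unstemmed prefix "module" is one character longer than the stored term "modul", so no stored term can start with it.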

I'm not sure the best way to handle this at the moment, I'll have to think about how wildcards and stemming interact before proposing a solution.

@bsvensson
Author

@olivernn - thanks for looking into this. (and thanks for lunr!)

I noticed this when we were trying to update https://developers.arcgis.com/javascript/latest/sample-code/ from lunr v.0.7.2 to v2.0.0. At 0.7.2 there wasn't a need for a wildcard - and we like that behavior.

I don't fully understand the stemming/index logic, but is there a way to get back the behavior from 0.7.2, where the user didn't need to add wildcards in order to get "starts with" functionality?

@olivernn
Owner

What I noticed from people's use of lunr is that, for typeahead-style search, the automatic wildcard could give nice results, as shown on your site. However, it would frequently cause unexpected results; just take a look through some of the closed issues.

I was thinking about what the best way to express a query for typeahead search might be, there obviously needs to be a component searching for the beginning of a string, but it should also look for exact matches. Perhaps also allow for some fuzzy matching too?

All of the above are possible with lunr. I would advise looking into the lunr.Index#query method; it is intended for building queries programmatically (it is used internally by lunr.Index#search).

Below is an example of what I was thinking for typeahead search:

idx.query(function (q) {
  // look for an exact match and apply a large positive boost
  q.term(queryTerm, { usePipeline: true, boost: 100 })

  // look for terms that match the beginning of this queryTerm and apply a medium boost
  q.term(queryTerm + "*", { usePipeline: false, boost: 10 })

  // look for terms that match with an edit distance of 2 and apply a small boost
  q.term(queryTerm, { usePipeline: false, editDistance: 2, boost: 1 })
})

The only slight wrinkle is having to manually append a wildcard to the query term, perhaps this should be an option, e.g. wildcard with the values trailing | leading | wrapped | none, I'll have a think about it.

You could express this within a query string and the search method like this if you want to try things out:

idx.search("${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2")

@IYCI

IYCI commented Apr 20, 2017

idx.search("${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2")
Running this gives me duplicate index error in console
Not sure If I'm using it correctly, but I get duplicate index every time I try to search with multiple terms and fuzzy matches at the same time. Example: id:${queryString}^10 (${queryString}~1)

@et1421

et1421 commented Apr 21, 2017

I'm also trying to search with wildcards. For example, I have a text that contains the word "notifications" and I would like the search to yield about the same number of results while the user is typing the term in the input field. At the moment, notific gives about 13 results, while if the user adds a letter and writes notifica, no results are returned.

I tried the idx.search("${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2") suggestion and it also gives me duplicate index error.

I also tried adding a wildcard before and after the value (*notifica*), thinking it would solve the problem, but it didn't change anything.

Am I missing something ?

@olivernn
Owner

There is definitely a bug that is causing the duplicate index error. It seems to be when a search query term is expanded into a term that is already being considered as part of the index. That said I still don't fully understand it as doing a search for foo foo foo does not trigger the bug.

I need to spend a bit of time trying to debug this and figure out what is going on.

@olivernn
Owner

I've pushed a change that should fix the "duplicate index" error, please try version 2.0.3 and let me know if there are any issues.

@IYCI

IYCI commented Apr 25, 2017

That fixed it, thanks a lot.

@olivernn
Owner

@IYCI glad that it's working for you now, thanks for taking the time to report the problem and test the fix.

@bsvensson does that search query I suggested work as expected for you?

@et1421

et1421 commented Apr 25, 2017

Thanks, it fixes the error.
It still doesn't return what I would expect though.

Here are the different queries followed by the number of results they return.

noti^100noti*^10noti~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
notif^100notif*^10notif~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
notifi^100notifi*^10notifi~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
notific^100notific*^10notific~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
notifica^100notifica*^10notifica~2 : []
notificat^100notificat*^10notificat~2 : []
notificati^100notificati*^10notificati~2 : []
notificatio^100notificatio*^10notificatio~2 : []
notification^100notification*^10notification~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
notifications^100notifications*^10notifications~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]

Isn't it odd that, partway through the word "notification", it returns no results?

@olivernn
Owner

@et1421 it's difficult to say without access to the dataset you are searching. Can you share the index?

You can see which tokens lunr is finding for those search terms manually:

idx.tokenSet.intersect(lunr.TokenSet.fromString("notification")).toArray() // for the exact match
idx.tokenSet.intersect(lunr.TokenSet.fromString("notification*")).toArray() // for the prefix search
idx.tokenSet.intersect(lunr.TokenSet.fromFuzzyString("notification", 2)).toArray() // for the fuzzy search

It might give you some clues as to why some of those are not matching, while others are. If I had to guess, I'd say that the fuzzy search is finding something that the prefix search is not, but only after a certain length. You could try reducing the amount of fuzz, perhaps 1 is a better value, or by removing it entirely.

It might be easier to understand by looking at what the fuzzy string is expanded to, which also gives you an idea of the impact it has on search performance:

lunr.TokenSet.fromFuzzyString("notification", 2).toArray()

I see that gets expanded into 1204 different strings; with an edit distance of 1 this is only 52:

lunr.TokenSet.fromFuzzyString("notification", 1).toArray()
[
  "*otifications",
  "*notifications",
  "otifications",
  "ontifications",
  "n*tifications",
  "n*otifications",
  "ntifications",
  "ntoifications",
  "no*ifications",
  "no*tifications",
  "noifications",
  "noitfications",
  "not*fications",
  "not*ifications",
  "notfications",
  "notfiications",
  "noti*ications",
  "noti*fications",
  "notiications",
  "notiifcations",
  "notif*cations",
  "notif*ications",
  "notifcations",
  "notifciations",
  "notifi*ations",
  "notifi*cations",
  "notifiations",
  "notifiactions",
  "notific*tions",
  "notific*ations",
  "notifictions",
  "notifictaions",
  "notifica*ions",
  "notifica*tions",
  "notificaions",
  "notificaitons",
  "notificat*ons",
  "notificat*ions",
  "notificatons",
  "notificatoins",
  "notificati*ns",
  "notificati*ons",
  "notificatins",
  "notificatinos",
  "notificatio*s",
  "notificatio*ns",
  "notificatios",
  "notificatiosn",
  "notification",
  "notification*",
  "notification*s",
  "notifications"
]
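
The 52 strings listed above appear to follow a regular pattern: for each character position there is a substitution, an insertion, a deletion and a transposition, plus the unchanged word. Note that the list looks like the expansion of the 13-letter "notifications" rather than "notification". The sketch below is a simplified reconstruction of that pattern, not lunr's implementation; expandOneEdit is a hypothetical name used only for illustration:

```javascript
// Simplified reconstruction of an edit-distance-1 expansion.
// Not lunr's implementation; "*" stands in for an unknown character.
function expandOneEdit(word) {
  const out = new Set([word]);
  for (let i = 0; i < word.length; i++) {
    const head = word.slice(0, i);
    const tail = word.slice(i + 1);
    out.add(head + "*" + tail); // substitution at position i
    out.add(head + "*" + word.slice(i)); // insertion before position i
    out.add(head + tail); // deletion of position i
    if (i < word.length - 1) {
      // transposition of positions i and i + 1
      out.add(head + word[i + 1] + word[i] + word.slice(i + 2));
    }
  }
  return [...out];
}

const expansions = expandOneEdit("notifications");
console.log(expansions.length); // 52
```

Each extra unit of edit distance applies the same kind of expansion again to every string produced so far, which is why a distance of 2 yields so many more candidates than a distance of 1.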

@bsvensson
Author

@olivernn - thank you for your quick replies. Your suggested query is working pretty well for us, so the original issue here is no longer an issue for us :)

We also see that the duplicate index is removed in 2.0.3.

We're still working on some finetuning for it, similar to @et1421's comments above.

@olivernn
Owner

You might find that the fuzzy search isn't required to get good typeahead search, it looks like it might cause some unexpected results. When you do settle on something that gives good results please do report back so others can benefit from your investigation.

If there is a good general approach it is something that lunr could support more directly, i.e. a specific method on lunr.Index for performing these typeahead queries.
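
Until something like that exists, one way to package the idea is a small wrapper around idx.query. This is a minimal sketch, assuming idx is a built lunr index; typeaheadSearch and its fuzzy option are hypothetical names, not part of lunr, and the clauses simply reuse the usePipeline, boost and editDistance options from the earlier suggestion in this thread:

```javascript
// Hypothetical helper, not part of lunr: wraps the typeahead clauses
// suggested earlier in this thread. Fuzzy matching is off by default.
function typeaheadSearch(idx, queryTerm, options = {}) {
  const { fuzzy = 0 } = options;
  return idx.query(function (q) {
    // exact match, run through the pipeline (and so stemmed), large boost
    q.term(queryTerm, { usePipeline: true, boost: 100 });

    // prefix match, not stemmed, medium boost
    q.term(queryTerm + "*", { usePipeline: false, boost: 10 });

    // optional fuzzy match, small boost
    if (fuzzy > 0) {
      q.term(queryTerm, { usePipeline: false, editDistance: fuzzy, boost: 1 });
    }
  });
}
```

Calling typeaheadSearch(idx, "modul") would run only the exact and prefix clauses, while typeaheadSearch(idx, "modul", { fuzzy: 1 }) adds the fuzzy clause with an edit distance of 1.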

@olivernn
Owner

olivernn commented May 2, 2017

@et1421 I'm going to close this issue now, if you can provide more details on the results you were seeing (specifically being able to provide the index) then please re-open this issue and I'll take a further look.

@olivernn olivernn closed this as completed May 2, 2017
@olivernn
Owner

@bsvensson I've created a pull request with a proposal for easier support for adding wildcards to programmatic queries. I would be interested to hear whether this would make the implementation of your use case any cleaner.

@larskendall

I've been having a problem with wildcard searches on version 2.1.3, which seems to be related to this issue. Wildcard searches will return results up to a point, and then stop returning them. For example, I'm building an index using the following documents...

var documents = [
  {id: 1, text: 'critical stuff'},
  {id: 2, text: 'test'}
];

If I run a search for critic*, I'll get back results. But a search for critica* or critical* will return nothing. Is this a bug in the library? Or is there some way of configuring it to get this working as expected?

@olivernn
Owner

@larskendall without seeing how you set up your index it's difficult to say for sure, but that looks like the result of stemming. "critical" stems to "critic"; you can test this with the following snippet:

idx.pipeline.runString("critical")

Further up in this thread there are a couple of suggestions for ways to express searches that will lead to the kind of results it seems you are expecting. An alternative is to disable the stemmer at build time and search time:

var idx = lunr(function() {
  //...snip...
  this.pipeline.remove(lunr.stemmer)
  this.searchPipeline.remove(lunr.stemmer)
  //...snip...
  // add documents here
})

@larskendall

@olivernn Looks like that did the trick! Thanks so much for the quick response!

@marcellocurto

idx.search("${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2")

Thanks! This worked well for me.

However it should be ` instead of ", like this:

idx.search(`${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2`)

@josh18

josh18 commented Aug 10, 2021

idx.search(`${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2`)

Note: I believe ^ may cause issues if queryTerm contains multiple words, as only the last word will get the increased priority. Does that sound right?
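
If that is the case, one possible workaround is to split the input on whitespace and apply the suffixes to each word separately. This is a minimal sketch; buildTypeaheadQuery is a hypothetical helper, and it assumes the words themselves contain no query syntax characters such as *, ^, ~ or :

```javascript
// Hypothetical helper: applies the boost, wildcard and fuzzy suffixes to
// every whitespace-separated word instead of to the input as a whole.
function buildTypeaheadQuery(input, fuzz = 2) {
  return input
    .trim()
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => `${word}^100 ${word}*^10 ${word}~${fuzz}`)
    .join(" ");
}

console.log(buildTypeaheadQuery("lunar module"));
// lunar^100 lunar*^10 lunar~2 module^100 module*^10 module~2
```

The resulting string could then be passed to idx.search in place of the single template literal.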
