Searching words with ending wildcards returns inconsistent results #256

bsvensson · 2017-04-14T17:51:43Z

For example, on https://olivernn.github.io/moonwalkers/, both pilot and pilot* will return the results I expect. However, module* returns nothing (while module works as I expect).


olivernn · 2017-04-18T18:17:51Z

This looks like a problem caused by stemming.

When a search is made without a wildcard, the search terms are passed through the search pipeline, which by default includes a stemmer. When a wildcard is part of the search term, lunr does not pass the term through the pipeline. This is because a term with a wildcard might not be a full word, and therefore stemming would give incorrect results.

The reason "module" shows this problem is that it is stemmed to "modul", while "pilot" is stemmed to "pilot". You can try this in the demo site console:

idx.pipeline.runString("pilot") //= ["pilot"]
idx.pipeline.runString("module") //= ["modul"]

So, when searching for "module*" lunr is looking in the index for anything beginning with "module", but there is nothing found, since all the "module" terms in the documents have actually been indexed as "modul". Again, you can inspect the index to see this:

idx.invertedIndex["modul"]

And trying to find the token in the set of known tokens:

idx.tokenSet.intersect(lunr.TokenSet.fromString("module*")).toArray() //= []
idx.tokenSet.intersect(lunr.TokenSet.fromString("modul*")).toArray() //= ["modul"]
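
To make the mismatch concrete without needing the demo index, here is a minimal, self-contained sketch. It does not use lunr at all: toyStem, buildTokens and search are hypothetical names, and toyStem is only a stand-in that happens to turn "module" into "modul", not the stemmer lunr ships with.

```javascript
// Toy stand-in for a stemmer: NOT the stemmer lunr uses, just enough to
// map "module" -> "modul" while leaving "pilot" unchanged.
function toyStem(word) {
  return word.replace(/e$/, "")
}

// Index time: every word is stemmed before it is stored.
function buildTokens(words) {
  return new Set(words.map(toyStem))
}

// Search time: plain terms are stemmed, wildcard terms are used verbatim.
function search(tokens, term) {
  if (term.endsWith("*")) {
    const prefix = term.slice(0, -1)
    return [...tokens].filter(t => t.startsWith(prefix))
  }
  const stemmed = toyStem(term)
  return tokens.has(stemmed) ? [stemmed] : []
}

const tokens = buildTokens(["pilot", "module"])
console.log(search(tokens, "module"))  // ["modul"]
console.log(search(tokens, "module*")) // []
console.log(search(tokens, "modul*"))  // ["modul"]
console.log(search(tokens, "pilot*"))  // ["pilot"]
```

The wildcard term skips the stemming step, so the prefix "module" is compared against the stored token "modul" and nothing matches.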

I'm not sure of the best way to handle this at the moment. I'll have to think about how wildcards and stemming interact before proposing a solution.

bsvensson · 2017-04-18T22:28:19Z

@olivernn - thanks for looking into this. (and thanks for lunr!)

I noticed this when we were trying to update https://developers.arcgis.com/javascript/latest/sample-code/ from lunr v0.7.2 to v2.0.0. At 0.7.2 there wasn't a need for a wildcard, and we like that behavior.

I don't fully understand the stemming/index logic, but is there a way to get back the behavior from 0.7.2, where the user didn't need to add wildcards in order to get "starts with" functionality?

olivernn · 2017-04-19T19:41:02Z

What I noticed from people's use of lunr is that, for typeahead-style search, the automatic wildcard could give nice results, as shown on your site. However, it would frequently cause unexpected results; just take a look through some of the closed issues.

I was thinking about what the best way to express a query for typeahead search might be. There obviously needs to be a component searching for the beginning of a string, but it should also look for exact matches. Perhaps also allow for some fuzzy matching too?

All of the above are possible with lunr. I would advise looking into the lunr.Index#query method; it is intended for building queries programmatically (it is used internally by lunr.Index#search).

Below is an example of what I was thinking for typeahead search:

idx.query(function (q) {
  // look for an exact match and apply a large positive boost
  q.term(queryTerm, { usePipeline: true, boost: 100 })

  // look for terms that match the beginning of this queryTerm and apply a medium boost
  q.term(queryTerm + "*", { usePipeline: false, boost: 10 })

  // look for terms that match with an edit distance of 2 and apply a small boost
  q.term(queryTerm, { usePipeline: false, editDistance: 2, boost: 1 })
})

The only slight wrinkle is having to manually append a wildcard to the query term. Perhaps this should be an option, e.g. wildcard with the values trailing | leading | wrapped | none. I'll have a think about it.

You could express this within a query string and the search method like this if you want to try things out:

idx.search("${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2")

IYCI · 2017-04-20T21:08:48Z

idx.search("${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2")
Running this gives me a duplicate index error in the console.
I'm not sure if I'm using it correctly, but I get a duplicate index error every time I try to search with multiple terms and fuzzy matches at the same time. Example: id:${queryString}^10 (${queryString}~1)

et1421 · 2017-04-21T13:11:31Z

I'm also trying to search with wildcards. For example, I have a text that contains the word "notifications", and I would like the search to yield about the same number of results while the user is typing the term in the input field. At the moment, notific gives about 13 results, while if the user adds a letter and types notifica, no results are returned.

I tried the idx.search("${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2") suggestion and it also gives me a duplicate index error.

I also tried adding a wildcard before and after the value (*notifica*), thinking it would solve the problem, but it didn't change anything.

Am I missing something ?

olivernn · 2017-04-22T09:19:46Z

There is definitely a bug that is causing the duplicate index error. It seems to be when a search query term is expanded into a term that is already being considered as part of the index. That said I still don't fully understand it as doing a search for foo foo foo does not trigger the bug.

I need to spend a bit of time trying to debug this and figure out what is going on.

olivernn · 2017-04-24T19:38:59Z

I've pushed a change that should fix the "duplicate index" error, please try version 2.0.3 and let me know if there are any issues.

IYCI · 2017-04-25T15:46:59Z

That fixed it, thanks a lot.

olivernn · 2017-04-25T16:50:44Z

@IYCI glad that it's working for you now, thanks for taking the time to report the problem and test the fix.

@bsvensson does that search query I suggested work as expected for you?

et1421 · 2017-04-25T18:49:22Z

Thanks, it fixes the error.
It still doesn't return what I would expect though.

Here are the different queries followed by the number of results they return.

noti^100noti*^10noti~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
notif^100notif*^10notif~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
notifi^100notifi*^10notifi~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
notific^100notific*^10notific~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
notifica^100notifica*^10notifica~2 : []
notificat^100notificat*^10notificat~2 : []
notificati^100notificati*^10notificati~2 : []
notificatio^100notificatio*^10notificatio~2 : []
notification^100notification*^10notification~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]
notifications^100notifications*^10notifications~2 : [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, Object]

It seems odd that it returns no results in the middle of the word "notification".

olivernn · 2017-04-25T19:41:52Z

@et1421 it's difficult to say without access to the dataset you are searching. Can you share the index?

You can see which tokens lunr is finding for those search terms manually:

idx.tokenSet.intersect(lunr.TokenSet.fromString("notification")).toArray() // for the exact match
idx.tokenSet.intersect(lunr.TokenSet.fromString("notification*")).toArray() // for the prefix search
idx.tokenSet.intersect(lunr.TokenSet.fromFuzzyString("notification", 2)).toArray() // for the fuzzy search

It might give you some clues as to why some of those are not matching, while others are. If I had to guess, I'd say that the fuzzy search is finding something that the prefix search is not, but only after a certain length. You could try reducing the amount of fuzz, perhaps 1 is a better value, or by removing it entirely.

It might be easier to understand by actually looking at what the fuzzy string is expanded to, which also gives you an idea of the impact it has on search performance:

lunr.TokenSet.fromFuzzyString("notification", 2).toArray()

I see that gets expanded into 1204 different strings; with an edit distance of 1 it is only 52:

lunr.TokenSet.fromFuzzyString("notification", 1).toArray()
[
  "*otifications",
  "*notifications",
  "otifications",
  "ontifications",
  "n*tifications",
  "n*otifications",
  "ntifications",
  "ntoifications",
  "no*ifications",
  "no*tifications",
  "noifications",
  "noitfications",
  "not*fications",
  "not*ifications",
  "notfications",
  "notfiications",
  "noti*ications",
  "noti*fications",
  "notiications",
  "notiifcations",
  "notif*cations",
  "notif*ications",
  "notifcations",
  "notifciations",
  "notifi*ations",
  "notifi*cations",
  "notifiations",
  "notifiactions",
  "notific*tions",
  "notific*ations",
  "notifictions",
  "notifictaions",
  "notifica*ions",
  "notifica*tions",
  "notificaions",
  "notificaitons",
  "notificat*ons",
  "notificat*ions",
  "notificatons",
  "notificatoins",
  "notificati*ns",
  "notificati*ons",
  "notificatins",
  "notificatinos",
  "notificatio*s",
  "notificatio*ns",
  "notificatios",
  "notificatiosn",
  "notification",
  "notification*",
  "notification*s",
  "notifications"
]
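
As a rough illustration of where a list like the one above comes from, here is a simplified model of an edit-distance-1 expansion. It is a sketch under my own assumptions, not lunr's implementation, and expandEditDistanceOne is a hypothetical name. For every position it adds a substitution and an insertion (both written with a "*" wildcard), a deletion, and a transposition with the following character.

```javascript
// Simplified model of edit-distance-1 expansion; a sketch, not lunr's code.
// "*" marks the wildcard used for substitutions and insertions.
function expandEditDistanceOne(term) {
  const out = new Set([term]) // the unmodified term is always included
  for (let i = 0; i < term.length; i++) {
    const head = term.slice(0, i)
    const tail = term.slice(i + 1)
    out.add(head + "*" + tail)          // substitution at position i
    out.add(head + "*" + term.slice(i)) // insertion before position i
    out.add(head + tail)                // deletion of position i
    if (i + 1 < term.length) {
      // transposition of positions i and i + 1
      out.add(head + term[i + 1] + term[i] + term.slice(i + 2))
    }
  }
  return [...out]
}

const expansions = expandEditDistanceOne("notifications")
console.log(expansions.length) // 52
```

For the 13-letter "notifications", which is the word the strings listed above are built from, this gives 13 substitutions, 13 insertions, 13 deletions, 12 transpositions and the term itself, so 52 strings. Every one of those then has to be intersected with the index, which is why a larger edit distance gets expensive quickly.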

bsvensson · 2017-04-26T01:18:07Z

@olivernn - thank you for your quick replies. Your suggested query is working pretty well for us, so the original issue here is no longer an issue for us :)

We also see that the duplicate index is removed in 2.0.3.

We're still working on some finetuning for it, similar to @et1421's comments above.

olivernn · 2017-04-26T07:12:56Z

You might find that the fuzzy search isn't required to get good typeahead search; it looks like it might cause some unexpected results. When you do settle on something that gives good results, please do report back so others can benefit from your investigation.

If there is a good general approach it is something that lunr could support more directly, i.e. a specific method on lunr.Index for performing these typeahead queries.

olivernn · 2017-05-02T16:45:15Z

@et1421 I'm going to close this issue now, if you can provide more details on the results you were seeing (specifically being able to provide the index) then please re-open this issue and I'll take a further look.

olivernn · 2017-05-11T19:26:25Z

@bsvensson I've created a pull request with a proposal for easier support for adding wildcards to programmatic queries. I would be interested to hear whether this would make the implementation of your use case any cleaner.
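
This thread does not show what the pull request contained. Later lunr 2.x releases document a wildcard clause option with constants under lunr.Query.wildcard, so, assuming that is what this proposal became, the typeahead query could be sketched roughly as below. typeaheadQuery is a hypothetical helper name, and lunr and idx are passed in so the sketch makes no assumption about how the library is loaded or how the index was built.

```javascript
// Sketch only: assumes a lunr version whose clauses accept a `wildcard`
// option (lunr.Query.wildcard.TRAILING). `typeaheadQuery` is a made-up name.
function typeaheadQuery(lunr, idx, queryTerm) {
  return idx.query(function (q) {
    // exact match, run through the pipeline, large boost
    q.term(queryTerm, { usePipeline: true, boost: 100 })

    // prefix match without manually appending "*", medium boost
    q.term(queryTerm, {
      usePipeline: false,
      wildcard: lunr.Query.wildcard.TRAILING,
      boost: 10
    })
  })
}
```

Usage would be something like typeaheadQuery(lunr, idx, "modul"). The fuzzy clause is left out here; it can be added back with editDistance if it turns out to help.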

larskendall · 2017-10-26T21:41:02Z

I've been having a problem with wildcard searches on version 2.1.3, which seems to be related to this issue. Wildcard searches will return results up to a point, and then stop returning them. For example, I'm building an index using the following documents...

var documents = [
  {id: 1, text: 'critical stuff'},
  {id: 2, text: 'test'}
];

If I run a search for critic*, I'll get back results. But a search for critica* or critical* will return nothing. Is this a bug in the library? Or is there some way of configuring it to get this working as expected?

olivernn · 2017-10-27T07:20:33Z

@larskendall without seeing how you set up your index it's difficult to say for sure, but that looks like the result of stemming. "critical" stems to "critic"; you can test this out with the following snippet:

idx.pipeline.runString("critical")

Further up in this thread there are a couple of suggestions for ways to express searches that will lead to the kind of results it seems you are expecting. An alternative is to disable the stemmer at build time and search time:

var idx = lunr(function() {
  //...snip...
  this.pipeline.remove(lunr.stemmer)
  this.searchPipeline.remove(lunr.stemmer)
  //...snip...
  // add documents here
})
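
For completeness, here is one way the snipped parts might be filled in for documents shaped like the two above (an id and a text field). This is a sketch, not the only way to configure the builder: buildUnstemmedIndex is a hypothetical helper name, and lunr is passed in as a parameter so nothing is assumed about how it is loaded.

```javascript
// Sketch: build an index with the stemmer removed from both pipelines.
// `buildUnstemmedIndex` is a made-up name; `lunr` is the library function.
function buildUnstemmedIndex(lunr, documents) {
  return lunr(function () {
    this.ref("id")     // field used to identify a document in results
    this.field("text") // field whose contents are indexed

    // no stemming when indexing, and none when searching
    this.pipeline.remove(lunr.stemmer)
    this.searchPipeline.remove(lunr.stemmer)

    // add documents after the pipelines are configured
    documents.forEach(function (doc) {
      this.add(doc)
    }, this)
  })
}
```

With the stemmer removed, "critical" should be indexed as "critical" rather than "critic", so critica* and critical* have something to match. The trade-off is that searches no longer treat different forms of the same word as equivalent.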

larskendall · 2017-10-27T21:11:17Z

@olivernn Looks like that did the trick! Thanks so much for the quick response!

marcellocurto · 2020-04-12T20:09:43Z

idx.search("${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2")

Thanks! This worked well for me.

However it should be ` instead of ", like this:

idx.search(`${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2`)

josh18 · 2021-08-10T10:53:38Z

idx.search(`${queryTerm}^100 ${queryTerm}*^10 ${queryTerm}~2`)

Note I believe ^ may cause issues if queryTerm is multiple words as only the last word will get increased priority, does that sound right?
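
If that is the case, one possible workaround is to build the query string word by word, so that every word gets its own boost, wildcard and fuzzy clause. The sketch below is plain string handling and does not call lunr; buildTypeaheadQueryString is a hypothetical name, and it assumes the input contains no characters that are special in lunr's query syntax (such as :, ^, ~, + or -).

```javascript
// Sketch: apply the exact / prefix / fuzzy pattern to each word separately.
// Assumes the words contain no lunr query syntax characters.
function buildTypeaheadQueryString(input) {
  return input
    .trim()
    .split(/\s+/)
    .filter(Boolean) // drops the empty string produced by blank input
    .map(word => `${word}^100 ${word}*^10 ${word}~2`)
    .join(" ")
}

console.log(buildTypeaheadQueryString("lunar module"))
// "lunar^100 lunar*^10 lunar~2 module^100 module*^10 module~2"
```

The result can then be passed to idx.search(...) as before.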

olivernn closed this as completed May 2, 2017

olivernn mentioned this issue Jun 3, 2017

Partial matching doesn't work after certain number of digits.. #273

Closed

savikko mentioned this issue Sep 1, 2017

Wildcards support in search mrvautin/openKB#218

Closed

ozobi mentioned this issue Jan 29, 2019

Update to lunrJS 2.3.5 matcornic/hugo-theme-learn#232

Closed

ozobi mentioned this issue Mar 19, 2019

Various Updates matcornic/hugo-theme-learn#237

Merged

wilhelmer mentioned this issue Mar 9, 2020

🎉 Material 5 Beta 3 squidfunk/mkdocs-material#1483

Closed
