Wildcard search not available for content field #379

Open
XueSheng-GIT opened this issue Jul 13, 2024 · 4 comments

XueSheng-GIT commented Jul 13, 2024

Description
When searching, only the fields title and share_names.user are considered for wildcard search. It's not possible to use wildcard search on the content of files. This makes things hard to find, especially in languages like German where many words are joined into compounds (in my example the file contains the word "Barbarenfreunde" and I'm searching for "Freunde").
In addition, the current wildcard search always adds a fixed leading and trailing * (a search for freunde only looks for *freunde* in title and share_names). It's not possible to place the available Elasticsearch wildcards * and ? yourself.
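
To illustrate why the search misses the file, here is a minimal Python sketch. It assumes the content field is analyzed into lowercase whole-word tokens and that a wildcard pattern is compared against each token individually. The `tokenize` helper is a simplified stand-in, not Elasticsearch's actual analyzer, and `fnmatchcase` only approximates the `*` and `?` wildcard semantics.

```python
import re
from fnmatch import fnmatchcase

TEXT = "daraufhin schickte Barbara ihre Barbarenfreunde zum Barbarenbartbarbier"


def tokenize(text: str) -> list[str]:
    """Simplified stand-in for an analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def term_match(tokens: list[str], term: str) -> bool:
    """Whole-token comparison: the term must equal one of the tokens."""
    return term.lower() in tokens


def wildcard_match(tokens: list[str], pattern: str) -> bool:
    """Compare a pattern containing * and ? against each token individually."""
    return any(fnmatchcase(token, pattern.lower()) for token in tokens)


tokens = tokenize(TEXT)
print(term_match(tokens, "Freunde"))       # False: "freunde" is not a token on its own
print(wildcard_match(tokens, "*freunde"))  # True: matches "barbarenfreunde"
print(wildcard_match(tokens, "barbar?"))   # True: matches "barbara"
```

Under these assumptions, a compound such as "Barbarenfreunde" can only be reached through a leading wildcard, which is why the missing wildcard on the content field matters here.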

Steps to reproduce:

  1. Create a new markdown file (keep the default filename; it should not contain any text from the content below).
  2. Add the following content:

Aber die Barbaren waren stark behaart
und hatten alle einen struppigen Barbarenbart (gar nicht apart),
daraufhin schickte Barbara
ihre Barbarenfreunde zum Barbarenbartbarbier

  3. Close the file and let Nextcloud index its content.
  4. Open the search in the Nextcloud web interface and enter one of the following terms:

Freunde
*Freunde

Search query is shown below (at the bottom of this issue).

Expected behaviour
Search result should show the above created file.

Actual behaviour
Search result does not show the above created file.

System details
OS: Ubuntu 22.04 LTS
Nextcloud: 29.0.3
Elasticsearch: 8.14.2
Fulltextsearch: 29.0.0
Fulltextsearch_Elasticsearch: 29.0.1
Files_Fulltextsearch: 29.0.0

Search query created by nextcloud:

{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            {
              "match_phrase_prefix": {
                "content": "*freunde"
              }
            },
            {
              "match_phrase_prefix": {
                "title": "*freunde"
              }
            },
            {
              "match_phrase_prefix": {
                "share_names.admin": "*freunde"
              }
            },
            {
              "wildcard": {
                "title": "**freunde*"
              }
            },
            {
              "wildcard": {
                "share_names.admin": "**freunde*"
              }
            },
            {
              "query_string": {
                "fields": [
                  "parts.comments"
                ],
                "query": "*freunde"
              }
            }
          ]
        }
      },
      "filter": [
        {
          "bool": {
            "must": {
              "term": {
                "provider": "files"
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "owner.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "__all"
                }
              },
              {
                "term": {
                  "groups.keyword": "admin"
                }
              },
              {
                "term": {
                  "groups.keyword": "beta"
                }
              },
              {
                "term": {
                  "groups.keyword": "home"
                }
              },
              {
                "term": {
                  "circles.keyword": "B1RPHEMEhjLcEloE7GzQvqyM3UJltkl"
                }
              },
              {
                "term": {
                  "circles.keyword": "TcR2hjPVaFv4uYkUlCO8p1MzhFlwcf4"
                }
              },
              {
                "term": {
                  "circles.keyword": "cvMawK84jxklszzOCcOb538nTKns2Yf"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": []
          }
        },
        {
          "bool": {
            "must": []
          }
        },
        {
          "bool": {
            "must": []
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {},
      "parts.comments": {}
    },
    "pre_tags": [
      ""
    ],
    "post_tags": [
      ""
    ]
  }
}

XueSheng-GIT (Author) commented

At first glance, it seems wildcard queries (especially those with a leading wildcard) are not recommended because of potentially slow search performance. The ngram/edge_ngram tokenizer seems to be preferred for this case.
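
As a rough illustration of the n-gram idea, the following Python sketch generates the sub-tokens that would be indexed for a single word. The `min_gram` and `max_gram` values (3 and 7) are picked for this example only and are not taken from any existing mapping. The functions model the general technique, not Elasticsearch's tokenizer implementation.

```python
def ngrams(token: str, min_gram: int, max_gram: int) -> set[str]:
    """All substrings of token with a length between min_gram and max_gram."""
    return {
        token[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(token) - size + 1)
    }


def edge_ngrams(token: str, min_gram: int, max_gram: int) -> set[str]:
    """Only the prefixes of token with a length between min_gram and max_gram."""
    return {token[:size] for size in range(min_gram, min(max_gram, len(token)) + 1)}


token = "barbarenfreunde"
print("freunde" in ngrams(token, 3, 7))       # True: inner substrings are indexed
print("freunde" in edge_ngrams(token, 3, 7))  # False: only prefixes such as "barbare"
print("barbar" in edge_ngrams(token, 3, 7))   # True
```

Since edge n-grams are anchored at the start of a token, the "Freunde" case from the original post would need plain n-grams, which produce considerably more indexed terms per word.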

To keep things simple, I first looked into how to get the content field added to the search query and how to let users define the wildcards themselves. Changing the tokenizer (which would require re-indexing and an adapted search query) should imho be a long-term goal.

1. Add content field to search query (as wildcard)

Wildcard fields seem to be added here.

Adding the content field to this function seems to do the trick:

diff --git a/lib/Service/SearchService.php b/lib/Service/SearchService.php
index 333dfba..d2f62ec 100644
--- a/lib/Service/SearchService.php
+++ b/lib/Service/SearchService.php
@@ -128,6 +128,7 @@ private function searchQueryShareNames(ISearchRequest $request) {
 		$request->addField('share_names.' . $username);
 
 		$request->addWildcardField('title');
+		$request->addWildcardField('content');
 		$request->addWildcardField('share_names.' . $username);
 	}

2. Respect wildcards entered in search field

Predefined wildcards seem to be added here.

The following change checks for existing wildcards and, if any are present, does not add further ones.

diff --git a/lib/Service/SearchMappingService.php b/lib/Service/SearchMappingService.php
index f24c2abf..b8b42ac0 100644
--- a/lib/Service/SearchMappingService.php
+++ b/lib/Service/SearchMappingService.php
@@ -274,8 +274,13 @@ private function generateQueryContentFields(ISearchRequest $request, QueryConten
 		}
 
 		foreach ($request->getWildcardFields() as $field) {
+			$word = $content->getWord();
 			if (!$this->fieldIsOutLimit($request, $field)) {
-				$queryFields[] = ['wildcard' => [$field => '*' . $content->getWord() . '*']];
+				if (strpos($word, '*') !== false || strpos($word, '?') !== false) {
+					$queryFields[] = ['wildcard' => [$field => $word]];
+				} else {
+					$queryFields[] = ['wildcard' => [$field => '*' . $word . '*']];
+				}
 			}
 		}
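
The decision made in this patch can be restated as a small standalone function. This is a Python paraphrase of the PHP diff above for illustration only; the function names are made up and not part of the app.

```python
def unpatched_pattern(word: str) -> str:
    """Behaviour before the patch: always wrap the word in a leading and trailing *."""
    return "*" + word + "*"


def patched_pattern(word: str) -> str:
    """Behaviour after the patch: keep user-supplied wildcards, else wrap in *...*."""
    if "*" in word or "?" in word:
        return word
    return "*" + word + "*"


for word in ["freunde", "*freunde", "barbar*", "??ruppigen"]:
    print(word, "->", unpatched_pattern(word), "|", patched_pattern(word))
# freunde -> *freunde* | *freunde*
# *freunde -> **freunde* | *freunde
# barbar* -> *barbar** | barbar*
# ??ruppigen -> *??ruppigen* | ??ruppigen
```

The second line of output corresponds to the `**freunde*` pattern visible in the query from the original post.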

After applying these changes, files are found as expected (the issue from the original post is solved). I tried this on three instances I'm running and wasn't able to notice any practical performance impact (of course that's not representative in any way 😉... especially as I didn't mention any details about the size of the indexes involved).

I'm quite sure that wildcard search was working for the content field a couple of years ago (at least I created some personal documentation with wildcard search examples which stopped working at some point).
Thus, it is possible that this function was disabled intentionally. It could also be that it was disabled by accident. If the performance impact is a general concern, wildcard search within the content field could be made optional.

@R0Wi Do you have any insights in this regard? Any suggestions or alternative approaches for solving this issue?

R0Wi (Member) commented Jul 14, 2024

Hey @XueSheng-GIT, thanks for the comprehensive insights - really impressive 👍 Unfortunately, I don't have much historical knowledge about the content field being removed from the wildcard search. But I also remember that this was possible in earlier versions, so for advanced users this will definitely be helpful. We might want to keep @ArtificialOwl in the loop, maybe he has some more info for us.

From my point of view you did pretty thorough research and the technical solution looks good to me. Maybe we could think about making the wildcard search in content configurable via settings to avoid any performance bottlenecks for users/admins who don't want to use this feature? Also, in your initial post you provided the full JSON body being created by the app. I'd be interested in what this body looks like now, after applying your adjustments. Maybe you could give us some examples here as well?
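
One way to picture the configurable variant is a flag that decides whether content joins the list of wildcard fields. The sketch below is Python for illustration only; `content_wildcard_enabled` is a hypothetical setting name that does not exist in the app, and the field order simply mirrors the patched diff.

```python
def wildcard_fields(username: str, content_wildcard_enabled: bool) -> list[str]:
    """Return the fields that receive a wildcard clause for the given user."""
    fields = ["title"]
    if content_wildcard_enabled:
        # Only added when the (hypothetical) setting is switched on.
        fields.append("content")
    fields.append(f"share_names.{username}")
    return fields


print(wildcard_fields("admin", content_wildcard_enabled=False))
# ['title', 'share_names.admin']
print(wildcard_fields("admin", content_wildcard_enabled=True))
# ['title', 'content', 'share_names.admin']
```

With the flag off, the generated query would stay identical to the current behaviour, so admins who are worried about performance would not be affected.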

XueSheng-GIT (Author) commented

@R0Wi thanks for your quick reply!
Some examples of the updated JSON body after the patches are applied (#379 (comment)). All of these examples match the initially mentioned example, and the related file is presented as a result. This is not the case for the default (unpatched) fulltextsearch.

1. Search term: Freunde

  • Additional wildcard for content field
  • default wildcards added because no wildcard was defined in search term
JSON body:
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            {
              "match_phrase_prefix": {
                "content": "freunde"
              }
            },
            {
              "match_phrase_prefix": {
                "title": "freunde"
              }
            },
            {
              "match_phrase_prefix": {
                "share_names.admin": "freunde"
              }
            },
            {
              "wildcard": {
                "title": "*freunde*"
              }
            },
            {
              "wildcard": {
                "content": "*freunde*"
              }
            },
            {
              "wildcard": {
                "share_names.admin": "*freunde*"
              }
            },
            {
              "query_string": {
                "fields": [
                  "parts.comments"
                ],
                "query": "freunde"
              }
            }
          ]
        }
      },
      "filter": [
        {
          "bool": {
            "must": {
              "term": {
                "provider": "files"
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "owner.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "__all"
                }
              },
              {
                "term": {
                  "groups.keyword": "admin"
                }
              },
              {
                "term": {
                  "groups.keyword": "beta"
                }
              },
              {
                "term": {
                  "groups.keyword": "home"
                }
              },
              {
                "term": {
                  "circles.keyword": "B1RPHEMEhjLcEloE7GzQvqyM3UJltkl"
                }
              },
              {
                "term": {
                  "circles.keyword": "TcR2hjPVaFv4uYkUlCO8p1MzhFlwcf4"
                }
              },
              {
                "term": {
                  "circles.keyword": "cvMawK84jxklszzOCcOb538nTKns2Yf"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": []
          }
        },
        {
          "bool": {
            "must": []
          }
        },
        {
          "bool": {
            "must": []
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {},
      "parts.comments": {}
    },
    "pre_tags": [
      ""
    ],
    "post_tags": [
      ""
    ]
  }
}

2. Search term: *Freunde

  • Additional wildcard for content field
  • no default wildcard added because it was defined as part of search term
JSON body:
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            {
              "match_phrase_prefix": {
                "content": "*freunde"
              }
            },
            {
              "match_phrase_prefix": {
                "title": "*freunde"
              }
            },
            {
              "match_phrase_prefix": {
                "share_names.admin": "*freunde"
              }
            },
            {
              "wildcard": {
                "title": "*freunde"
              }
            },
            {
              "wildcard": {
                "content": "*freunde"
              }
            },
            {
              "wildcard": {
                "share_names.admin": "*freunde"
              }
            },
            {
              "query_string": {
                "fields": [
                  "parts.comments"
                ],
                "query": "*freunde"
              }
            }
          ]
        }
      },
      "filter": [
        {
          "bool": {
            "must": {
              "term": {
                "provider": "files"
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "owner.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "__all"
                }
              },
              {
                "term": {
                  "groups.keyword": "admin"
                }
              },
              {
                "term": {
                  "groups.keyword": "beta"
                }
              },
              {
                "term": {
                  "groups.keyword": "home"
                }
              },
              {
                "term": {
                  "circles.keyword": "B1RPHEMEhjLcEloE7GzQvqyM3UJltkl"
                }
              },
              {
                "term": {
                  "circles.keyword": "TcR2hjPVaFv4uYkUlCO8p1MzhFlwcf4"
                }
              },
              {
                "term": {
                  "circles.keyword": "cvMawK84jxklszzOCcOb538nTKns2Yf"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": []
          }
        },
        {
          "bool": {
            "must": []
          }
        },
        {
          "bool": {
            "must": []
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {},
      "parts.comments": {}
    },
    "pre_tags": [
      ""
    ],
    "post_tags": [
      ""
    ]
  }
}

3. Search term: +*ruppigen +Barbar*

  • combination of manual wildcard with OPTION_MUST
JSON body:
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  {
                    "match_phrase_prefix": {
                      "content": "*ruppigen"
                    }
                  },
                  {
                    "match_phrase_prefix": {
                      "title": "*ruppigen"
                    }
                  },
                  {
                    "match_phrase_prefix": {
                      "share_names.admin": "*ruppigen"
                    }
                  },
                  {
                    "wildcard": {
                      "title": "*ruppigen"
                    }
                  },
                  {
                    "wildcard": {
                      "content": "*ruppigen"
                    }
                  },
                  {
                    "wildcard": {
                      "share_names.admin": "*ruppigen"
                    }
                  },
                  {
                    "query_string": {
                      "fields": [
                        "parts.comments"
                      ],
                      "query": "*ruppigen"
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "match_phrase_prefix": {
                      "content": "barbar*"
                    }
                  },
                  {
                    "match_phrase_prefix": {
                      "title": "barbar*"
                    }
                  },
                  {
                    "match_phrase_prefix": {
                      "share_names.admin": "barbar*"
                    }
                  },
                  {
                    "wildcard": {
                      "title": "barbar*"
                    }
                  },
                  {
                    "wildcard": {
                      "content": "barbar*"
                    }
                  },
                  {
                    "wildcard": {
                      "share_names.admin": "barbar*"
                    }
                  },
                  {
                    "query_string": {
                      "fields": [
                        "parts.comments"
                      ],
                      "query": "barbar*"
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "filter": [
        {
          "bool": {
            "must": {
              "term": {
                "provider": "files"
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "owner.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "__all"
                }
              },
              {
                "term": {
                  "groups.keyword": "admin"
                }
              },
              {
                "term": {
                  "groups.keyword": "beta"
                }
              },
              {
                "term": {
                  "groups.keyword": "home"
                }
              },
              {
                "term": {
                  "circles.keyword": "B1RPHEMEhjLcEloE7GzQvqyM3UJltkl"
                }
              },
              {
                "term": {
                  "circles.keyword": "TcR2hjPVaFv4uYkUlCO8p1MzhFlwcf4"
                }
              },
              {
                "term": {
                  "circles.keyword": "cvMawK84jxklszzOCcOb538nTKns2Yf"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": []
          }
        },
        {
          "bool": {
            "must": []
          }
        },
        {
          "bool": {
            "must": []
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {},
      "parts.comments": {}
    },
    "pre_tags": [
      ""
    ],
    "post_tags": [
      ""
    ]
  }
}

4. Search term: +"Barbaren waren" +??ruppigen

  • combination of different manual wildcards, quoted text block and OPTION_MUST.
JSON body:
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  {
                    "match_phrase_prefix": {
                      "content": "barbaren waren"
                    }
                  },
                  {
                    "match_phrase_prefix": {
                      "title": "barbaren waren"
                    }
                  },
                  {
                    "match_phrase_prefix": {
                      "share_names.admin": "barbaren waren"
                    }
                  },
                  {
                    "wildcard": {
                      "title": "*barbaren waren*"
                    }
                  },
                  {
                    "wildcard": {
                      "content": "*barbaren waren*"
                    }
                  },
                  {
                    "wildcard": {
                      "share_names.admin": "*barbaren waren*"
                    }
                  },
                  {
                    "query_string": {
                      "fields": [
                        "parts.comments"
                      ],
                      "query": "barbaren waren"
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "match_phrase_prefix": {
                      "content": "??ruppigen"
                    }
                  },
                  {
                    "match_phrase_prefix": {
                      "title": "??ruppigen"
                    }
                  },
                  {
                    "match_phrase_prefix": {
                      "share_names.admin": "??ruppigen"
                    }
                  },
                  {
                    "wildcard": {
                      "title": "??ruppigen"
                    }
                  },
                  {
                    "wildcard": {
                      "content": "??ruppigen"
                    }
                  },
                  {
                    "wildcard": {
                      "share_names.admin": "??ruppigen"
                    }
                  },
                  {
                    "query_string": {
                      "fields": [
                        "parts.comments"
                      ],
                      "query": "??ruppigen"
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "filter": [
        {
          "bool": {
            "must": {
              "term": {
                "provider": "files"
              }
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "owner.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "admin"
                }
              },
              {
                "term": {
                  "users.keyword": "__all"
                }
              },
              {
                "term": {
                  "groups.keyword": "admin"
                }
              },
              {
                "term": {
                  "groups.keyword": "beta"
                }
              },
              {
                "term": {
                  "groups.keyword": "home"
                }
              },
              {
                "term": {
                  "circles.keyword": "B1RPHEMEhjLcEloE7GzQvqyM3UJltkl"
                }
              },
              {
                "term": {
                  "circles.keyword": "TcR2hjPVaFv4uYkUlCO8p1MzhFlwcf4"
                }
              },
              {
                "term": {
                  "circles.keyword": "cvMawK84jxklszzOCcOb538nTKns2Yf"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": []
          }
        },
        {
          "bool": {
            "must": []
          }
        },
        {
          "bool": {
            "must": []
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {},
      "parts.comments": {}
    },
    "pre_tags": [
      ""
    ],
    "post_tags": [
      ""
    ]
  }
}
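
The shape of the query part in examples 3 and 4 can be reproduced with a short Python sketch. It is a reconstruction from the JSON bodies above, not the app's implementation: it only handles words prefixed with +, assumes lowercasing as seen in the bodies, and uses `shlex` as a stand-in for however the app parses quoted text. All function names are made up for this example.

```python
import json
import shlex


def wildcard_pattern(word: str) -> str:
    """Keep user-supplied wildcards, otherwise wrap the word in *...*."""
    return word if ("*" in word or "?" in word) else f"*{word}*"


def clauses_for(word: str, username: str) -> dict:
    """One bool/should group per search word, ordered like the JSON bodies above."""
    share_field = f"share_names.{username}"
    should = [{"match_phrase_prefix": {f: word}} for f in ("content", "title", share_field)]
    should += [{"wildcard": {f: wildcard_pattern(word)}} for f in ("title", "content", share_field)]
    should.append({"query_string": {"fields": ["parts.comments"], "query": word}})
    return {"bool": {"should": should}}


def build_must(search: str, username: str) -> dict:
    """Combine all +prefixed words into a bool/must list (other words are ignored here)."""
    words = [w[1:].lower() for w in shlex.split(search) if w.startswith("+")]
    return {"bool": {"must": [clauses_for(w, username) for w in words]}}


query = build_must('+"Barbaren waren" +??ruppigen', "admin")
print(json.dumps(query, indent=2))
```

The printed structure corresponds to the inner `must` block of example 4; the `filter` and `highlight` sections are left out because they do not change between the examples.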

XueSheng-GIT (Author) commented

@ArtificialOwl Do you have any insights into why the content field is not part of the wildcard search? Any recommendation on how to proceed with this topic? As mentioned by @R0Wi, an additional setting could be an option.
