Support fulltext query with wildcard #794

Closed
xiaoyifang opened this issue May 12, 2023 · 11 comments

@xiaoyifang
Contributor

xiaoyifang commented May 12, 2023

Some questions:

  1. Does libzim support wildcard query strings, such as "he*o"?
    https://xapian.org/docs/apidoc/html/classXapian_1_1QueryParser.html#:~:text=FLAG_WILDCARD-,Support%20wildcards.,-At%20present%20only
  2. Does libzim support CJK tokenization/queries? See "Support CJK index creation and query" #802.
@kelson42 kelson42 added this to the 8.3.0 milestone May 12, 2023
@mgautierfr
Collaborator

Does libzim support wildcard query strings, such as "he*o"?

Mostly no.

For getEntryByPath or findByPath, the answer is clearly no. We work at the byte level and search for an exact match.
findByPath searches for a prefix, so "foo" is equivalent to "foo.*", but that's all.

For fulltext search or suggestion search, the answer is also no. The query is passed to Xapian, which parses it, but we don't set FLAG_WILDCARD. Even if we did set it, according to the documentation only "foo*" would be supported, not "he*o".
For suggestion search, we use FLAG_PARTIAL, which in effect adds a * at the end of the query (see the Xapian documentation for the details).
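
To illustrate why a prefix (or trailing-wildcard) lookup is cheap while "he*o" is not, here is a minimal sketch over a sorted list of keys. It is not libzim's or Xapian's implementation; `keysWithPrefix` is a hypothetical helper, and the only assumption is that keys are kept sorted by raw byte value.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not a libzim or Xapian API): return every key in a
// byte-sorted list that starts with `prefix`.
std::vector<std::string> keysWithPrefix(const std::vector<std::string>& sortedKeys,
                                        const std::string& prefix)
{
    std::vector<std::string> out;
    // Binary search for the first key that is >= prefix.
    auto it = std::lower_bound(sortedKeys.begin(), sortedKeys.end(), prefix);
    // Keys sharing a prefix are contiguous in sorted order, so we can stop
    // at the first key that no longer starts with it.
    for (; it != sortedKeys.end() && it->compare(0, prefix.size(), prefix) == 0; ++it) {
        out.push_back(*it);
    }
    return out;
}
```

With the keys {"hello", "help", "hero", "zebra"}, calling `keysWithPrefix(keys, "hel")` returns "hello" and "help". A pattern such as "he*o" has no such contiguous range: every key starting with "he" would have to be visited and checked against the rest of the pattern.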

Does libzim support CJK tokenization/queries?

Libzim works at the byte level, and the specification says to store UTF-8.
So if you pass a UTF-8 encoded CJK string, it should work.

@xiaoyifang
Contributor Author

xiaoyifang commented May 12, 2023

goldendict-ng also uses Xapian as its fulltext engine. To search CJK characters, FLAG_CJK_NGRAM has to be passed to the TermGenerator. Without this flag, the query results are not correct, as far as I remember.

https://xapian.org/docs/sourcedoc/html/classXapian_1_1TermGenerator.html#ad1dbf8af7a6b0d5a7f0dca5f1202a291a0b31ee76f4e202359c590645280c7027:~:text=for%20spelling%20correction.-,FLAG_CJK_NGRAM%C2%A0,-Enable%20generation%20of

If libzim supports CJK search, I can skip creating a separate fulltext index for ZIM dictionaries in goldendict-ng and use libzim to query the ZIM file's built-in fulltext index, which should save a lot of disk space.

@mgautierfr
Collaborator

Hmm... Maybe we have to use FLAG_CJK_NGRAM then. I don't see how it could break things.

@kelson42
Contributor

kelson42 commented May 12, 2023

I probably don't fully understand, but a long time ago we decided not to save positional information, in order to save index storage space. I'm not sure this CJK flag can work without that kind of information. Glad to learn that I'm wrong if this is the case.

@xiaoyifang
Contributor Author

xiaoyifang commented May 12, 2023

CJK characters are not space-delimited the way English words are. The default tokenization is wrong for them, which leads to wrong query results.
Positional information costs a lot of extra space (about 1x more disk space than without it), but that is a separate issue from CJK. The CJK flag can work without positional information.

One drawback of omitting positional information is that a search returns somewhat more results than actually match. This can be discussed separately.
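
Both points can be illustrated with a small sketch. It assumes overlapping two-character terms (bigrams) and valid UTF-8 input; the exact terms Xapian generates under FLAG_CJK_NGRAM may differ, and all function names here are hypothetical, not Xapian or libzim APIs.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Split a UTF-8 string into one substring per code point.
// Assumes the input is valid UTF-8.
std::vector<std::string> utf8Chars(const std::string& s)
{
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        if (lead >= 0xF0) len = 4;
        else if (lead >= 0xE0) len = 3;
        else if (lead >= 0xC0) len = 2;
        chars.push_back(s.substr(i, len));
        i += len;
    }
    return chars;
}

// Overlapping two-character terms, e.g. "ABC" -> {"AB", "BC"}.
std::set<std::string> bigrams(const std::string& s)
{
    std::set<std::string> terms;
    const auto chars = utf8Chars(s);
    for (std::size_t i = 0; i + 1 < chars.size(); ++i) {
        terms.insert(chars[i] + chars[i + 1]);
    }
    return terms;
}

// Position-free match: every query bigram must occur somewhere in the
// document, in any order. A single-character query has no bigrams and
// trivially matches in this sketch.
bool matchesWithoutPositions(const std::string& doc, const std::string& query)
{
    const auto docTerms = bigrams(doc);
    for (const auto& term : bigrams(query)) {
        if (docTerms.count(term) == 0) return false;
    }
    return true;
}
```

A whitespace tokenizer would index "中国人民" as a single term, so a query for "人民" finds nothing; the bigram version indexes 中国, 国人 and 人民, so it matches. The cost of dropping positions shows up with the document "美国人在中国" and the query "中国人": both query bigrams (中国 and 国人) occur in the document, so `matchesWithoutPositions` reports a match even though the exact string "中国人" never appears.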

@kelson42
Contributor

kelson42 commented Jun 2, 2023

@xiaoyifang I would like to move forward with this ticket. One of the problems is that the feature request is not very clear from the user perspective. Can we please clarify this:

  • Are we talking about the Xapian-based suggestions? About fulltext search? Both?
  • What exactly is meant by « CJK character search »? Can we have a concrete use case example? What exactly does this apply to: Xapian suggestions and/or fulltext search?

If we are talking about two different feature requests, we should probably have two tickets.

@xiaoyifang
Contributor Author

I was talking about fulltext search. I do not know much about the Xapian-based suggestions.

What exactly is meant by « CJK character search »? Can we have a concrete use case example?

I have migrated GoldenDict's fulltext engine to Xapian. During the migration, I found that if the CJK flag shown below is not enabled

    Xapian::TermGenerator indexer;
    // Stemming is left disabled:
    //  Xapian::Stem stemmer("english");
    //  indexer.set_stemmer(stemmer);
    //  indexer.set_stemming_strategy(indexer.STEM_SOME_FULL_POS);
    // Generate n-gram terms for CJK text instead of relying on whitespace.
    indexer.set_flags( Xapian::TermGenerator::FLAG_CJK_NGRAM );

it will give fewer results when searching in a Chinese dictionary.

@kelson42
Contributor

kelson42 commented Jun 2, 2023

I was talking about fulltext search. I do not know much about the Xapian-based suggestions.

The suggestion search is a way to provide suggestions based on article titles (a completion approach).

it will give fewer results when searching in a Chinese dictionary.

Please open a dedicated ticket for this. To me this sounds like a serious bug, and I'm in favour of fixing it ASAP if Chinese search fails to work properly.

From now on, this ticket should focus only on wildcard fulltext search.

@kelson42
Contributor

kelson42 commented Jun 2, 2023

This ticket requests that we allow wildcard fulltext searches. Here are my remarks:

  • This feature has been ignored so far in the history of Kiwix, as I think it is not really needed, or at least not a priority.
  • I would be ready to consider it if it were pretty easy to implement and had no strong impact at the user level (the biggest impact being the Xapian index size).

@mgautierfr Do you think we could implement this feature without user impact?

@xiaoyifang
Contributor Author

it will give fewer results when searching in a Chinese dictionary.

Please open a dedicated ticket for this. To me this sounds like a serious bug, and I'm in favour of fixing it ASAP if Chinese search fails to work properly.

#802

@kelson42 kelson42 changed the title question,help: does libzim support query wildcard string and CJK ? Support fulltext query with wildcard <s>and CJK</s> Jun 15, 2023
@kelson42 kelson42 changed the title Support fulltext query with wildcard <s>and CJK</s> Support fulltext query with wildcard and CJK Jun 15, 2023
@kelson42 kelson42 modified the milestones: 9.0.0, 9.1.0 Sep 26, 2023
@kelson42 kelson42 modified the milestones: 9.1.0, 10.0.0 Nov 1, 2023
@kelson42 kelson42 modified the milestones: 11.0.0, 9.2.0 Dec 15, 2023
@kelson42 kelson42 pinned this issue Dec 26, 2023
@kelson42 kelson42 removed this from the 9.2.0 milestone Jan 5, 2024
@kelson42 kelson42 modified the milestone: 10.0.0 Jan 5, 2024
@kelson42 kelson42 changed the title Support fulltext query with wildcard and CJK Support fulltext query with wildcard Jan 5, 2024
@kelson42
Contributor

CJK has been implemented, which was definitely the most important part. For the rest, I think we will pass on wildcards.

@kelson42 kelson42 closed this as not planned Jan 13, 2024
@kelson42 kelson42 added this to the 9.2.0 milestone Jan 13, 2024
@kelson42 kelson42 unpinned this issue Jan 30, 2024