Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Check all values for trie.predictive_search #35943

Merged

Conversation

congqixia
Copy link
Contributor

Related to #35941

For marisa trie predictive_search default behavior, it value iterated is not in lexicographic order.

This PR is a brute force fix to make range operator returns correct values.

Related to milvus-io#35941

For marisa trie `predictive_search` default behavior, it value iterated
is not in lexicographic order.

This PR is a brute force fix to make range operator returns correct
values.

Signed-off-by: Congqi Xia <[email protected]>
@sre-ci-robot sre-ci-robot added the size/XS Denotes a PR that changes 0-9 lines. label Sep 3, 2024
@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: congqixia

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot added dco-passed DCO check passed. kind/bug Issues or changes related a bug labels Sep 3, 2024
Copy link
Contributor

@longjiquan longjiquan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
/cc @zhagnlu

@sre-ci-robot sre-ci-robot requested a review from zhagnlu September 3, 2024 12:00
@mergify mergify bot added the ci-passed label Sep 3, 2024
Copy link

codecov bot commented Sep 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.64%. Comparing base (9da8652) to head (5c5a378).
Report is 14 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #35943      +/-   ##
==========================================
- Coverage   81.55%   72.64%   -8.92%     
==========================================
  Files        1264     1264              
  Lines      150768   150762       -6     
==========================================
- Hits       122964   109519   -13445     
- Misses      22907    36349   +13442     
+ Partials     4897     4894       -3     
Files with missing lines Coverage Δ
internal/core/src/index/StringIndexMarisa.cpp 0.00% <ø> (-64.73%) ⬇️

... and 237 files with indirect coverage changes

@xiaofan-luan
Copy link
Collaborator

  1. the current fix looks good for me.

  2. when we build index, we can sort the keys, like

  3. // Insert keys in sorted order
    

    std::vectorstd::string words = {
    "apple", "application", "banana", "bath", "cat", "dog",
    "elephant", "fish", "giraffe", "hippopotamus"
    };
    std::sort(words.begin(), words.end());

    for (const auto& word : words) {
    keyset.push_back(word.c_str());
    }

    marisa::Trie trie;
    trie.build(keyset);

then the range query can be done by

`
std::vectorstd::string rangeQuery(const marisa::Trie& trie, const std::string& start, const std::string& end) {
std::vectorstd::string results;
marisa::Agent agent;

// Start from the first key that is greater than or equal to 'start'
agent.set_query(start.c_str());
if (trie.predictive_search(agent)) {
    do {
        std::string key(agent.key().ptr(), agent.key().length());
        
        // Stop if we've gone past 'end'
        if (key > end) {
            break;
        }

        results.push_back(key);
    } while (trie.predictive_search(agent));
}

return results;

}`

@xiaofan-luan
Copy link
Collaborator

of cause we need a meta to mark whether the dataset is sorted or not.
if not we fallback

@zhagnlu
Copy link
Contributor

zhagnlu commented Sep 5, 2024

/lgtm

@sre-ci-robot sre-ci-robot merged commit 7b21032 into milvus-io:master Sep 5, 2024
15 of 16 checks passed
congqixia added a commit to congqixia/milvus that referenced this pull request Sep 5, 2024
Related to milvus-io#35941

For marisa trie `predictive_search` default behavior, it value iterated
is not in lexicographic order.

This PR is a brute force fix to make range operator returns correct
values.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this pull request Sep 5, 2024
…5999)

Cherry-pick from master
pr: #35943 
Related to #35941

For marisa trie `predictive_search` default behavior, it value iterated
is not in lexicographic order.

This PR is a brute force fix to make range operator returns correct
values.

Signed-off-by: Congqi Xia <[email protected]>
@congqixia congqixia deleted the fix/brute_force_trie_search branch September 5, 2024 10:37
congqixia added a commit to congqixia/milvus that referenced this pull request Sep 6, 2024
Related to milvus-io#35941
Previous PR: milvus-io#35943

This PR make `Trie` index using `MARISA_LABEL_ORDER`, which make
predictive search iterating in lexicographic order.

When trie index is build in label order, lexicographc could be utilized
accelerating `Range` operations.

However according to the official document, using `MARISA_LABEL_ORDER`
will make "exact match lookup, common prefix search, and predictive search"
slower.

Signed-off-by: Congqi Xia <[email protected]>
congqixia added a commit to congqixia/milvus that referenced this pull request Sep 6, 2024
Related to milvus-io#35941
Previous PR: milvus-io#35943

This PR make `Trie` index using `MARISA_LABEL_ORDER`, which make
predictive search iterating in lexicographic order.

When trie index is build in label order, lexicographc could be utilized
accelerating `Range` operations.

However according to the official document, using `MARISA_LABEL_ORDER`
will make "exact match lookup, common prefix search, and predictive search"
slower.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this pull request Sep 6, 2024
)

Cherry pick from master
pr: #36034

Related to #35941
Previous PR: #35943

This PR make `Trie` index using `MARISA_LABEL_ORDER`, which make
predictive search iterating in lexicographic order.

When trie index is build in label order, lexicographc could be utilized
accelerating `Range` operations.

However according to the official document, using `MARISA_LABEL_ORDER`
will make "exact match lookup, common prefix search, and predictive
search" slower.

---------

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this pull request Sep 9, 2024
Related to #35941
Previous PR: #35943

This PR make `Trie` index using `MARISA_LABEL_ORDER`, which make
predictive search iterating in lexicographic order.

When trie index is build in label order, lexicographc could be utilized
accelerating `Range` operations.

However according to the official document, using `MARISA_LABEL_ORDER`
will make "exact match lookup, common prefix search, and predictive
search" slower.

---------

Signed-off-by: Congqi Xia <[email protected]>
chyezh pushed a commit to chyezh/milvus that referenced this pull request Sep 11, 2024
Related to milvus-io#35941

For marisa trie `predictive_search` default behavior, it value iterated
is not in lexicographic order.

This PR is a brute force fix to make range operator returns correct
values.

Signed-off-by: Congqi Xia <[email protected]>
chyezh pushed a commit to chyezh/milvus that referenced this pull request Sep 11, 2024
…#36034)

Related to milvus-io#35941
Previous PR: milvus-io#35943

This PR make `Trie` index using `MARISA_LABEL_ORDER`, which make
predictive search iterating in lexicographic order.

When trie index is build in label order, lexicographc could be utilized
accelerating `Range` operations.

However according to the official document, using `MARISA_LABEL_ORDER`
will make "exact match lookup, common prefix search, and predictive
search" slower.

---------

Signed-off-by: Congqi Xia <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved ci-passed dco-passed DCO check passed. kind/bug Issues or changes related a bug lgtm size/XS Denotes a PR that changes 0-9 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants