[Bug]: Search/Query may fail while updating delegator cache #37115

Closed
1 task done
weiliu1031 opened this issue Oct 24, 2024 · 1 comment
Assignees
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@weiliu1031
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master and 2.4
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The proxy updates the shard leader cache first, then releases the lock and tries to initialize the shard client. This leaves a window during which a request can get the shard leader from the meta cache but cannot find the matching shard client in the shard client manager.
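
The window described above can be reproduced deterministically with a minimal sketch. Everything here is hypothetical and simplified for illustration: `proxyState`, `publishLeader`, `initClient`, `lookup`, and `errClientNotFound` are not Milvus identifiers, and a string stands in for a real query node client.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errClientNotFound is a hypothetical error used only in this sketch.
var errClientNotFound = errors.New("shard client not found")

// proxyState is a simplified, hypothetical stand-in for the proxy's
// meta cache (leaders) and shard client manager (clients).
type proxyState struct {
	mu      sync.Mutex
	leaders map[string]int64 // channel name -> leader node ID
	clients map[int64]string // node ID -> client (a string stands in for a real client)
}

func newProxyState() *proxyState {
	return &proxyState{
		leaders: make(map[string]int64),
		clients: make(map[int64]string),
	}
}

// publishLeader is step 1: update the leader cache under the lock, then release it.
func (p *proxyState) publishLeader(channel string, nodeID int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.leaders[channel] = nodeID
}

// initClient is step 2: create the client only after the lock was released.
func (p *proxyState) initClient(nodeID int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.clients[nodeID] = fmt.Sprintf("client-%d", nodeID)
}

// lookup mimics a search/query request: resolve the leader, then its client.
func (p *proxyState) lookup(channel string) (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	nodeID, ok := p.leaders[channel]
	if !ok {
		return "", errors.New("shard leader not found")
	}
	client, ok := p.clients[nodeID]
	if !ok {
		return "", errClientNotFound
	}
	return client, nil
}

func main() {
	p := newProxyState()

	p.publishLeader("ch-0", 1)
	_, err := p.lookup("ch-0") // a request arriving between step 1 and step 2
	fmt.Println("inside the window:", err)

	p.initClient(1)
	client, err := p.lookup("ch-0")
	fmt.Println("after init:", client, err)
}
```

In a real proxy the two steps run on one goroutine while requests arrive on others, so the failure only shows up intermittently; the sketch forces the interleaving to make it visible.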

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@weiliu1031 weiliu1031 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 24, 2024
@weiliu1031
Contributor Author

/assign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 25, 2024
@yanliang567 yanliang567 removed their assignment Oct 25, 2024
@yanliang567 yanliang567 added this to the 2.4.14 milestone Oct 25, 2024
sre-ci-robot pushed a commit that referenced this issue Oct 28, 2024
issue: #37115
pr: #37116
Because initializing the query node client is too heavy, we removed
updateShardClient from the leader mutex, which caused many more
concurrency corner cases.

This PR delays the query node client's init operation until `getClient` is
called, then uses the leader mutex to protect the shard client update
process and avoid concurrency issues.
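
As a rough illustration of the lazy-init idea in this commit message, the sketch below defers client creation to the first `getClient` call and serializes it with a mutex. It is an assumption-laden sketch, not the actual Milvus code: `shardClient`, `queryNodeClient`, the `newClient` factory, the `inits` counter, and `concurrentGets` are hypothetical, and a per-struct mutex stands in for the leader mutex the commit mentions.

```go
package main

import (
	"fmt"
	"sync"
)

// queryNodeClient is a hypothetical stand-in for an expensive-to-create client.
type queryNodeClient struct{ addr string }

// shardClient creates its client lazily, on the first getClient call.
type shardClient struct {
	mu        sync.Mutex // stands in for the leader mutex (assumption)
	addr      string
	client    *queryNodeClient
	inits     int // counts how often the factory ran, for demonstration only
	newClient func(addr string) (*queryNodeClient, error)
}

// getClient initializes the client under the mutex if needed, so concurrent
// callers never observe a half-initialized or missing client.
func (s *shardClient) getClient() (*queryNodeClient, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.client == nil {
		c, err := s.newClient(s.addr)
		if err != nil {
			return nil, err // leave client nil so a later call can try again
		}
		s.client = c
		s.inits++
	}
	return s.client, nil
}

// concurrentGets calls getClient from n goroutines and waits for all of them.
func concurrentGets(s *shardClient, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = s.getClient()
		}()
	}
	wg.Wait()
}

func newDemoShardClient() *shardClient {
	return &shardClient{
		addr: "querynode-1:21123", // hypothetical address
		newClient: func(addr string) (*queryNodeClient, error) {
			return &queryNodeClient{addr: addr}, nil
		},
	}
}

func main() {
	s := newDemoShardClient()
	concurrentGets(s, 8)
	fmt.Println("factory invocations:", s.inits)
}
```

Holding the mutex across creation keeps the heavy init off the cache-update path while still guaranteeing at most one client per shard leader.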

---------

Signed-off-by: Wei Liu <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Nov 5, 2024
issue: #37115

Because initializing the query node client is too heavy, we removed
updateShardClient from the leader mutex, which caused many more
concurrency corner cases.

This PR delays the query node client's init operation until `getClient` is
called, then uses the leader mutex to protect the shard client update
process and avoid concurrency issues.

---------

Signed-off-by: Wei Liu <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Nov 5, 2024
issue: #37115

PR #37116 lets the proxy retry getting the shard leader if an error
happens. As a result, a search/query on an unloaded collection keeps
retrying until the context is done.

This PR adds an error type check to skip the retry on ErrCollectionLoaded.
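
The error type check can be pictured as a retry loop that returns immediately when it sees a designated error. This is only a sketch under assumptions: `retryUnless`, `errNoRetry`, and the two demo helpers are hypothetical names, and `errNoRetry` merely stands in for the error named in the commit message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNoRetry is a hypothetical sentinel standing in for the error the
// commit message names; it is not the real Milvus error value.
var errNoRetry = errors.New("collection load state error")

// retryUnless calls fn until it succeeds, the context is done, or fn returns
// an error matching skip. It reports how many attempts were made.
func retryUnless(ctx context.Context, interval time.Duration, skip error, fn func() error) (int, error) {
	attempts := 0
	for {
		attempts++
		err := fn()
		if err == nil {
			return attempts, nil
		}
		if errors.Is(err, skip) {
			return attempts, err // retrying cannot help, fail fast
		}
		select {
		case <-ctx.Done():
			return attempts, fmt.Errorf("%w (last error: %v)", ctx.Err(), err)
		case <-time.After(interval):
		}
	}
}

// demoUnloaded simulates a request against an unloaded collection.
func demoUnloaded() (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return retryUnless(ctx, time.Millisecond, errNoRetry, func() error {
		return fmt.Errorf("get shard leader: %w", errNoRetry)
	})
}

// demoTransient simulates two transient failures followed by success.
func demoTransient() (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	calls := 0
	return retryUnless(ctx, time.Millisecond, errNoRetry, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient failure")
		}
		return nil
	})
}

func main() {
	attempts, err := demoUnloaded()
	fmt.Println("unloaded collection:", attempts, "attempt(s), err:", err)

	attempts, err = demoTransient()
	fmt.Println("transient failure:", attempts, "attempt(s), err:", err)
}
```

Using `errors.Is` means the check still matches when the sentinel is wrapped with extra context along the call path.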

Signed-off-by: Wei Liu <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Nov 12, 2024
issue: #37115
The old implementation updated the shard cache and the shard client
manager at the same time, which caused many corner cases because the
concurrent updates ran without a lock.

This PR decouples the shard client manager from the shard cache, so only
the shard cache is updated when the delegator changes. The shard client
manager always returns the right client and creates a new client if one
does not exist. To guard against client leaks, the shard client manager
purges clients asynchronously every 10 minutes.
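
A decoupled manager of this kind might look roughly like the sketch below: a get-or-create lookup plus a background purge loop. The commit message does not say which clients are purged, so the idle-time rule here is an assumption, and `clientManager`, `client`, `purge`, `startPurger`, and `demo` are hypothetical names rather than Milvus code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// client is a hypothetical stand-in for a query node client.
type client struct {
	addr     string
	lastUsed time.Time
}

// clientManager owns clients independently of any shard leader cache.
type clientManager struct {
	mu      sync.Mutex
	clients map[int64]*client // node ID -> client
}

func newClientManager() *clientManager {
	return &clientManager{clients: make(map[int64]*client)}
}

// getClient returns the existing client for nodeID or creates one on demand,
// so a caller holding a valid leader never sees a missing client.
func (m *clientManager) getClient(nodeID int64, addr string, now time.Time) *client {
	m.mu.Lock()
	defer m.mu.Unlock()
	c, ok := m.clients[nodeID]
	if !ok {
		c = &client{addr: addr}
		m.clients[nodeID] = c
	}
	c.lastUsed = now
	return c
}

// purge drops clients idle for at least the given duration (assumed rule)
// and returns how many were removed.
func (m *clientManager) purge(now time.Time, idle time.Duration) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	removed := 0
	for id, c := range m.clients {
		if now.Sub(c.lastUsed) >= idle {
			delete(m.clients, id)
			removed++
		}
	}
	return removed
}

func (m *clientManager) size() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.clients)
}

// startPurger runs purge on a fixed interval until stop is closed.
func (m *clientManager) startPurger(interval, idle time.Duration, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				m.purge(time.Now(), idle)
			case <-stop:
				return
			}
		}
	}()
}

// demo exercises get-or-create and one purge pass with fixed timestamps.
func demo() (sameClient bool, removed int, remaining int) {
	m := newClientManager()
	t0 := time.Unix(0, 0)
	a := m.getClient(1, "querynode-1:21123", t0)
	b := m.getClient(1, "querynode-1:21123", t0.Add(time.Minute))
	m.getClient(2, "querynode-2:21123", t0)
	removed = m.purge(t0.Add(10*time.Minute), 10*time.Minute)
	return a == b, removed, m.size()
}

func main() {
	same, removed, remaining := demo()
	fmt.Println("same client reused:", same, "removed:", removed, "remaining:", remaining)
}
```

In a running service the purger would be started once, for example `m.startPurger(10*time.Minute, 10*time.Minute, stop)`, and stopped by closing `stop` on shutdown.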

---------

Signed-off-by: Wei Liu <[email protected]>
@yanliang567 yanliang567 modified the milestones: 2.4.14, 2.4.16 Nov 14, 2024
weiliu1031 added a commit to weiliu1031/milvus that referenced this issue Nov 17, 2024
)

issue: milvus-io#37115
The old implementation updated the shard cache and the shard client
manager at the same time, which caused many corner cases because the
concurrent updates ran without a lock.

This PR decouples the shard client manager from the shard cache, so only
the shard cache is updated when the delegator changes. The shard client
manager always returns the right client and creates a new client if one
does not exist. To guard against client leaks, the shard client manager
purges clients asynchronously every 10 minutes.

---------

Signed-off-by: Wei Liu <[email protected]>
@yanliang567 yanliang567 modified the milestones: 2.4.16, 2.4.17, 2.4.18 Nov 21, 2024
sre-ci-robot pushed a commit that referenced this issue Nov 25, 2024
)

issue: #37115
pr: #37371 #37646 #37729
The old implementation updated the shard cache and the shard client
manager at the same time, which caused many corner cases because the
concurrent updates ran without a lock.

This PR decouples the shard client manager from the shard cache, so only
the shard cache is updated when the delegator changes. The shard client
manager always returns the right client and creates a new client if one
does not exist. To guard against client leaks, the shard client manager
purges clients asynchronously every 10 minutes.

---------

---------

Signed-off-by: Wei Liu <[email protected]>
Signed-off-by: Congqi Xia <[email protected]>
Co-authored-by: congqixia <[email protected]>

2 participants