[Bug]: Search/Query may failed during updating delegator cache #37115

weiliu1031 · 2024-10-24T09:50:38Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version:master and 2.4
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The proxy updates the shard leader cache first, then releases the lock and tries to initialize the shard client. This creates a window during which a request can get the shard leader from the meta cache but cannot find the matching shard client in the shard client manager.
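The window described above can be reproduced deterministically with a minimal Go sketch. The types `leaderCache` and `clientManager`, and every method on them, are hypothetical stand-ins for the proxy's meta cache and shard client manager, not the actual Milvus types. The reader is simply run between the two update steps to show what a concurrent request would see.

```go
package main

import (
	"fmt"
	"sync"
)

// leaderCache is a hypothetical stand-in for the proxy's shard leader cache.
type leaderCache struct {
	mu      sync.RWMutex
	leaders map[string]int64 // channel name -> query node ID
}

func (c *leaderCache) update(channel string, nodeID int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.leaders[channel] = nodeID
}

func (c *leaderCache) get(channel string) (int64, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	id, ok := c.leaders[channel]
	return id, ok
}

// clientManager is a hypothetical stand-in for the shard client manager.
// A string plays the role of the real client object.
type clientManager struct {
	mu      sync.RWMutex
	clients map[int64]string // query node ID -> client
}

func (m *clientManager) initClient(nodeID int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.clients[nodeID] = fmt.Sprintf("client-for-node-%d", nodeID)
}

func (m *clientManager) get(nodeID int64) (string, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	c, ok := m.clients[nodeID]
	return c, ok
}

// lookupInWindow runs a reader after step 1 (cache update) but before
// step 2 (client init), which is the gap a concurrent request can hit.
func lookupInWindow() (leaderFound bool, clientFound bool) {
	cache := &leaderCache{leaders: map[string]int64{}}
	clients := &clientManager{clients: map[int64]string{}}

	cache.update("ch-0", 1) // step 1: the cache lock is released here

	nodeID, lf := cache.get("ch-0") // reader sees the new leader
	_, cf := clients.get(nodeID)    // but no client exists yet

	clients.initClient(nodeID) // step 2: arrives too late for the reader
	return lf, cf
}

func main() {
	leaderFound, clientFound := lookupInWindow()
	fmt.Println("leader found:", leaderFound, "client found:", clientFound)
}
```

In a real proxy the two steps run on one goroutine while requests run on others, so the gap is hit only intermittently; the sketch just fixes the interleaving.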

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

weiliu1031 · 2024-10-24T09:50:47Z

/assign

issue: #37115 pr: #37116 Because initializing the query node client is too heavy, we removed updateShardClient from the leader mutex, which caused many more concurrency corner cases. This PR delays the query node client's init operation until `getClient` is called, then uses the leader mutex to protect the shard client update and avoid concurrency issues. --------- Signed-off-by: Wei Liu <[email protected]>

issue: #37115 Signed-off-by: Wei Liu <[email protected]>

issue: #37115 pr#37116 made the proxy retry getting the shard leader when an error happens. As a result, a search/query on an unloaded collection keeps retrying until the ctx is done. This PR adds an error type check to skip the retry on ErrCollectionLoaded. Signed-off-by: Wei Liu <[email protected]>

issue: #37115 The old implementation updates the shard cache and the shard client manager at the same time, which causes many corner cases because the concurrent updates are not protected by a lock. This PR decouples the shard client manager from the shard cache, so only the shard cache is updated when the delegator changes. The shard client manager always returns the right client and creates a new one if it does not exist. To guard against client leaks, the shard client manager purges clients asynchronously every 10 minutes. --------- Signed-off-by: Wei Liu <[email protected]>

issue: #37115 pr: #37371 #37646 #37729 The old implementation updates the shard cache and the shard client manager at the same time, which causes many corner cases because the concurrent updates are not protected by a lock. This PR decouples the shard client manager from the shard cache, so only the shard cache is updated when the delegator changes. The shard client manager always returns the right client and creates a new one if it does not exist. To guard against client leaks, the shard client manager purges clients asynchronously every 10 minutes. --------- Signed-off-by: Wei Liu <[email protected]> Signed-off-by: Congqi Xia <[email protected]> Co-authored-by: congqixia <[email protected]>

weiliu1031 added kind/bug and needs-triage labels Oct 24, 2024

weiliu1031 assigned yanliang567 Oct 24, 2024

sre-ci-robot assigned weiliu1031 Oct 24, 2024

weiliu1031 mentioned this issue Oct 24, 2024

fix: Search/Query may failed during updating delegator cache. #37116

Merged

yanliang567 added triage/accepted and removed needs-triage labels Oct 25, 2024

yanliang567 removed their assignment Oct 25, 2024

yanliang567 added this to the 2.4.14 milestone Oct 25, 2024

weiliu1031 mentioned this issue Oct 28, 2024

fix: Search/Query may failed during updating delegator cache #37174

Merged

sre-ci-robot pushed a commit that referenced this issue Nov 5, 2024

fix: dead lock if query node crash during shard client init (#37354)

eb712f0

issue: #37115 Signed-off-by: Wei Liu <[email protected]>

yanliang567 modified the milestones: 2.4.14, 2.4.16 Nov 14, 2024

weiliu1031 mentioned this issue Nov 17, 2024

enhance: Decouple shard client manager from shard cache (#37371) #37753

Merged

yanliang567 modified the milestones: 2.4.16, 2.4.17, 2.4.18 Nov 21, 2024

weiliu1031 closed this as completed Nov 29, 2024
