[Bug]: During rolling upgrades of the query node under high-load workloads, there may be approximately 100 seconds of interruption in search/query services. #36228
Comments
/assign @weiliu1031
/unassign
Same for creating an index.
This was also reproduced when upgrading from v2.4.5 to 2.4-20240912-ab31728b-amd64.
In the recent 2.4.10 --> master-20241028-fc69df44-amd64 mixcoord rolling upgrade, the interruption time for search and query was around 11s.
log: @weiliu1031
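For context, the interruption time reported in this thread is presumably the longest contiguous window of failed search/query requests during the upgrade. A minimal sketch of such a probe loop is below; `probeSearch` is a hypothetical stand-in for one search/query request against the cluster, not the actual chaos-test harness:

```go
package main

import (
	"context"
	"log"
	"time"
)

// probeSearch is a hypothetical stand-in for one search/query request
// (e.g. via the Milvus Go SDK); it returns an error while the service
// is unavailable.
func probeSearch(ctx context.Context) error {
	// ... issue one search with a short timeout ...
	return nil
}

// measureDowntime fires a probe every interval and reports the longest
// contiguous window of failures, i.e. the "interruption time".
func measureDowntime(ctx context.Context, interval time.Duration) time.Duration {
	var worst time.Duration
	var failedSince time.Time
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return worst
		case <-ticker.C:
			if err := probeSearch(ctx); err != nil {
				if failedSince.IsZero() {
					failedSince = time.Now()
				}
				if w := time.Since(failedSince); w > worst {
					worst = w
				}
			} else {
				failedSince = time.Time{} // service recovered, reset the window
			}
		}
	}
}

func main() {
	// Run the probe for the duration of the rolling upgrade.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	log.Printf("longest search interruption: %v", measureDowntime(ctx, time.Second))
}
```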
Known issue, should be fixed by #36880.
/assign @zhuwenxing
Still reproduced, and the downtime increased. log: artifacts-kafka-mixcoord-5443-server-logs.tar.gz
For the standalone upgrade, only index creation failed. failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5445/pipeline
For a rolling upgrade, 10s still seems to be a long time? 1s is expected since it is a graceful shutdown.
This seems to be caused by the high workload on the physical server, which caused etcd access issues. Please verify this with the same image.
/assign @zhuwenxing
Verifying with v2.4.14 --> master-20241111-fca946de-amd64. log:
Query/search started to fail at 2024-11-11 17:10:34.816652; at this time point, mixcoord was starting to upgrade.
A meta op latency increase also happened at this time.
The weird part is that meta op puts fail, because even during recovery most of the operations are get or scan rather than put/txn. @weiliu1031 I think we need more clues on:
Can we log some of them out for debugging purposes?
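A minimal sketch of the kind of logging meant here, assuming etcd's Go client (go.etcd.io/etcd/client/v3); the 100ms threshold and the example key are placeholders for illustration, not actual Milvus code:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// timedPut wraps a single etcd Put and logs it when it is slow or fails.
// The 100ms threshold and the example key below are made up.
func timedPut(ctx context.Context, cli *clientv3.Client, key, val string) error {
	start := time.Now()
	_, err := cli.Put(ctx, key, val)
	if elapsed := time.Since(start); elapsed > 100*time.Millisecond || err != nil {
		log.Printf("slow/failed etcd put key=%s elapsed=%v err=%v", key, elapsed, err)
	}
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	_ = timedPut(ctx, cli, "by-dev/meta/test-key", "test-value")
}
```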
Back to the original issue: rootcoord took 10s to finish loading meta, which causes 10s of unavailability for mixcoord during a rolling upgrade. And after we fixed the etcd operation latency, we found the average was around 15ms.
15ms seems too long for etcd. If that's the problem, we might need to think about concurrently recovering data from etcd, or lazy loading it.
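On the lazy/incremental side of that suggestion, here is a minimal sketch of a paged meta scan with etcd's Go client; the `by-dev/meta/` prefix and the 1000-key page size are assumptions for illustration, not what rootcoord actually uses:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// loadPrefixPaged scans all keys under prefix in fixed-size pages instead of
// one huge range read, so a single slow response does not stall the whole load.
func loadPrefixPaged(ctx context.Context, cli *clientv3.Client, prefix string, pageSize int64) (int, error) {
	total := 0
	key := prefix
	end := clientv3.GetPrefixRangeEnd(prefix)
	for {
		resp, err := cli.Get(ctx, key,
			clientv3.WithRange(end),
			clientv3.WithLimit(pageSize),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend))
		if err != nil {
			return total, err
		}
		total += len(resp.Kvs)
		if !resp.More {
			return total, nil
		}
		// Continue just past the last key returned in this page.
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	start := time.Now()
	n, err := loadPrefixPaged(context.Background(), cli, "by-dev/meta/", 1000)
	log.Printf("loaded %d keys in %v, err=%v", n, time.Since(start), err)
}
```

Splitting the prefix into sub-ranges and running several such scans in goroutines would give the concurrent variant.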
Verified that a large number of collections does not trigger a longer loading time. Need more logs for troubleshooting; suggest marking it as unblocking.
ok @weiliu1031
issue: #36228 pr: #37742 Signed-off-by: Wei Liu <[email protected]>
issue: #36228 Signed-off-by: Wei Liu <[email protected]>
Please verify this with the latest image.
/assign @zhuwenxing
The rolling upgrade with Kafka as the MQ is blocked; Pulsar will be used as the MQ for verification.
Still reproduced. log: artifacts-pulsar-mixcoord-5530-server-logs.tar.gz
Saw a big Pulsar lag but didn't see any obvious signal on the resource usage side. Let's turn on Pulsar's metrics and re-evaluate.
/assign @yanliang567 @zhuwenxing
Check disk usage? There may be a disk performance issue in our test environment.
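One quick way to check that, sketched below: etcd is most sensitive to fsync latency, so a small standalone probe run from the data disk of the test node gives a fast signal. This is not part of any existing tooling; the 8 KiB write size and 200 rounds are arbitrary choices:

```go
package main

import (
	"log"
	"os"
	"time"
)

// A rough fsync latency probe: etcd's WAL performance is dominated by fsync,
// so repeatedly appending a small block and syncing shows whether the node's
// disk is the bottleneck. Run it from a directory on the disk under test
// (not /tmp, which may be tmpfs).
func main() {
	f, err := os.CreateTemp(".", "fsync-probe-")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, 8*1024) // 8 KiB, roughly a small WAL entry batch
	var worst, total time.Duration
	const rounds = 200

	for i := 0; i < rounds; i++ {
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
		start := time.Now()
		if err := f.Sync(); err != nil {
			log.Fatal(err)
		}
		d := time.Since(start)
		total += d
		if d > worst {
			worst = d
		}
	}
	log.Printf("fsync over %d rounds: avg=%v worst=%v", rounds, total/rounds, worst)
}
```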
Is there an existing issue for this?
Environment
Current Behavior
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/4837/pipeline
log:
kafka mixcoord 4837 server logs.tar.gz
cluster: 4am
ns: chaos-testing
pod:
Anything else?
No response