Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: During rolling upgrades of the query node under high-load workloads, there may be approximately 100 seconds of interruption in search/query services. #36228

Open
1 task done
zhuwenxing opened this issue Sep 13, 2024 · 30 comments
Assignees
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:2.4.3--> master-20240912-e8840a1b-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-09-12T17:40:24.392Z] FAILED testcases/test_concurrent_request_operation_for_rolling_update.py::TestOperations::test_operations - pytest_assume.plugin.FailedAssumption: 

[2024-09-12T17:40:24.392Z] 2 Failed Assumptions:

[2024-09-12T17:40:24.392Z] 

[2024-09-12T17:40:24.392Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-09-12T17:40:24.392Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:40:24.392Z] AssertionError: Op.search rto expect 10s but get 106.86897778511047s

[2024-09-12T17:40:24.393Z] assert False

[2024-09-12T17:40:24.393Z] 

[2024-09-12T17:40:24.393Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-09-12T17:40:24.393Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:40:24.393Z] AssertionError: Op.query rto expect 10s but get 106.65217280387878s

[2024-09-12T17:40:24.393Z] assert False

[2024-09-12T17:40:24.393Z] ================== 1 failed, 1 warning in 1990.94s (0:33:10) ===================

image

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/4837/pipeline

log:

kafka mixcoord 4837 server logs.tar.gz

cluster:4am
ns: chaos-testing
pod:

[2024-09-12T17:42:56.171Z] + kubectl get pods -o wide

[2024-09-12T17:42:56.173Z] + grep kafka-mixcoord-4837

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-etcd-0                                       1/1     Running       0               40m     10.104.18.121   4am-node25   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-etcd-1                                       1/1     Running       0               40m     10.104.19.203   4am-node28   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-etcd-2                                       1/1     Running       0               40m     10.104.23.183   4am-node27   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-0                                      2/2     Running       1 (39m ago)     40m     10.104.18.123   4am-node25   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-1                                      2/2     Running       1 (39m ago)     40m     10.104.23.187   4am-node27   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-2                                      2/2     Running       1 (39m ago)     40m     10.104.26.192   4am-node32   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-exporter-c56d4d748-ph7zp               1/1     Running       4 (39m ago)     40m     10.104.13.2     4am-node16   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-zookeeper-0                            1/1     Running       0               40m     10.104.20.29    4am-node22   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-zookeeper-1                            1/1     Running       0               40m     10.104.23.189   4am-node27   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-zookeeper-2                            1/1     Running       0               40m     10.104.18.125   4am-node25   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-milvus-datanode-6df9d566b7-g667b             1/1     Running       0               23m     10.104.17.205   4am-node23   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-milvus-datanode-6df9d566b7-nxv7r             1/1     Running       0               22m     10.104.6.82     4am-node13   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-milvus-datanode-6df9d566b7-pn76j             1/1     Running       0               24m     10.104.25.182   4am-node30   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-milvus-indexnode-7ddd7959dc-66784            1/1     Running       0               31m     10.104.6.72     4am-node13   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-indexnode-7ddd7959dc-9dsrj            1/1     Running       0               31m     10.104.20.36    4am-node22   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-indexnode-7ddd7959dc-sxpps            1/1     Running       0               30m     10.104.17.197   4am-node23   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-mixcoord-794b46bcd5-rxdh2             1/1     Running       1 (26m ago)     27m     10.104.25.173   4am-node30   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-proxy-5fff9fdcb7-kljbb                1/1     Running       0               22m     10.104.6.81     4am-node13   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-querynode-1-698bdc6bdc-2znhb          1/1     Running       0               25m     10.104.25.179   4am-node30   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-querynode-1-698bdc6bdc-96ncx          1/1     Running       0               23m     10.104.6.80     4am-node13   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-querynode-1-698bdc6bdc-tf8ml          1/1     Running       0               24m     10.104.17.204   4am-node23   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-minio-0                                      1/1     Running       0               40m     10.104.20.28    4am-node22   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-minio-1                                      1/1     Running       0               40m     10.104.19.204   4am-node28   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-minio-2                                      1/1     Running       0               40m     10.104.18.124   4am-node25   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-minio-3                                      1/1     Running       0               40m     10.104.23.182   4am-node27   <none>           <none>

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 13, 2024
@zhuwenxing
Copy link
Contributor Author

/assign @weiliu1031
PTAL

@yanliang567
Copy link
Contributor

/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 13, 2024
@yanliang567 yanliang567 added this to the 2.5.0 milestone Sep 13, 2024
@zhuwenxing
Copy link
Contributor Author

same for creating index

[2024-09-12T17:42:53.629Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-09-12T17:42:53.629Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:42:53.629Z] AssertionError: Op.insert rto expect 10s but get 56.61560535430908s

[2024-09-12T17:42:53.629Z] assert False

[2024-09-12T17:42:53.629Z] 

[2024-09-12T17:42:53.629Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-09-12T17:42:53.629Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:42:53.629Z] AssertionError: Op.index rto expect 10s but get 856.4799315929413s

[2024-09-12T17:42:53.629Z] assert False

[2024-09-12T17:42:53.629Z] 

[2024-09-12T17:42:53.629Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-09-12T17:42:53.629Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:42:53.629Z] AssertionError: Op.search rto expect 10s but get 106.47509121894836s

[2024-09-12T17:42:53.629Z] assert False

[2024-09-12T17:42:53.629Z] 

[2024-09-12T17:42:53.629Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-09-12T17:42:53.629Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:42:53.629Z] AssertionError: Op.query rto expect 10s but get 106.66716194152832s

image

@zhuwenxing
Copy link
Contributor Author

@zhuwenxing
Copy link
Contributor Author

In the recent 2.4.10 --> master-20241028-fc69df44-amd64 mixcoord rolling upgrade, the interruption time for search and query was around 11s.


[2024-10-28T17:41:02.697Z] FAILED testcases/test_concurrent_request_operation_for_rolling_update.py::TestOperations::test_operations - pytest_assume.plugin.FailedAssumption: 

[2024-10-28T17:41:02.697Z] 2 Failed Assumptions:

[2024-10-28T17:41:02.697Z] 

[2024-10-28T17:41:02.697Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-10-28T17:41:02.697Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-10-28T17:41:02.697Z] AssertionError: Op.search rto expect 10s but get 11.00616192817688s

[2024-10-28T17:41:02.697Z] assert False

[2024-10-28T17:41:02.697Z] 

[2024-10-28T17:41:02.697Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-10-28T17:41:02.697Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-10-28T17:41:02.697Z] AssertionError: Op.query rto expect 10s but get 11.004719734191895s

failed job:https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5410/pipeline

log:
artifacts-kafka-mixcoord-5410-server-logs.tar.gz

@weiliu1031
I feel that this result can still be optimized.

@weiliu1031
Copy link
Contributor

In the recent 2.4.10 --> master-20241028-fc69df44-amd64 mixcoord rolling upgrade, the interruption time for search and query was around 11s.


[2024-10-28T17:41:02.697Z] FAILED testcases/test_concurrent_request_operation_for_rolling_update.py::TestOperations::test_operations - pytest_assume.plugin.FailedAssumption: 

[2024-10-28T17:41:02.697Z] 2 Failed Assumptions:

[2024-10-28T17:41:02.697Z] 

[2024-10-28T17:41:02.697Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-10-28T17:41:02.697Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-10-28T17:41:02.697Z] AssertionError: Op.search rto expect 10s but get 11.00616192817688s

[2024-10-28T17:41:02.697Z] assert False

[2024-10-28T17:41:02.697Z] 

[2024-10-28T17:41:02.697Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-10-28T17:41:02.697Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-10-28T17:41:02.697Z] AssertionError: Op.query rto expect 10s but get 11.004719734191895s

failed job:https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5410/pipeline

log: artifacts-kafka-mixcoord-5410-server-logs.tar.gz

@weiliu1031 I feel that this result can still be optimized.

known issue, should be fixed by #36880
please verify this with the the commit after this pr

@weiliu1031
Copy link
Contributor

/assign @zhuwenxing

@zhuwenxing
Copy link
Contributor Author

still reproduced in v2.4.14 --> master-20241105-b83b376c-amd64

and downtime increased

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5443/pipeline

log:

artifacts-kafka-mixcoord-5443-server-logs.tar.gz


[2024-11-05T06:51:12.600Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[7-10] - pytest_assume.plugin.FailedAssumption: 

[2024-11-05T06:51:12.600Z] 4 Failed Assumptions:

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(

[2024-11-05T06:51:12.600Z] AssertionError: Expect Succ: Op.index succ rate 0.9759036144578314, total: 83, average time: 18.9162

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.index rto expect 10s but get 242.64889454841614s

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.search rto expect 10s but get 18.213006496429443s

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.query rto expect 10s but get 503.08261704444885s

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[3-10] - pytest_assume.plugin.FailedAssumption: 

[2024-11-05T06:51:12.600Z] 4 Failed Assumptions:

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(

[2024-11-05T06:51:12.600Z] AssertionError: Expect Succ: Op.index succ rate 0.9753086419753086, total: 81, average time: 19.0024

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.index rto expect 10s but get 264.7868928909302s

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.search rto expect 10s but get 503.0670373439789s

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.query rto expect 10s but get 18.21220326423645s

@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Nov 5, 2024
@zhuwenxing
Copy link
Contributor Author

for standalone upgrade,only index failed

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5445/pipeline
log:
artifacts-kafka-standalone-5445-server-logs.tar.gz


[2024-11-05T07:44:42.318Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[1-10] - pytest_assume.plugin.FailedAssumption: 

[2024-11-05T07:44:42.318Z] 3 Failed Assumptions:

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(

[2024-11-05T07:44:42.318Z] AssertionError: Expect Succ: Op.index succ rate 0.25, total: 16, average time: 73.3134

[2024-11-05T07:44:42.318Z] assert False

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-05T07:44:42.318Z] AssertionError: IndexChecker__w3zvOK6i Op.index rto expect 10s but get 1441.7572689056396s, [('failure', 4, '2024-11-05 07:13:16.469496', 1730790796.4695222), ('failure', 5, '2024-11-05 07:15:18.316713', 1730790918.3167536), ('failure', 6, '2024-11-05 07:17:19.876768', 1730791039.8768048), ('failure', 7, '2024-11-05 07:19:20.999701', 1730791160.9997256), ('failure', 8, '2024-11-05 07:21:22.218705', 1730791282.2187417), ('failure', 9, '2024-11-05 07:23:23.670712', 1730791403.670735), ('failure', 10, '2024-11-05 07:25:24.974862', 1730791524.9748917), ('success', 11, '2024-11-05 07:27:26.117733', 1730791646.1177838), ('failure', 12, '2024-11-05 07:29:13.369737', 1730791753.369796), ('failure', 13, '2024-11-05 07:31:14.670728', 1730791874.6707768), ('failure', 14, '2024-11-05 07:33:15.822704', 1730791995.822736), ('failure', 15, '2024-11-05 07:35:16.960711', 1730792116.9607549), ('failure', 16, '2024-11-05 07:37:18.226725', 1730792238.2267911)]

[2024-11-05T07:44:42.318Z] assert False

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(

[2024-11-05T07:44:42.318Z] AssertionError: Expect Succ: Op.index succ rate 0.3333333333333333, total: 3, average time: 46.8264

[2024-11-05T07:44:42.318Z] assert False

[2024-11-05T07:44:42.318Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[8-10] - pytest_assume.plugin.FailedAssumption: 

[2024-11-05T07:44:42.318Z] 3 Failed Assumptions:

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(

[2024-11-05T07:44:42.318Z] AssertionError: Expect Succ: Op.index succ rate 0.1875, total: 16, average time: 75.0513

[2024-11-05T07:44:42.318Z] assert False

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-05T07:44:42.318Z] AssertionError: IndexChecker__He0D4Hs4 Op.index rto expect 10s but get 1556.2252717018127s, [('failure', 3, '2024-11-05 07:13:19.365238', 1730790799.3652759), ('failure', 4, '2024-11-05 07:15:20.702708', 1730790920.7027457), ('failure', 5, '2024-11-05 07:17:21.885715', 1730791041.885755), ('failure', 6, '2024-11-05 07:19:23.068731', 1730791163.068761), ('failure', 7, '2024-11-05 07:21:24.407175', 1730791284.407211), ('failure', 8, '2024-11-05 07:23:25.699772', 1730791405.699804), ('failure', 9, '2024-11-05 07:25:27.008729', 1730791527.0087748), ('failure', 10, '2024-11-05 07:27:28.121738', 1730791648.1217904), ('success', 11, '2024-11-05 07:29:29.502708', 1730791769.502753), ('failure', 12, '2024-11-05 07:31:10.339660', 1730791870.3396993), ('failure', 13, '2024-11-05 07:33:11.922663', 1730791991.9227295), ('failure', 14, '2024-11-05 07:35:13.253711', 1730792113.2537587), ('failure', 15, '2024-11-05 07:37:14.365715', 1730792234.3657625), ('failure', 16, '2024-11-05 07:39:15.590512', 1730792355.5905476)]

[2024-11-05T07:44:42.318Z] assert False

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(

[2024-11-05T07:44:42.318Z] AssertionError: Expect Succ: Op.index succ rate 0.0, total: 1, average time: 0.0000

@xiaofan-luan
Copy link
Collaborator

for rolling upgrade, 10s still seems to be a long time? 1s is expected since it is a graceful shut down

@weiliu1031
Copy link
Contributor

for rolling upgrade, 10s still seems to be a long time? 1s is expected since it is a graceful shut down

image
the recent search failure during rolling upgrade come from the mixcoord unavailable, which caused by the meta op latency.
image

the reason of high meta op latency is still in investigation

@weiliu1031
Copy link
Contributor

seems caused by the high workload on physical server, which caused etcd access issue, please verify this with same image

@weiliu1031
Copy link
Contributor

/assign @zhuwenxing

@zhuwenxing
Copy link
Contributor Author

verifying with v2.4.14 --> master-20241111-fca946de-amd64

@weiliu1031

job link: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5474/pipeline/149/

log:
artifacts-kafka-mixcoord-5474-server-logs.tar.gz


[2024-11-11T17:42:06.750Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-11T17:42:06.750Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-11T17:42:06.750Z] AssertionError: IndexChecker__Q1iIhb7N Op.index rto expect 10s but get 242.92061233520508s, [('failure', 31, '2024-11-11 17:11:23.263264', 1731345083.2634275), ('failure', 32, '2024-11-11 17:13:24.797621', 1731345204.797688), ('success', 33, '2024-11-11 17:15:26.183990', 1731345326.1840398)]

[2024-11-11T17:42:06.750Z] assert False

[2024-11-11T17:42:06.750Z] 

[2024-11-11T17:42:06.750Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-11T17:42:06.750Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-11T17:42:06.750Z] AssertionError: SearchChecker__7VvXx7SL Op.search rto expect 10s but get 40.26463198661804s, [('failure', 395, '2024-11-11 17:10:35.015949', 1731345035.0160327), ('failure', 396, '2024-11-11 17:10:42.265899', 1731345042.265943), ('failure', 397, '2024-11-11 17:10:53.269408', 1731345053.2694542), ('failure', 398, '2024-11-11 17:11:04.275540', 1731345064.2756026), ('success', 399, '2024-11-11 17:11:15.280606', 1731345075.2806647)]

[2024-11-11T17:42:06.750Z] assert False

[2024-11-11T17:42:06.750Z] 

[2024-11-11T17:42:06.750Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-11T17:42:06.750Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-11T17:42:06.750Z] AssertionError: QueryChecker__2q3AeVn7 Op.query rto expect 10s but get 44.01552772521973s, [('failure', 395, '2024-11-11 17:10:34.816652', 1731345034.8167083), ('failure', 396, '2024-11-11 17:10:45.820608', 1731345045.820646), ('failure', 397, '2024-11-11 17:10:56.823408', 1731345056.823445), ('failure', 398, '2024-11-11 17:11:07.827918', 1731345067.8279743), ('success', 399, '2024-11-11 17:11:18.832177', 1731345078.832236)]

query/search start to fail at 2024-11-11 17:10:34.816652

at this time point, mixcoord is starting to upgrade

[2024-11-11T17:10:01.776Z] [2024-11-11 17:09:54 - INFO - ci_test]: update image for component mixCoord (test_rolling_update_one_by_one.py:220)
[2024-11-11T17:10:01.776Z] [2024-11-11 17:09:54 - INFO - ci_test]: cmd: kubectl patch Milvus kafka-mixcoord-5474 --patch-file /home/jenkins/agent/workspace/tests/python_client/deploy/milvus_crd_modified.yaml --type merge (test_rolling_update_one_by_one.py:139)
[2024-11-11T17:10:01.776Z] [2024-11-11 17:09:54 - INFO - ci_test]: kubectl patch Milvus kafka-mixcoord-5474 --patch-file /home/jenkins/agent/workspace/tests/python_client/deploy/milvus_crd_modified.yaml --type merge
[2024-11-11T17:10:01.776Z] milvus.milvus.io/kafka-mixcoord-5474 patched
[2024-11-11T17:10:01.776Z] 
[2024-11-11T17:10:01.776Z]  (test_rolling_update_one_by_one.py:143)
[2024-11-11T17:10:01.776Z] [2024-11-11 17:09:54 - INFO - ci_test]: [update image for mixCoord]wait 10s after rolling update patch (test_rolling_update_one_by_one.py:232)
[2024-11-11T17:10:04.290Z] [2024-11-11 17:10:04 - INFO - ci_test]: cmd: kubectl get pod|grep kafka-mixcoord-5474 (test_rolling_update_one_by_one.py:139)
[2024-11-11T17:10:04.545Z] [2024-11-11 17:10:04 - INFO - ci_test]: kubectl get pod|grep kafka-mixcoord-5474
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-etcd-0                                       1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-etcd-1                                       1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-etcd-2                                       1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-0                                      2/2     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-1                                      2/2     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-2                                      2/2     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-exporter-66487bd99b-x9llp              1/1     Running             3 (12m ago)      13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-zookeeper-0                            1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-zookeeper-1                            1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-zookeeper-2                            1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-datanode-775996798d-89vw4             1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-datanode-775996798d-mcdxb             1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-datanode-775996798d-tk6q7             1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-indexnode-5868f7b8d6-btcvm            1/1     Running             0                4m56s
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-indexnode-5868f7b8d6-m9qlq            1/1     Running             0                4m15s
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-indexnode-5868f7b8d6-rjcrd            1/1     Running             0                3m35s
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-mixcoord-7b76cb6bb8-fn99q             0/1     Running             0                10s
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-mixcoord-84b57479b4-w2wx6             1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-proxy-7d46bd48b8-mwgp9                1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-querynode-0-6f84cf6649-2hx67          1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-querynode-0-6f84cf6649-bnh7t          1/1     Running             0                11m
[2024-11-11T17:10:04.546Z] kafka-mixcoord-5474-milvus-querynode-0-6f84cf6649-hm2wx          1/1     Running             0                11m
[2024-11-11T17:10:04.546Z] kafka-mixcoord-5474-minio-0                                      1/1     Running             0                13m
[2024-11-11T17:10:04.546Z] kafka-mixcoord-5474-minio-1                                      1/1     Running             0                13m
[2024-11-11T17:10:04.546Z] kafka-mixcoord-5474-minio-2                                      1/1     Running             0                13m
[2024-11-11T17:10:04.546Z] kafka-mixcoord-5474-minio-3                                      1/1     Running             0                13m

image

meta op latency increase also happened at this time

@xiaofan-luan
Copy link
Collaborator

the weird part is meta op put fail.

because even under recovery the most operation done is get or scan rather than put/txn.

@weiliu1031 I think we need more clues on:

  1. how many keys on their in the etcd? is there a frequent ddl on this cases? seems the rootcoord meta recovery is slow as well, my guess is that the snapshot kv is not cleaned as expect.
  2. we need more monitoring on the cloud console, not only the latency , but also about the QPS.
  3. need to add more logs on coord recovery and analyze on the slowest part

@weiliu1031
Copy link
Contributor

weiliu1031 commented Nov 13, 2024

kafka-mixcoord-5474

image

  1. we already have the etcd op latency/qps/throughput
  2. and there are at most 15.5k/s etcd get op during mixcoord, which is much more than expected

@liliu-z
Copy link
Member

liliu-z commented Nov 13, 2024

Can we log some of them out for debugging purpose?

@weiliu1031
Copy link
Contributor

we found that the unexpected etcd qps/latency info is not a real problem, it actually come from the dashboard misuse. the old dashboard config just compute the rate by op_type, so during rolling upgrade, during the period when 2 mixcoord exists, the dashboard will sum 2 mixcoord's metrics value. and after we correct the dashboard to histogram_quantile(0.99, sum by (le, meta_op_type, pod) (rate(milvus_meta_request_latency_bucket{app_kubernetes_io_instance=~"$instance", app_kubernetes_io_name="$app_name", namespace="$namespace"}[2m]))) the dashboard is back to normal.
image

we also can verify this dashboard issue by log, because we log all etcd slow operation with keyword Slow etcd operation. the dash board shows the highest etcd op latency is more than 2 minitues, but no Slow etcd operation shows in logs

@weiliu1031
Copy link
Contributor

back to the original issue, rootcoord cost 10s to load meta done, which will cause a 10s unavailable for mixcoord during rolling upgrade. and after we fix the etcd operation latency, we found the average get latency during rolling upgrade is about 15ms. and the cluster has 80 collections, then ListCollections need more than 240 etcd get op, which will cost more than 3.6s. so the original issue should related to collection number, and with much more collections, rootcoord may cost much more times to load meta from etcd. we will increase collection number to verify this

@xiaofan-luan
Copy link
Collaborator

回到最初的问题,rootcoord 需要 10 秒才能加载完 meta,这将导致滚动升级期间 mixcoord 需要 10 秒才能使用。在我们修复了 etcd 操作延迟之后,我们发现get latency滚动升级期间的平均延迟约为15ms。并且集群有 80 个 collection,那么ListCollections需要超过 240 个 etcd get op,这将花费超过3.6s。所以最初的问题应该与 collection 数量有关,如果 collection 数量更多,rootcoord 可能需要更多时间从 etcd 加载 meta。我们将增加 collection 数量来验证这一点

15ms seems to be too long for etcd.

if that's the problem we might need to think of concurrently recover data from etcd or lazy load

@liliu-z
Copy link
Member

liliu-z commented Nov 15, 2024

回到最初的问题,rootcoord 需要 10 秒才能加载完 meta,这将导致滚动升级期间 mixcoord 需要 10 秒才能使用。在我们修复了 etcd 操作延迟之后,我们发现get latency滚动升级期间的平均延迟约为15ms。并且集群有 80 个 collection,那么ListCollections需要超过 240 个 etcd get op,这将花费超过3.6s。所以最初的问题应该与 collection 数量有关,如果 collection 数量更多,rootcoord 可能需要更多时间从 etcd 加载 meta。我们将增加 collection 数量来验证这一点

15ms seems to be too long for etcd.

if that's the problem we might need to think of concurrently recover data from etcd or lazy load

Verified large number of collections will not trigger longer loading time. Need more logs to trouble shooting, suggest to mark it as unblocking

@xiaofan-luan
Copy link
Collaborator

ok @weiliu1031
please keep working on it since it is very important on failure recovery time and stability

@weiliu1031
Copy link
Contributor

ok @weiliu1031 please keep working on it since it is very important on failure recovery time and stability

image
use batch could solve the reload cost

@weiliu1031
Copy link
Contributor

please verify this with latest image

@weiliu1031
Copy link
Contributor

/assign @zhuwenxing

@zhuwenxing
Copy link
Contributor Author

Kafka as an mq rolling upgrade is blocked, Pulsar will be used as mq for verification.

@zhuwenxing
Copy link
Contributor Author

still reproduced in 2.4-20241120-5ac4e483-amd64 --> master-20241120-c73ca14a-amd64
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5530/pipeline
log:

artifacts-pulsar-mixcoord-5530-server-logs.tar.gz


[2024-11-21T03:00:28.514Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-21T03:00:28.514Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-21T03:00:28.514Z] AssertionError: SearchChecker__hwZSxMqL Op.search rto expect 10s but get 130.49712324142456s, [('failure', 916, '2024-11-21 02:48:08.010751', 1732157288.0107899), ('success', 917, '2024-11-21 02:48:19.015752', 1732157299.0157807), ('failure', 921, '2024-11-21 02:48:28.459805', 1732157308.4598365), ('failure', 922, '2024-11-21 02:48:39.463754', 1732157319.4637787), ('failure', 923, '2024-11-21 02:48:50.467741', 1732157330.4677663), ('failure', 924, '2024-11-21 02:49:01.471753', 1732157341.4717753), ('failure', 925, '2024-11-21 02:49:12.475750', 1732157352.4757862), ('failure', 926, '2024-11-21 02:49:23.479765', 1732157363.4797904), ('failure', 927, '2024-11-21 02:49:34.483755', 1732157374.4837914), ('failure', 928, '2024-11-21 02:49:45.487743', 1732157385.4877672), ('failure', 929, '2024-11-21 02:49:56.491737', 1732157396.4917598), ('failure', 930, '2024-11-21 02:50:07.496769', 1732157407.4967978), ('success', 931, '2024-11-21 02:50:18.507805', 1732157418.507913)]

[2024-11-21T03:00:28.514Z] assert False

[2024-11-21T03:00:28.514Z] 

[2024-11-21T03:00:28.514Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-21T03:00:28.514Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-21T03:00:28.514Z] AssertionError: QueryChecker__sFO0meaK Op.query rto expect 10s but get 548.6190268993378s, [('failure', 921, '2024-11-21 02:48:08.014770', 1732157288.0147953), ('success', 922, '2024-11-21 02:48:19.017779', 1732157299.017805), ('failure', 926, '2024-11-21 02:48:28.453833', 1732157308.4538553), ('failure', 927, '2024-11-21 02:48:39.456765', 1732157319.4567926), ('failure', 928, '2024-11-21 02:48:50.459782', 1732157330.4598181), ('failure', 929, '2024-11-21 02:49:01.463224', 1732157341.4632468), ('failure', 930, '2024-11-21 02:49:12.466047', 1732157352.4660752), ('failure', 931, '2024-11-21 02:49:23.469060', 1732157363.4690855), ('failure', 932, '2024-11-21 02:49:34.471773', 1732157374.4718187), ('failure', 933, '2024-11-21 02:49:45.474791', 1732157385.4748273), ('failure', 934, '2024-11-21 02:49:56.478008', 1732157396.4780388), ('failure', 935, '2024-11-21 02:50:07.481493', 1732157407.4815269), ('failure', 936, '2024-11-21 02:50:18.484792', 1732157418.484839), ('success', 937, '2024-11-21 02:50:29.487811', 1732157429.4878602), ('failure', 1240, '2024-11-21 02:57:16.633790', 1732157836.6338222)]

[2024-11-21T03:00:28.514Z] assert False

image

@weiliu1031
Copy link
Contributor

weiliu1031 commented Nov 21, 2024

new issue found
image
search failed due to tSafe latency
image
and send insert message to mq cost more than 2minutes, which cause the high tsafe latency

@liliu-z
Copy link
Member

liliu-z commented Nov 21, 2024

Screen Shot 2024-11-21 at 7 59 47 PM

Saw big pulsar lag but didn't see any obvious signal on resource usage side. Let's turn on pulsar's metrics and reevaluate

/assign @yanliang567 @zhuwenxing

@xiaofan-luan
Copy link
Collaborator

@zhuwenxing

check disk usage? there maybe disk performance issue on our test environment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

5 participants