[Bug]: During rolling upgrades of the query node under high-load workloads, there may be approximately 100 seconds of interruption in search/query services. #36228

zhuwenxing · 2024-09-13T02:39:05Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version:2.4.3--> master-20240912-e8840a1b-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-09-12T17:40:24.392Z] FAILED testcases/test_concurrent_request_operation_for_rolling_update.py::TestOperations::test_operations - pytest_assume.plugin.FailedAssumption: 

[2024-09-12T17:40:24.392Z] 2 Failed Assumptions:

[2024-09-12T17:40:24.392Z] 

[2024-09-12T17:40:24.392Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-09-12T17:40:24.392Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:40:24.392Z] AssertionError: Op.search rto expect 10s but get 106.86897778511047s

[2024-09-12T17:40:24.393Z] assert False

[2024-09-12T17:40:24.393Z] 

[2024-09-12T17:40:24.393Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-09-12T17:40:24.393Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:40:24.393Z] AssertionError: Op.query rto expect 10s but get 106.65217280387878s

[2024-09-12T17:40:24.393Z] assert False

[2024-09-12T17:40:24.393Z] ================== 1 failed, 1 warning in 1990.94s (0:33:10) ===================

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/4837/pipeline

log:

kafka mixcoord 4837 server logs.tar.gz

cluster:4am
ns: chaos-testing
pod:

[2024-09-12T17:42:56.171Z] + kubectl get pods -o wide

[2024-09-12T17:42:56.173Z] + grep kafka-mixcoord-4837

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-etcd-0                                       1/1     Running       0               40m     10.104.18.121   4am-node25   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-etcd-1                                       1/1     Running       0               40m     10.104.19.203   4am-node28   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-etcd-2                                       1/1     Running       0               40m     10.104.23.183   4am-node27   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-0                                      2/2     Running       1 (39m ago)     40m     10.104.18.123   4am-node25   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-1                                      2/2     Running       1 (39m ago)     40m     10.104.23.187   4am-node27   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-2                                      2/2     Running       1 (39m ago)     40m     10.104.26.192   4am-node32   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-exporter-c56d4d748-ph7zp               1/1     Running       4 (39m ago)     40m     10.104.13.2     4am-node16   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-zookeeper-0                            1/1     Running       0               40m     10.104.20.29    4am-node22   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-zookeeper-1                            1/1     Running       0               40m     10.104.23.189   4am-node27   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-kafka-zookeeper-2                            1/1     Running       0               40m     10.104.18.125   4am-node25   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-milvus-datanode-6df9d566b7-g667b             1/1     Running       0               23m     10.104.17.205   4am-node23   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-milvus-datanode-6df9d566b7-nxv7r             1/1     Running       0               22m     10.104.6.82     4am-node13   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-milvus-datanode-6df9d566b7-pn76j             1/1     Running       0               24m     10.104.25.182   4am-node30   <none>           <none>

[2024-09-12T17:42:56.173Z] kafka-mixcoord-4837-milvus-indexnode-7ddd7959dc-66784            1/1     Running       0               31m     10.104.6.72     4am-node13   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-indexnode-7ddd7959dc-9dsrj            1/1     Running       0               31m     10.104.20.36    4am-node22   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-indexnode-7ddd7959dc-sxpps            1/1     Running       0               30m     10.104.17.197   4am-node23   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-mixcoord-794b46bcd5-rxdh2             1/1     Running       1 (26m ago)     27m     10.104.25.173   4am-node30   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-proxy-5fff9fdcb7-kljbb                1/1     Running       0               22m     10.104.6.81     4am-node13   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-querynode-1-698bdc6bdc-2znhb          1/1     Running       0               25m     10.104.25.179   4am-node30   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-querynode-1-698bdc6bdc-96ncx          1/1     Running       0               23m     10.104.6.80     4am-node13   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-milvus-querynode-1-698bdc6bdc-tf8ml          1/1     Running       0               24m     10.104.17.204   4am-node23   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-minio-0                                      1/1     Running       0               40m     10.104.20.28    4am-node22   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-minio-1                                      1/1     Running       0               40m     10.104.19.204   4am-node28   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-minio-2                                      1/1     Running       0               40m     10.104.18.124   4am-node25   <none>           <none>

[2024-09-12T17:42:56.174Z] kafka-mixcoord-4837-minio-3                                      1/1     Running       0               40m     10.104.23.182   4am-node27   <none>           <none>

Anything else?

No response

The text was updated successfully, but these errors were encountered:

zhuwenxing · 2024-09-13T02:39:38Z

/assign @weiliu1031
PTAL

yanliang567 · 2024-09-13T02:48:41Z

/unassign

zhuwenxing · 2024-09-13T02:51:24Z

same for creating index

[2024-09-12T17:42:53.629Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-09-12T17:42:53.629Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:42:53.629Z] AssertionError: Op.insert rto expect 10s but get 56.61560535430908s

[2024-09-12T17:42:53.629Z] assert False

[2024-09-12T17:42:53.629Z] 

[2024-09-12T17:42:53.629Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-09-12T17:42:53.629Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:42:53.629Z] AssertionError: Op.index rto expect 10s but get 856.4799315929413s

[2024-09-12T17:42:53.629Z] assert False

[2024-09-12T17:42:53.629Z] 

[2024-09-12T17:42:53.629Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-09-12T17:42:53.629Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:42:53.629Z] AssertionError: Op.search rto expect 10s but get 106.47509121894836s

[2024-09-12T17:42:53.629Z] assert False

[2024-09-12T17:42:53.629Z] 

[2024-09-12T17:42:53.629Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-09-12T17:42:53.629Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-09-12T17:42:53.629Z] AssertionError: Op.query rto expect 10s but get 106.66716194152832s

zhuwenxing · 2024-09-13T06:49:21Z

this also reproduced when upgrading from v2.4.5 to 2.4-20240912-ab31728b-amd64
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/4834/pipeline
log:

artifacts-kafka-mixcoord-4834-server-logs.tar.gz

zhuwenxing · 2024-11-01T09:45:13Z

In the recent 2.4.10 --> master-20241028-fc69df44-amd64 mixcoord rolling upgrade, the interruption time for search and query was around 11s.


[2024-10-28T17:41:02.697Z] FAILED testcases/test_concurrent_request_operation_for_rolling_update.py::TestOperations::test_operations - pytest_assume.plugin.FailedAssumption: 

[2024-10-28T17:41:02.697Z] 2 Failed Assumptions:

[2024-10-28T17:41:02.697Z] 

[2024-10-28T17:41:02.697Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-10-28T17:41:02.697Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-10-28T17:41:02.697Z] AssertionError: Op.search rto expect 10s but get 11.00616192817688s

[2024-10-28T17:41:02.697Z] assert False

[2024-10-28T17:41:02.697Z] 

[2024-10-28T17:41:02.697Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-10-28T17:41:02.697Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-10-28T17:41:02.697Z] AssertionError: Op.query rto expect 10s but get 11.004719734191895s

failed job:https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5410/pipeline

log:
artifacts-kafka-mixcoord-5410-server-logs.tar.gz

@weiliu1031
I feel that this result can still be optimized.

weiliu1031 · 2024-11-04T06:25:59Z

In the recent 2.4.10 --> master-20241028-fc69df44-amd64 mixcoord rolling upgrade, the interruption time for search and query was around 11s.


[2024-10-28T17:41:02.697Z] FAILED testcases/test_concurrent_request_operation_for_rolling_update.py::TestOperations::test_operations - pytest_assume.plugin.FailedAssumption: 

[2024-10-28T17:41:02.697Z] 2 Failed Assumptions:

[2024-10-28T17:41:02.697Z] 

[2024-10-28T17:41:02.697Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-10-28T17:41:02.697Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-10-28T17:41:02.697Z] AssertionError: Op.search rto expect 10s but get 11.00616192817688s

[2024-10-28T17:41:02.697Z] assert False

[2024-10-28T17:41:02.697Z] 

[2024-10-28T17:41:02.697Z] testcases/test_concurrent_request_operation_for_rolling_update.py:125: AssumptionFailure

[2024-10-28T17:41:02.697Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-10-28T17:41:02.697Z] AssertionError: Op.query rto expect 10s but get 11.004719734191895s

failed job:https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5410/pipeline

log: artifacts-kafka-mixcoord-5410-server-logs.tar.gz

@weiliu1031 I feel that this result can still be optimized.

known issue, should be fixed by #36880
please verify this with the the commit after this pr

weiliu1031 · 2024-11-04T06:26:07Z

/assign @zhuwenxing

zhuwenxing · 2024-11-05T07:41:35Z

still reproduced in v2.4.14 --> master-20241105-b83b376c-amd64

and downtime increased

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5443/pipeline

log:

artifacts-kafka-mixcoord-5443-server-logs.tar.gz


[2024-11-05T06:51:12.600Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[7-10] - pytest_assume.plugin.FailedAssumption: 

[2024-11-05T06:51:12.600Z] 4 Failed Assumptions:

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(

[2024-11-05T06:51:12.600Z] AssertionError: Expect Succ: Op.index succ rate 0.9759036144578314, total: 83, average time: 18.9162

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.index rto expect 10s but get 242.64889454841614s

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.search rto expect 10s but get 18.213006496429443s

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.query rto expect 10s but get 503.08261704444885s

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[3-10] - pytest_assume.plugin.FailedAssumption: 

[2024-11-05T06:51:12.600Z] 4 Failed Assumptions:

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(

[2024-11-05T06:51:12.600Z] AssertionError: Expect Succ: Op.index succ rate 0.9753086419753086, total: 81, average time: 19.0024

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.index rto expect 10s but get 264.7868928909302s

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.search rto expect 10s but get 503.0670373439789s

[2024-11-05T06:51:12.600Z] assert False

[2024-11-05T06:51:12.600Z] 

[2024-11-05T06:51:12.600Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T06:51:12.600Z] >>	pytest.assume(rto <= rto_threshold,  f"{k} rto expect {rto_threshold}s but get {rto}s")

[2024-11-05T06:51:12.600Z] AssertionError: Op.query rto expect 10s but get 18.21220326423645s

zhuwenxing · 2024-11-05T07:51:58Z

for standalone upgrade，only index failed

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5445/pipeline
log:
artifacts-kafka-standalone-5445-server-logs.tar.gz


[2024-11-05T07:44:42.318Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[1-10] - pytest_assume.plugin.FailedAssumption: 

[2024-11-05T07:44:42.318Z] 3 Failed Assumptions:

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(

[2024-11-05T07:44:42.318Z] AssertionError: Expect Succ: Op.index succ rate 0.25, total: 16, average time: 73.3134

[2024-11-05T07:44:42.318Z] assert False

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-05T07:44:42.318Z] AssertionError: IndexChecker__w3zvOK6i Op.index rto expect 10s but get 1441.7572689056396s, [('failure', 4, '2024-11-05 07:13:16.469496', 1730790796.4695222), ('failure', 5, '2024-11-05 07:15:18.316713', 1730790918.3167536), ('failure', 6, '2024-11-05 07:17:19.876768', 1730791039.8768048), ('failure', 7, '2024-11-05 07:19:20.999701', 1730791160.9997256), ('failure', 8, '2024-11-05 07:21:22.218705', 1730791282.2187417), ('failure', 9, '2024-11-05 07:23:23.670712', 1730791403.670735), ('failure', 10, '2024-11-05 07:25:24.974862', 1730791524.9748917), ('success', 11, '2024-11-05 07:27:26.117733', 1730791646.1177838), ('failure', 12, '2024-11-05 07:29:13.369737', 1730791753.369796), ('failure', 13, '2024-11-05 07:31:14.670728', 1730791874.6707768), ('failure', 14, '2024-11-05 07:33:15.822704', 1730791995.822736), ('failure', 15, '2024-11-05 07:35:16.960711', 1730792116.9607549), ('failure', 16, '2024-11-05 07:37:18.226725', 1730792238.2267911)]

[2024-11-05T07:44:42.318Z] assert False

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(

[2024-11-05T07:44:42.318Z] AssertionError: Expect Succ: Op.index succ rate 0.3333333333333333, total: 3, average time: 46.8264

[2024-11-05T07:44:42.318Z] assert False

[2024-11-05T07:44:42.318Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[8-10] - pytest_assume.plugin.FailedAssumption: 

[2024-11-05T07:44:42.318Z] 3 Failed Assumptions:

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(

[2024-11-05T07:44:42.318Z] AssertionError: Expect Succ: Op.index succ rate 0.1875, total: 16, average time: 75.0513

[2024-11-05T07:44:42.318Z] assert False

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-05T07:44:42.318Z] AssertionError: IndexChecker__He0D4Hs4 Op.index rto expect 10s but get 1556.2252717018127s, [('failure', 3, '2024-11-05 07:13:19.365238', 1730790799.3652759), ('failure', 4, '2024-11-05 07:15:20.702708', 1730790920.7027457), ('failure', 5, '2024-11-05 07:17:21.885715', 1730791041.885755), ('failure', 6, '2024-11-05 07:19:23.068731', 1730791163.068761), ('failure', 7, '2024-11-05 07:21:24.407175', 1730791284.407211), ('failure', 8, '2024-11-05 07:23:25.699772', 1730791405.699804), ('failure', 9, '2024-11-05 07:25:27.008729', 1730791527.0087748), ('failure', 10, '2024-11-05 07:27:28.121738', 1730791648.1217904), ('success', 11, '2024-11-05 07:29:29.502708', 1730791769.502753), ('failure', 12, '2024-11-05 07:31:10.339660', 1730791870.3396993), ('failure', 13, '2024-11-05 07:33:11.922663', 1730791991.9227295), ('failure', 14, '2024-11-05 07:35:13.253711', 1730792113.2537587), ('failure', 15, '2024-11-05 07:37:14.365715', 1730792234.3657625), ('failure', 16, '2024-11-05 07:39:15.590512', 1730792355.5905476)]

[2024-11-05T07:44:42.318Z] assert False

[2024-11-05T07:44:42.318Z] 

[2024-11-05T07:44:42.318Z] chaos_commons.py:124: AssumptionFailure

[2024-11-05T07:44:42.318Z] >>	pytest.assume(

[2024-11-05T07:44:42.318Z] AssertionError: Expect Succ: Op.index succ rate 0.0, total: 1, average time: 0.0000

xiaofan-luan · 2024-11-08T17:32:19Z

for rolling upgrade, 10s still seems to be a long time? 1s is expected since it is a graceful shut down

weiliu1031 · 2024-11-11T03:30:58Z

for rolling upgrade, 10s still seems to be a long time? 1s is expected since it is a graceful shut down

the recent search failure during rolling upgrade come from the mixcoord unavailable, which caused by the meta op latency.

the reason of high meta op latency is still in investigation

weiliu1031 · 2024-11-12T02:48:49Z

seems caused by the high workload on physical server, which caused etcd access issue, please verify this with same image

weiliu1031 · 2024-11-12T02:48:56Z

/assign @zhuwenxing

zhuwenxing · 2024-11-12T03:45:05Z

verifying with v2.4.14 --> master-20241111-fca946de-amd64

@weiliu1031

job link: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5474/pipeline/149/

log:
artifacts-kafka-mixcoord-5474-server-logs.tar.gz


[2024-11-11T17:42:06.750Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-11T17:42:06.750Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-11T17:42:06.750Z] AssertionError: IndexChecker__Q1iIhb7N Op.index rto expect 10s but get 242.92061233520508s, [('failure', 31, '2024-11-11 17:11:23.263264', 1731345083.2634275), ('failure', 32, '2024-11-11 17:13:24.797621', 1731345204.797688), ('success', 33, '2024-11-11 17:15:26.183990', 1731345326.1840398)]

[2024-11-11T17:42:06.750Z] assert False

[2024-11-11T17:42:06.750Z] 

[2024-11-11T17:42:06.750Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-11T17:42:06.750Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-11T17:42:06.750Z] AssertionError: SearchChecker__7VvXx7SL Op.search rto expect 10s but get 40.26463198661804s, [('failure', 395, '2024-11-11 17:10:35.015949', 1731345035.0160327), ('failure', 396, '2024-11-11 17:10:42.265899', 1731345042.265943), ('failure', 397, '2024-11-11 17:10:53.269408', 1731345053.2694542), ('failure', 398, '2024-11-11 17:11:04.275540', 1731345064.2756026), ('success', 399, '2024-11-11 17:11:15.280606', 1731345075.2806647)]

[2024-11-11T17:42:06.750Z] assert False

[2024-11-11T17:42:06.750Z] 

[2024-11-11T17:42:06.750Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-11T17:42:06.750Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-11T17:42:06.750Z] AssertionError: QueryChecker__2q3AeVn7 Op.query rto expect 10s but get 44.01552772521973s, [('failure', 395, '2024-11-11 17:10:34.816652', 1731345034.8167083), ('failure', 396, '2024-11-11 17:10:45.820608', 1731345045.820646), ('failure', 397, '2024-11-11 17:10:56.823408', 1731345056.823445), ('failure', 398, '2024-11-11 17:11:07.827918', 1731345067.8279743), ('success', 399, '2024-11-11 17:11:18.832177', 1731345078.832236)]

query/search start to fail at 2024-11-11 17:10:34.816652

at this time point, mixcoord is starting to upgrade

[2024-11-11T17:10:01.776Z] [2024-11-11 17:09:54 - INFO - ci_test]: update image for component mixCoord (test_rolling_update_one_by_one.py:220)
[2024-11-11T17:10:01.776Z] [2024-11-11 17:09:54 - INFO - ci_test]: cmd: kubectl patch Milvus kafka-mixcoord-5474 --patch-file /home/jenkins/agent/workspace/tests/python_client/deploy/milvus_crd_modified.yaml --type merge (test_rolling_update_one_by_one.py:139)
[2024-11-11T17:10:01.776Z] [2024-11-11 17:09:54 - INFO - ci_test]: kubectl patch Milvus kafka-mixcoord-5474 --patch-file /home/jenkins/agent/workspace/tests/python_client/deploy/milvus_crd_modified.yaml --type merge
[2024-11-11T17:10:01.776Z] milvus.milvus.io/kafka-mixcoord-5474 patched
[2024-11-11T17:10:01.776Z] 
[2024-11-11T17:10:01.776Z]  (test_rolling_update_one_by_one.py:143)
[2024-11-11T17:10:01.776Z] [2024-11-11 17:09:54 - INFO - ci_test]: [update image for mixCoord]wait 10s after rolling update patch (test_rolling_update_one_by_one.py:232)
[2024-11-11T17:10:04.290Z] [2024-11-11 17:10:04 - INFO - ci_test]: cmd: kubectl get pod|grep kafka-mixcoord-5474 (test_rolling_update_one_by_one.py:139)
[2024-11-11T17:10:04.545Z] [2024-11-11 17:10:04 - INFO - ci_test]: kubectl get pod|grep kafka-mixcoord-5474
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-etcd-0                                       1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-etcd-1                                       1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-etcd-2                                       1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-0                                      2/2     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-1                                      2/2     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-2                                      2/2     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-exporter-66487bd99b-x9llp              1/1     Running             3 (12m ago)      13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-zookeeper-0                            1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-zookeeper-1                            1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-kafka-zookeeper-2                            1/1     Running             0                13m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-datanode-775996798d-89vw4             1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-datanode-775996798d-mcdxb             1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-datanode-775996798d-tk6q7             1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-indexnode-5868f7b8d6-btcvm            1/1     Running             0                4m56s
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-indexnode-5868f7b8d6-m9qlq            1/1     Running             0                4m15s
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-indexnode-5868f7b8d6-rjcrd            1/1     Running             0                3m35s
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-mixcoord-7b76cb6bb8-fn99q             0/1     Running             0                10s
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-mixcoord-84b57479b4-w2wx6             1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-proxy-7d46bd48b8-mwgp9                1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-querynode-0-6f84cf6649-2hx67          1/1     Running             0                11m
[2024-11-11T17:10:04.545Z] kafka-mixcoord-5474-milvus-querynode-0-6f84cf6649-bnh7t          1/1     Running             0                11m
[2024-11-11T17:10:04.546Z] kafka-mixcoord-5474-milvus-querynode-0-6f84cf6649-hm2wx          1/1     Running             0                11m
[2024-11-11T17:10:04.546Z] kafka-mixcoord-5474-minio-0                                      1/1     Running             0                13m
[2024-11-11T17:10:04.546Z] kafka-mixcoord-5474-minio-1                                      1/1     Running             0                13m
[2024-11-11T17:10:04.546Z] kafka-mixcoord-5474-minio-2                                      1/1     Running             0                13m
[2024-11-11T17:10:04.546Z] kafka-mixcoord-5474-minio-3                                      1/1     Running             0                13m

meta op latency increase also happened at this time

xiaofan-luan · 2024-11-12T19:55:39Z

the weird part is meta op put fail.

because even under recovery the most operation done is get or scan rather than put/txn.

@weiliu1031 I think we need more clues on:

how many keys on their in the etcd? is there a frequent ddl on this cases? seems the rootcoord meta recovery is slow as well, my guess is that the snapshot kv is not cleaned as expect.
we need more monitoring on the cloud console, not only the latency , but also about the QPS.
need to add more logs on coord recovery and analyze on the slowest part

weiliu1031 · 2024-11-13T10:01:04Z

kafka-mixcoord-5474

we already have the etcd op latency/qps/throughput
and there are at most 15.5k/s etcd get op during mixcoord, which is much more than expected

liliu-z · 2024-11-13T13:58:55Z

Can we log some of them out for debugging purpose?

weiliu1031 · 2024-11-14T06:45:56Z

we found that the unexpected etcd qps/latency info is not a real problem, it actually come from the dashboard misuse. the old dashboard config just compute the rate by op_type, so during rolling upgrade, during the period when 2 mixcoord exists, the dashboard will sum 2 mixcoord's metrics value. and after we correct the dashboard to histogram_quantile(0.99, sum by (le, meta_op_type, pod) (rate(milvus_meta_request_latency_bucket{app_kubernetes_io_instance=~"$instance", app_kubernetes_io_name="$app_name", namespace="$namespace"}[2m]))) the dashboard is back to normal.

we also can verify this dashboard issue by log, because we log all etcd slow operation with keyword Slow etcd operation. the dash board shows the highest etcd op latency is more than 2 minitues, but no Slow etcd operation shows in logs

weiliu1031 · 2024-11-14T06:56:27Z

back to the original issue, rootcoord cost 10s to load meta done, which will cause a 10s unavailable for mixcoord during rolling upgrade. and after we fix the etcd operation latency, we found the average get latency during rolling upgrade is about 15ms. and the cluster has 80 collections, then ListCollections need more than 240 etcd get op, which will cost more than 3.6s. so the original issue should related to collection number, and with much more collections, rootcoord may cost much more times to load meta from etcd. we will increase collection number to verify this

xiaofan-luan · 2024-11-15T02:02:06Z

回到最初的问题，rootcoord 需要 10 秒才能加载完 meta，这将导致滚动升级期间 mixcoord 需要 10 秒才能使用。在我们修复了 etcd 操作延迟之后，我们发现get latency滚动升级期间的平均延迟约为15ms。并且集群有 80 个 collection，那么ListCollections需要超过 240 个 etcd get op，这将花费超过3.6s。所以最初的问题应该与 collection 数量有关，如果 collection 数量更多，rootcoord 可能需要更多时间从 etcd 加载 meta。我们将增加 collection 数量来验证这一点

15ms seems to be too long for etcd.

if that's the problem we might need to think of concurrently recover data from etcd or lazy load

liliu-z · 2024-11-15T03:03:50Z

回到最初的问题，rootcoord 需要 10 秒才能加载完 meta，这将导致滚动升级期间 mixcoord 需要 10 秒才能使用。在我们修复了 etcd 操作延迟之后，我们发现get latency滚动升级期间的平均延迟约为15ms。并且集群有 80 个 collection，那么ListCollections需要超过 240 个 etcd get op，这将花费超过3.6s。所以最初的问题应该与 collection 数量有关，如果 collection 数量更多，rootcoord 可能需要更多时间从 etcd 加载 meta。我们将增加 collection 数量来验证这一点

15ms seems to be too long for etcd.

if that's the problem we might need to think of concurrently recover data from etcd or lazy load

Verified large number of collections will not trigger longer loading time. Need more logs to trouble shooting, suggest to mark it as unblocking

xiaofan-luan · 2024-11-15T20:40:41Z

ok @weiliu1031
please keep working on it since it is very important on failure recovery time and stability

weiliu1031 · 2024-11-16T09:47:50Z

ok @weiliu1031 please keep working on it since it is very important on failure recovery time and stability

use batch could solve the reload cost

issue: #36228 pr: #37742 Signed-off-by: Wei Liu <[email protected]>

issue: #36228 Signed-off-by: Wei Liu <[email protected]>

weiliu1031 · 2024-11-18T11:01:59Z

please verify this with latest image

weiliu1031 · 2024-11-18T11:02:05Z

/assign @zhuwenxing

zhuwenxing · 2024-11-21T02:31:32Z

Kafka as an mq rolling upgrade is blocked, Pulsar will be used as mq for verification.

zhuwenxing · 2024-11-21T05:18:34Z

still reproduced in 2.4-20241120-5ac4e483-amd64 --> master-20241120-c73ca14a-amd64
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5530/pipeline
log:

artifacts-pulsar-mixcoord-5530-server-logs.tar.gz


[2024-11-21T03:00:28.514Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-21T03:00:28.514Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-21T03:00:28.514Z] AssertionError: SearchChecker__hwZSxMqL Op.search rto expect 10s but get 130.49712324142456s, [('failure', 916, '2024-11-21 02:48:08.010751', 1732157288.0107899), ('success', 917, '2024-11-21 02:48:19.015752', 1732157299.0157807), ('failure', 921, '2024-11-21 02:48:28.459805', 1732157308.4598365), ('failure', 922, '2024-11-21 02:48:39.463754', 1732157319.4637787), ('failure', 923, '2024-11-21 02:48:50.467741', 1732157330.4677663), ('failure', 924, '2024-11-21 02:49:01.471753', 1732157341.4717753), ('failure', 925, '2024-11-21 02:49:12.475750', 1732157352.4757862), ('failure', 926, '2024-11-21 02:49:23.479765', 1732157363.4797904), ('failure', 927, '2024-11-21 02:49:34.483755', 1732157374.4837914), ('failure', 928, '2024-11-21 02:49:45.487743', 1732157385.4877672), ('failure', 929, '2024-11-21 02:49:56.491737', 1732157396.4917598), ('failure', 930, '2024-11-21 02:50:07.496769', 1732157407.4967978), ('success', 931, '2024-11-21 02:50:18.507805', 1732157418.507913)]

[2024-11-21T03:00:28.514Z] assert False

[2024-11-21T03:00:28.514Z] 

[2024-11-21T03:00:28.514Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-11-21T03:00:28.514Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-11-21T03:00:28.514Z] AssertionError: QueryChecker__sFO0meaK Op.query rto expect 10s but get 548.6190268993378s, [('failure', 921, '2024-11-21 02:48:08.014770', 1732157288.0147953), ('success', 922, '2024-11-21 02:48:19.017779', 1732157299.017805), ('failure', 926, '2024-11-21 02:48:28.453833', 1732157308.4538553), ('failure', 927, '2024-11-21 02:48:39.456765', 1732157319.4567926), ('failure', 928, '2024-11-21 02:48:50.459782', 1732157330.4598181), ('failure', 929, '2024-11-21 02:49:01.463224', 1732157341.4632468), ('failure', 930, '2024-11-21 02:49:12.466047', 1732157352.4660752), ('failure', 931, '2024-11-21 02:49:23.469060', 1732157363.4690855), ('failure', 932, '2024-11-21 02:49:34.471773', 1732157374.4718187), ('failure', 933, '2024-11-21 02:49:45.474791', 1732157385.4748273), ('failure', 934, '2024-11-21 02:49:56.478008', 1732157396.4780388), ('failure', 935, '2024-11-21 02:50:07.481493', 1732157407.4815269), ('failure', 936, '2024-11-21 02:50:18.484792', 1732157418.484839), ('success', 937, '2024-11-21 02:50:29.487811', 1732157429.4878602), ('failure', 1240, '2024-11-21 02:57:16.633790', 1732157836.6338222)]

[2024-11-21T03:00:28.514Z] assert False

weiliu1031 · 2024-11-21T06:26:27Z

new issue found

search failed due to tSafe latency

and send insert message to mq cost more than 2minutes, which cause the high tsafe latency

liliu-z · 2024-11-21T12:00:48Z

Saw big pulsar lag but didn't see any obvious signal on resource usage side. Let's turn on pulsar's metrics and reevaluate

/assign @yanliang567 @zhuwenxing

xiaofan-luan · 2024-11-21T23:42:24Z

@zhuwenxing

check disk usage? there maybe disk performance issue on our test environment

zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 13, 2024

zhuwenxing assigned yanliang567 Sep 13, 2024

sre-ci-robot assigned weiliu1031 Sep 13, 2024

sre-ci-robot unassigned yanliang567 Sep 13, 2024

yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 13, 2024

yanliang567 added this to the 2.5.0 milestone Sep 13, 2024

sre-ci-robot assigned zhuwenxing Nov 4, 2024

zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Nov 5, 2024

yanliang567 unassigned zhuwenxing Nov 5, 2024

sre-ci-robot assigned zhuwenxing Nov 12, 2024

yanliang567 unassigned zhuwenxing Nov 13, 2024

weiliu1031 mentioned this issue Nov 16, 2024

enhance: Use batch to speed up list collections from meta kv #37742

Merged

weiliu1031 mentioned this issue Nov 17, 2024

enhance: Use batch to speed up list collections from meta kv #37752

Merged

sre-ci-robot pushed a commit that referenced this issue Nov 18, 2024

enhance: Use batch to speed up list collections from meta kv (#37752)

79f676e

issue: #36228 pr: #37742 Signed-off-by: Wei Liu <[email protected]>

sre-ci-robot pushed a commit that referenced this issue Nov 18, 2024

enhance: Use batch to speed up list collections from meta kv (#37742)

226fe90

issue: #36228 Signed-off-by: Wei Liu <[email protected]>

sre-ci-robot assigned zhuwenxing Nov 18, 2024

sre-ci-robot assigned yanliang567 Nov 21, 2024

liliu-z mentioned this issue Nov 21, 2024

[Bug]: After Milvus recovered from standalone pod failure chaos, many interface responses became significantly slower, and listing collections directly resulted in errors #37828

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: During rolling upgrades of the query node under high-load workloads, there may be approximately 100 seconds of interruption in search/query services. #36228

[Bug]: During rolling upgrades of the query node under high-load workloads, there may be approximately 100 seconds of interruption in search/query services. #36228

zhuwenxing commented Sep 13, 2024

zhuwenxing commented Sep 13, 2024

yanliang567 commented Sep 13, 2024

zhuwenxing commented Sep 13, 2024

zhuwenxing commented Sep 13, 2024

zhuwenxing commented Nov 1, 2024

weiliu1031 commented Nov 4, 2024

weiliu1031 commented Nov 4, 2024

zhuwenxing commented Nov 5, 2024

zhuwenxing commented Nov 5, 2024

xiaofan-luan commented Nov 8, 2024

weiliu1031 commented Nov 11, 2024

weiliu1031 commented Nov 12, 2024

weiliu1031 commented Nov 12, 2024

zhuwenxing commented Nov 12, 2024

xiaofan-luan commented Nov 12, 2024

weiliu1031 commented Nov 13, 2024 •

edited

Loading

liliu-z commented Nov 13, 2024 •

edited

Loading

weiliu1031 commented Nov 14, 2024

weiliu1031 commented Nov 14, 2024

xiaofan-luan commented Nov 15, 2024

liliu-z commented Nov 15, 2024

xiaofan-luan commented Nov 15, 2024

weiliu1031 commented Nov 16, 2024

weiliu1031 commented Nov 18, 2024

weiliu1031 commented Nov 18, 2024

zhuwenxing commented Nov 21, 2024

zhuwenxing commented Nov 21, 2024

weiliu1031 commented Nov 21, 2024 •

edited

Loading

liliu-z commented Nov 21, 2024 •

edited

Loading

xiaofan-luan commented Nov 21, 2024

[Bug]: During rolling upgrades of the query node under high-load workloads, there may be approximately 100 seconds of interruption in search/query services. #36228

[Bug]: During rolling upgrades of the query node under high-load workloads, there may be approximately 100 seconds of interruption in search/query services. #36228

Comments

zhuwenxing commented Sep 13, 2024

Is there an existing issue for this?

Environment

Current Behavior

Expected Behavior

Steps To Reproduce

Milvus Log

Anything else?

zhuwenxing commented Sep 13, 2024

yanliang567 commented Sep 13, 2024

zhuwenxing commented Sep 13, 2024

zhuwenxing commented Sep 13, 2024

zhuwenxing commented Nov 1, 2024

weiliu1031 commented Nov 4, 2024

weiliu1031 commented Nov 4, 2024

zhuwenxing commented Nov 5, 2024

zhuwenxing commented Nov 5, 2024

xiaofan-luan commented Nov 8, 2024

weiliu1031 commented Nov 11, 2024

weiliu1031 commented Nov 12, 2024

weiliu1031 commented Nov 12, 2024

zhuwenxing commented Nov 12, 2024

xiaofan-luan commented Nov 12, 2024

weiliu1031 commented Nov 13, 2024 • edited Loading

liliu-z commented Nov 13, 2024 • edited Loading

weiliu1031 commented Nov 14, 2024

weiliu1031 commented Nov 14, 2024

xiaofan-luan commented Nov 15, 2024

liliu-z commented Nov 15, 2024

xiaofan-luan commented Nov 15, 2024

weiliu1031 commented Nov 16, 2024

weiliu1031 commented Nov 18, 2024

weiliu1031 commented Nov 18, 2024

zhuwenxing commented Nov 21, 2024

zhuwenxing commented Nov 21, 2024

weiliu1031 commented Nov 21, 2024 • edited Loading

liliu-z commented Nov 21, 2024 • edited Loading

xiaofan-luan commented Nov 21, 2024

weiliu1031 commented Nov 13, 2024 •

edited

Loading

liliu-z commented Nov 13, 2024 •

edited

Loading

weiliu1031 commented Nov 21, 2024 •

edited

Loading

liliu-z commented Nov 21, 2024 •

edited

Loading