Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: After Milvus recovered from standalone pod failure chaos, many interface responses became significantly slower, and listing collections directly resulted in errors #37828

Open
1 task done
zhuwenxing opened this issue Nov 20, 2024 · 5 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20241119-fceff6ed-amd64
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

image


[2024-11-19T08:30:09.064Z] [2024-11-19 08:29:28 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': '10.255.245.8', 'port': 19530} (api_request.py:62)

[2024-11-19T08:30:09.064Z] [2024-11-19 08:29:28 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-11-19T08:30:09.064Z] [2024-11-19 08:29:28 - INFO - ci_test]: server version: fceff6e (client_base.py:166)

[2024-11-19T08:30:09.064Z] [2024-11-19 08:29:28 - DEBUG - ci_test]: (api_request)  : [list_collections] args: [20, 'default'], kwargs: {} (api_request.py:62)

[2024-11-19T08:30:09.064Z] [2024-11-19 08:29:48 - ERROR - pymilvus.decorators]: grpc RpcError: [list_collections], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-11-19 08:29:28.689044', 'gRPC error': '2024-11-19 08:29:48.690474'}> (decorators.py:151)

[2024-11-19T08:30:09.064Z] [2024-11-19 08:29:48 - ERROR - ci_test]: Traceback (most recent call last):

[2024-11-19T08:30:09.064Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-11-19T08:30:09.064Z]     res = func(*args, **_kwargs)

[2024-11-19T08:30:09.064Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-11-19T08:30:09.064Z]     return func(*arg, **kwargs)

[2024-11-19T08:30:09.064Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/utility.py", line 558, in list_collections

[2024-11-19T08:30:09.064Z]     return _get_connection(using).list_collections(timeout=timeout)

[2024-11-19T08:30:09.064Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 155, in handler

[2024-11-19T08:30:09.064Z]     raise e from e

[2024-11-19T08:30:09.064Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 137, in handler

[2024-11-19T08:30:09.064Z]     return func(*args, **kwargs)

[2024-11-19T08:30:09.064Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 176, in handler

[2024-11-19T08:30:09.064Z]     return func(self, *args, **kwargs)

[2024-11-19T08:30:09.064Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 90, in handler

[2024-11-19T08:30:09.064Z]     raise e from e

[2024-11-19T08:30:09.064Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 86, in handler

[2024-11-19T08:30:09.064Z]     return func(*args, **kwargs)

[2024-11-19T08:30:09.064Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 373, in list_collections

[2024-11-19T08:30:09.064Z]     response = rf.result()

[2024-11-19T08:30:09.064Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 881, in result

[2024-11-19T08:30:09.064Z]     raise self

[2024-11-19T08:30:09.064Z] grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:

[2024-11-19T08:30:09.064Z] 	status = StatusCode.DEADLINE_EXCEEDED

[2024-11-19T08:30:09.064Z] 	details = "Deadline Exceeded"

[2024-11-19T08:30:09.064Z] 	debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-11-19T08:29:48.68996819+00:00", grpc_status:4}"

[2024-11-19T08:30:09.064Z] >

[2024-11-19T08:30:09.064Z]  (api_request.py:45)

[2024-11-19T08:30:09.064Z] [2024-11-19 08:29:48 - ERROR - ci_test]: (api_response) : <_MultiThreadedRendezvous of RPC that terminated with:

[2024-11-19T08:30:09.064Z] 	status = StatusCode.DEADLINE_EXCEEDED

[2024-11-19T08:30:09.064Z] 	details = "Deadline Exceeded"

[2024-11-19T08:30:09.064Z] 	debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-11-19T08:29:48.68996819+00:00", grpc_status:4}"

[2024-11-19T08:30:09.064Z] > (api_request.py:46)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/18586/pipeline

log:
artifacts-standalone-pod-failure-18586-server-logs.tar.gz

cluster: 4am
ns: chaos-testing
pod info

[2024-11-19T08:29:16.601Z] + kubectl get pods -o wide

[2024-11-19T08:29:16.605Z] + grep standalone-pod-failure-18586

[2024-11-19T08:29:17.167Z] standalone-pod-failure-18586-etcd-0                               1/1     Running            0                 58m     10.104.21.56    4am-node24   <none>           <none>

[2024-11-19T08:29:17.167Z] standalone-pod-failure-18586-etcd-1                               1/1     Running            0                 58m     10.104.27.43    4am-node31   <none>           <none>

[2024-11-19T08:29:17.167Z] standalone-pod-failure-18586-etcd-2                               1/1     Running            0                 58m     10.104.34.242   4am-node37   <none>           <none>

[2024-11-19T08:29:17.167Z] standalone-pod-failure-18586-kafka-0                              2/2     Running            0                 58m     10.104.21.64    4am-node24   <none>           <none>

[2024-11-19T08:29:17.167Z] standalone-pod-failure-18586-kafka-1                              2/2     Running            0                 58m     10.104.27.44    4am-node31   <none>           <none>

[2024-11-19T08:29:17.167Z] standalone-pod-failure-18586-kafka-2                              2/2     Running            0                 58m     10.104.26.155   4am-node32   <none>           <none>

[2024-11-19T08:29:17.167Z] standalone-pod-failure-18586-kafka-exporter-5b9c774b96-gn9mj      1/1     Running            4 (57m ago)       58m     10.104.18.203   4am-node25   <none>           <none>

[2024-11-19T08:29:17.167Z] standalone-pod-failure-18586-milvus-standalone-86dd4ff76-wfb8b    1/1     Running            4 (30m ago)       58m     10.104.18.210   4am-node25   <none>           <none>

[2024-11-19T08:29:17.168Z] standalone-pod-failure-18586-minio-8568fbdfc4-v5qvf               1/1     Running            0                 58m     10.104.21.57    4am-node24   <none>           <none>

[2024-11-19T08:29:17.168Z] standalone-pod-failure-18586-zookeeper-0                          1/1     Running            0                 58m     10.104.21.58    4am-node24   <none>           <none>

[2024-11-19T08:29:17.168Z] standalone-pod-failure-18586-zookeeper-1                          1/1     Running            0                 58m     10.104.34.240   4am-node37   <none>           <none>

[2024-11-19T08:29:17.168Z] standalone-pod-failure-18586-zookeeper-2                          1/1     Running            0                 58m     10.104.33.76    4am-node36   <none>           <none>

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 20, 2024
@yanliang567
Copy link
Contributor

/assign @weiliu1031
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 20, 2024
@yanliang567 yanliang567 added this to the 2.5.0 milestone Nov 20, 2024
@liliu-z
Copy link
Member

liliu-z commented Nov 20, 2024

/assign

@liliu-z
Copy link
Member

liliu-z commented Nov 20, 2024

/assign @czs007
/unassign @weiliu1031

@czs007
Copy link
Collaborator

czs007 commented Nov 20, 2024

2024/11/19 08:41:37.811 +00:00] [DEBUG] [querynodev2/services.go:73] ["QueryNode current state"] [NodeID=2] [StateCode=Healthy]
[2024/11/19 08:41:38.048 +00:00] [WARN] [kafka/kafka_consumer.go:147] ["consume msg failed"] [topic=by-dev-rootcoord-dml_11] [groupID=querynode-2-by-dev-rootcoord-dml_11_454033793065037702v0-true] [error="Local: Timed out"]
[2024/11/19 08:41:38.049 +00:00] [WARN] [kafka/kafka_consumer.go:147] ["consume msg failed"] [topic=by-dev-rootcoord-dml_15] [groupID=datanode-2-by-dev-rootcoord-dml_15_454033793065039833v0-true] [error="Local: Timed out"]
[2024/11/19 08:41:38.050 +00:00] [WARN] [kafka/kafka_consumer.go:147] ["consume msg failed"] [topic=by-dev-rootcoord-dml_8] [groupID=datanode-2-by-dev-rootcoord-dml_8_454033793065037470v1-true] [error="Local: Timed out"]
[2024/11/19 08:41:38.051 +00:00] [WARN] [kafka/kafka_consumer.go:147] ["consume msg failed"] [topic=by-dev-rootcoord-dml_1] [groupID=querynode-2-by-dev-rootcoord-dml_1_454033793065034312v0-true] [error="Local: Timed out"]
[2024/11/19 08:41:38.051 +00:00] [WARN] [kafka/kafka_consumer.go:147] ["consume msg failed"] [topic=by-dev-rootcoord-dml_14] [groupID=querynode-2-by-dev-rootcoord-dml_14_454033793065037821v1-true] [error="Local: Timed out"]
[2024/11/19 08:41:38.051 +00:00] [WARN] [kafka/kafka_consumer.go:147] ["consume msg failed"] [topic=by-dev-rootcoord-dml_13] [groupID=datanode-2-by-dev-rootcoord-dml_13_454033793065037821v0-true] [error="Local: Timed out"]
[2024/11/19 08:41:38.051 +00:00] [WARN] [kafka/kafka_consumer.go:147] ["consume msg failed"] [topic=by-dev-rootcoord-dml_6] [groupID=querynode-2-by-dev-rootcoord-dml_6_454033793065037462v1-true] [error="Local: Timed out"]
[2024/11/19 08:41:38.052 +00:00] [WARN] [kafka/kafka_consumer.go:147] ["consume msg failed"] [topic=by-dev-rootcoord-dml_12] [groupID=datanode-2-by-dev-rootcoord-dml_12_454033793065037702v1-true] [error="Local: Timed out"]
[2024/11/19 08:41:38.056 +00:00] [WARN] [rootcoord/timeticksync.go:314] ["rootcoord send tt to all channels too slowly"] [chanNum=16] [span=90013]

span=90013 milliseconds = 90 seconds

Kafka timed out, resulting in a significant channel CP lag.

The Search uses Strong consistency, causing it to queue up, awaiting the advancement of tsafe.

@liliu-z
Copy link
Member

liliu-z commented Nov 21, 2024

Saw big lag on kafka but didn't see any obvious resource racing. Need to rerun with metrics turned on
Similar to #36228

/unassign @czs007
/assign @zhuwenxing

@sre-ci-robot sre-ci-robot assigned zhuwenxing and unassigned czs007 Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

5 participants