[Bug]: After mixcoord rolling upgrade, a restart occurs, and then it remains in an unhealthy state and cannot be recovered #37432

Closed
1 task done
zhuwenxing opened this issue Nov 5, 2024 · 5 comments
Assignees
Labels
kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. test/rolling upgrade

Comments

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.14 --> master-20241104-f54cf418-amd64
- Deployment mode(standalone or cluster): mixcoord
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The mixcoord rolling update finished and the new mixcoord pod was 1/1 ready:

[2024-11-04T17:09:51.958Z] [2024-11-04 17:09:51 - INFO - ci_test]: [update image for mixCoord]wait 10s for milvus ready (test_rolling_update_one_by_one.py:267)
[2024-11-04T17:10:01.884Z] [2024-11-04 17:10:01 - INFO - ci_test]: cmd: kubectl get pod|grep kafka-mixcoord-5441 (test_rolling_update_one_by_one.py:139)
[2024-11-04T17:10:02.143Z] [2024-11-04 17:10:02 - INFO - ci_test]: kubectl get pod|grep kafka-mixcoord-5441
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-etcd-0                                       1/1     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-etcd-1                                       1/1     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-etcd-2                                       1/1     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-kafka-0                                      2/2     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-kafka-1                                      2/2     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-kafka-2                                      2/2     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-kafka-exporter-69f96d6dd4-npfhr              1/1     Running       4 (13m ago)        14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-kafka-zookeeper-0                            1/1     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-kafka-zookeeper-1                            1/1     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-kafka-zookeeper-2                            1/1     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-datanode-79f578d74b-hctnr             1/1     Running       0                  12m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-datanode-79f578d74b-w2vjf             1/1     Running       0                  12m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-datanode-79f578d74b-w7nqj             1/1     Running       0                  12m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-indexnode-b7675b649-4nfr5             1/1     Running       0                  4m14s
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-indexnode-b7675b649-dklmj             1/1     Running       0                  4m55s
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-indexnode-b7675b649-smj6m             1/1     Running       0                  5m36s
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-mixcoord-6c85784f75-g7zzc             1/1     Running       0                  51s
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-proxy-55c9d74c58-c227r                1/1     Running       0                  12m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-querynode-0-f8bb9c567-ghld9           1/1     Running       0                  12m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-querynode-0-f8bb9c567-vcqqc           1/1     Running       0                  12m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-milvus-querynode-0-f8bb9c567-vgm4k           1/1     Running       0                  12m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-minio-0                                      1/1     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-minio-1                                      1/1     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-minio-2                                      1/1     Running       0                  14m
[2024-11-04T17:10:02.145Z] kafka-mixcoord-5441-minio-3                                      1/1     Running       0                  14m

About a minute later, the mixcoord pod restarted and became not ready (0/1):

[2024-11-04T17:11:23.641Z] [2024-11-04 17:11:23 - INFO - ci_test]: [update image for mixCoord]status not stable, continue waiting (test_rolling_update_one_by_one.py:265)
[2024-11-04T17:11:23.641Z] [2024-11-04 17:11:23 - INFO - ci_test]: cmd: kubectl get pod|grep kafka-mixcoord-5441 (test_rolling_update_one_by_one.py:139)
[2024-11-04T17:11:23.641Z] [2024-11-04 17:11:23 - INFO - ci_test]: kubectl get pod|grep kafka-mixcoord-5441
[2024-11-04T17:11:23.641Z] kafka-mixcoord-5441-etcd-0                                       1/1     Running            0                 15m
[2024-11-04T17:11:23.641Z] kafka-mixcoord-5441-etcd-1                                       1/1     Running            0                 15m
[2024-11-04T17:11:23.641Z] kafka-mixcoord-5441-etcd-2                                       1/1     Running            0                 15m
[2024-11-04T17:11:23.641Z] kafka-mixcoord-5441-kafka-0                                      2/2     Running            0                 15m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-kafka-1                                      2/2     Running            0                 15m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-kafka-2                                      2/2     Running            0                 15m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-kafka-exporter-69f96d6dd4-npfhr              1/1     Running            4 (14m ago)       15m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-kafka-zookeeper-0                            1/1     Running            0                 15m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-kafka-zookeeper-1                            1/1     Running            0                 15m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-kafka-zookeeper-2                            1/1     Running            0                 15m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-datanode-79f578d74b-hctnr             1/1     Running            0                 14m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-datanode-79f578d74b-w2vjf             1/1     Running            0                 14m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-datanode-79f578d74b-w7nqj             1/1     Running            0                 14m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-indexnode-b7675b649-4nfr5             1/1     Running            0                 5m36s
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-indexnode-b7675b649-dklmj             1/1     Running            0                 6m17s
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-indexnode-b7675b649-smj6m             1/1     Running            0                 6m58s
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-mixcoord-6c85784f75-g7zzc             0/1     Running            1 (8s ago)        2m13s
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-proxy-55c9d74c58-c227r                1/1     Running            0                 14m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-querynode-0-f8bb9c567-ghld9           1/1     Running            0                 14m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-querynode-0-f8bb9c567-vcqqc           1/1     Running            0                 14m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-milvus-querynode-0-f8bb9c567-vgm4k           1/1     Running            0                 14m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-minio-0                                      1/1     Running            0                 15m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-minio-1                                      1/1     Running            0                 15m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-minio-2                                      1/1     Running            0                 15m
[2024-11-04T17:11:23.642Z] kafka-mixcoord-5441-minio-3                                      1/1     Running            0                 15m

After 4 minutes, mixcoord was still unhealthy (0/1 ready, no further restarts):

[2024-11-04T17:15:19.405Z] [2024-11-04 17:15:18 - INFO - ci_test]: [update image for mixCoord]wait 10s for milvus ready (test_rolling_update_one_by_one.py:267)
[2024-11-04T17:15:29.346Z] [2024-11-04 17:15:28 - INFO - ci_test]: cmd: kubectl get pod|grep kafka-mixcoord-5441 (test_rolling_update_one_by_one.py:139)
[2024-11-04T17:15:29.347Z] [2024-11-04 17:15:29 - INFO - ci_test]: kubectl get pod|grep kafka-mixcoord-5441
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-etcd-0                                       1/1     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-etcd-1                                       1/1     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-etcd-2                                       1/1     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-kafka-0                                      2/2     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-kafka-1                                      2/2     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-kafka-2                                      2/2     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-kafka-exporter-69f96d6dd4-npfhr              1/1     Running       4 (18m ago)        19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-kafka-zookeeper-0                            1/1     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-kafka-zookeeper-1                            1/1     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-kafka-zookeeper-2                            1/1     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-datanode-79f578d74b-hctnr             1/1     Running       0                  18m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-datanode-79f578d74b-w2vjf             1/1     Running       0                  18m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-datanode-79f578d74b-w7nqj             1/1     Running       0                  18m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-indexnode-b7675b649-4nfr5             1/1     Running       0                  9m41s
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-indexnode-b7675b649-dklmj             1/1     Running       0                  10m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-indexnode-b7675b649-smj6m             1/1     Running       0                  11m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-mixcoord-6c85784f75-g7zzc             0/1     Running       1 (4m13s ago)      6m18s
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-proxy-55c9d74c58-c227r                1/1     Running       0                  18m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-querynode-0-f8bb9c567-ghld9           1/1     Running       0                  18m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-querynode-0-f8bb9c567-vcqqc           1/1     Running       0                  18m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-querynode-0-f8bb9c567-vgm4k           1/1     Running       0                  18m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-minio-0                                      1/1     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-minio-1                                      1/1     Running       0                  19m
[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-minio-2                                      1/1     Running       0                  19m
[2024-11-04T17:15:29.348Z] kafka-mixcoord-5441-minio-3                                      1/1     Running       0                  19m
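
The pod listings above can also be checked mechanically. The following is a minimal sketch, not the code used in test_rolling_update_one_by_one.py: the names `PodStatus`, `parse_pods`, and `unhealthy` are hypothetical, and it assumes each row has the `NAME READY STATUS RESTARTS AGE` layout shown above, optionally prefixed by a bracketed CI timestamp.

```python
import re
from typing import NamedTuple


class PodStatus(NamedTuple):
    """One row of `kubectl get pod` output (hypothetical helper type)."""
    name: str
    ready: int
    total: int
    status: str
    restarts: int


# One pod row, optionally prefixed by a CI timestamp such as
# "[2024-11-04T17:15:29.347Z] ". Test-framework log lines do not match
# because they lack the READY column ("n/m").
ROW = re.compile(
    r"^(?:\[[^\]]+\]\s+)?"
    r"(?P<name>\S+)\s+"
    r"(?P<ready>\d+)/(?P<total>\d+)\s+"
    r"(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)"
)


def parse_pods(text: str) -> list[PodStatus]:
    """Parse every pod row found in a captured log excerpt."""
    pods = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            pods.append(
                PodStatus(
                    m["name"],
                    int(m["ready"]),
                    int(m["total"]),
                    m["status"],
                    int(m["restarts"]),
                )
            )
    return pods


def unhealthy(pods: list[PodStatus]) -> list[PodStatus]:
    """Pods that are not fully ready or not in the Running state."""
    return [p for p in pods if p.ready < p.total or p.status != "Running"]


if __name__ == "__main__":
    sample = (
        "[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-mixcoord-6c85784f75-g7zzc"
        "             0/1     Running       1 (4m13s ago)      6m18s\n"
        "[2024-11-04T17:15:29.347Z] kafka-mixcoord-5441-milvus-proxy-55c9d74c58-c227r"
        "                1/1     Running       0                  18m\n"
    )
    for pod in unhealthy(parse_pods(sample)):
        print(pod.name, f"{pod.ready}/{pod.total}", pod.status, pod.restarts)
```

Checking readiness as well as status matters here: in the listings above mixcoord stays in the `Running` state while reporting 0/1 ready, so a check on the STATUS column alone would miss it.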

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5441/pipeline/82/

log:

artifacts-kafka-mixcoord-5441-server-logs.tar.gz

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 5, 2024
@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Nov 5, 2024
@zhuwenxing
Contributor Author

/assign @weiliu1031

@zhuwenxing
Contributor Author

When the deployment mode is cluster, datacoord keeps crashing during the upgrade (CrashLoopBackOff):

[2024-11-04T17:16:22.340Z] [2024-11-04 17:16:20 - INFO - ci_test]: [update image for ['dataCoord', 'indexCoord']]wait 10s for milvus ready (test_rolling_update_one_by_one.py:267)
[2024-11-04T17:16:30.407Z] [2024-11-04 17:16:30 - INFO - ci_test]: cmd: kubectl get pod|grep kafka-cluster-5440 (test_rolling_update_one_by_one.py:139)
[2024-11-04T17:16:30.668Z] [2024-11-04 17:16:30 - INFO - ci_test]: kubectl get pod|grep kafka-cluster-5440
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-etcd-0                                        1/1     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-etcd-1                                        1/1     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-etcd-2                                        1/1     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-kafka-0                                       2/2     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-kafka-1                                       2/2     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-kafka-2                                       2/2     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-kafka-exporter-9fb468c6d-mmmhj                1/1     Running            3 (22m ago)        22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-kafka-zookeeper-0                             1/1     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-kafka-zookeeper-1                             1/1     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-kafka-zookeeper-2                             1/1     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-datacoord-7b76ffbc86-r778j             0/1     CrashLoopBackOff   5 (15s ago)        7m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-datanode-5fd5596bd8-44nck              1/1     Running            0                  21m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-datanode-5fd5596bd8-kkkdr              1/1     Running            0                  21m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-datanode-5fd5596bd8-vgdp6              1/1     Running            0                  21m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-indexcoord-855fd75cd8-q7dmv            1/1     Running            0                  7m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-indexnode-5488fb4647-g2fvd             1/1     Running            0                  13m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-indexnode-5488fb4647-mwxvw             1/1     Running            0                  14m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-indexnode-5488fb4647-wtg9w             1/1     Running            0                  15m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-proxy-5977d886c6-bkww9                 1/1     Running            0                  21m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-querycoord-7c4b485775-sk74z            1/1     Running            0                  21m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-querynode-0-777b598d68-4x852           1/1     Running            0                  21m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-querynode-0-777b598d68-gwft7           1/1     Running            0                  21m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-querynode-0-777b598d68-zf787           1/1     Running            0                  21m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-milvus-rootcoord-7d767cf9db-sxv7k             1/1     Running            0                  10m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-minio-0                                       1/1     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-minio-1                                       1/1     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-minio-2                                       1/1     Running            0                  22m
[2024-11-04T17:16:30.668Z] kafka-cluster-5440-minio-3                                       1/1     Running            0                  22m

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5440/pipeline

log:
artifacts-kafka-cluster-5440-server-logs.tar.gz

@weiliu1031
Contributor

This is a known issue and should be fixed by #37418. Please verify with the latest image.

@weiliu1031
Contributor

/assign @zhuwenxing

@zhuwenxing
Contributor Author

Verified as fixed in master-20241105-b83b376c-amd64.
