[sc] enable nebula_meta_ssl, after add hosts, balance in zone report error #262

Closed
jinyingsunny opened this issue Sep 7, 2023 · 2 comments
Assignees
Labels
affects/none PR/issue: this bug affects none version. process/done Process of bug severity/none Severity of bug type/bug Type: something is unexpected wontfix Solution: this will not be worked on

Comments

@jinyingsunny

As described in the title.

(root@nebula) [uuu33]>  add hosts "nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779,"nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779,"nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779 into zone `us-east-2c`
Execution succeeded (time spent 1.152ms/2.318545ms)

Thu, 07 Sep 2023 20:47:27 CST

(root@nebula) [uuu33]> balance in zone
+------------+
| New Job Id |
+------------+
| 10         |
+------------+
Got 1 rows (time spent 1.237ms/2.184208ms)

Thu, 07 Sep 2023 20:47:31 CST

(root@nebula) [uuu33]> show job 10
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+----------------------------+----------------------------+-------------------+
| Job Id(spaceId:partId) | Command(src->dst)                                                                                                                                                         | Status     | Start Time                 | Stop Time                  | State             |
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+----------------------------+----------------------------+-------------------+
| 10                     | "DATA_BALANCE"                                                                                                                                                            | "FAILED"   | 2023-09-07T12:47:26.000000 | 2023-09-07T12:47:26.000000 | "E_JOB_SUBMITTED" |
| "10, 1:1"              | "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779" | "FAILED"   | 2023-09-07T12:47:25.000000 | 2023-09-07T12:47:25.000000 | ""                |
| "10, 1:2"              | "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779" | "FAILED"   | 2023-09-07T12:47:25.000000 | 2023-09-07T12:47:25.000000 | ""                |
| "10, 1:3"              | "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779" | "FAILED"   | 2023-09-07T12:47:25.000000 | 2023-09-07T12:47:25.000000 | ""                |
| "10, 1:4"              | "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779" | "FAILED"   | 2023-09-07T12:47:25.000000 | 2023-09-07T12:47:25.000000 | ""                |
| "10, 1:5"              | "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779" | "FAILED"   | 2023-09-07T12:47:25.000000 | 2023-09-07T12:47:25.000000 | ""                |
| "10, 1:6"              | "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779" | "FAILED"   | 2023-09-07T12:47:25.000000 | 2023-09-07T12:47:25.000000 | ""                |
| "10, 1:7"              | "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779" | "FAILED"   | 2023-09-07T12:47:25.000000 | 2023-09-07T12:47:25.000000 | ""                |
| "Total:7"              | "Succeeded:0"                                                                                                                                                             | "Failed:7" | "In Progress:0"            | "Invalid:0"                | ""                |
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+----------------------------+----------------------------+-------------------+
Got 9 rows (time spent 1.098ms/2.467175ms)

Thu, 07 Sep 2023 20:47:33 CST

(root@nebula) [uuu33]> show parts
+--------------+---------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
| Partition ID | Leader                                                                                | Peers                                                                                                                                                                                                                                                          | Losts |
+--------------+---------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
| 1            | "nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | "nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-4.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | ""    |
| 2            | "nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | "nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-1.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | ""    |
| 3            | "nebulazcert-storaged-7.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | "nebulazcert-storaged-6.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-7.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | ""    |
| 4            | "nebulazcert-storaged-9.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-9.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-10.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779" | ""    |
| 5            | "nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | "nebulazcert-storaged-1.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | ""    |
| 6            | "nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | "nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-4.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | ""    |
| 7            | "nebulazcert-storaged-6.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | "nebulazcert-storaged-6.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-7.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | ""    |
| 8            | "nebulazcert-storaged-10.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779" | "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-9.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-10.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779" | ""    |
| 9            | "nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | "nebulazcert-storaged-1.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | ""    |
| 10           | "nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | "nebulazcert-storaged-4.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779"  | ""    |
+--------------+---------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
Got 10 rows (time spent 1.269ms/2.303291ms)

Thu, 07 Sep 2023 20:48:01 CST

(root@nebula) [uuu33]> show zones
+--------------+----------------------------------------------------------------------------------+------+
| Name         | Host                                                                             | Port |
+--------------+----------------------------------------------------------------------------------+------+
| "us-east-2a" | "nebulazcert-storaged-1.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 |
| "us-east-2a" | "nebulazcert-storaged-4.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 |
| "us-east-2a" | "nebulazcert-storaged-6.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 |
| "us-east-2a" | "nebulazcert-storaged-9.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 |
| "us-east-2b" | "nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 |
| "us-east-2b" | "nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 |
| "us-east-2b" | "nebulazcert-storaged-7.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 |
| "us-east-2b" | "nebulazcert-storaged-10.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 |
| "us-east-2c" | "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 |
| "us-east-2c" | "nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 |
| "us-east-2c" | "nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 |
| "us-east-2c" | "nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 |
+--------------+----------------------------------------------------------------------------------+------+
Got 12 rows (time spent 1.874ms/3.725131ms)

Thu, 07 Sep 2023 20:51:21 CST

(root@nebula) [uuu33]> show hosts storage
+----------------------------------------------------------------------------------+------+----------+-----------+--------------+----------------+
| Host                                                                             | Port | Status   | Role      | Git Info Sha | Version        |
+----------------------------------------------------------------------------------+------+----------+-----------+--------------+----------------+
| "nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-1.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-4.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-6.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-7.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-9.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-10.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
| "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE" | "STORAGE" | "b10a112"    | "3.5.0-sc-ent" |
+----------------------------------------------------------------------------------+------+----------+-----------+--------------+----------------+
Got 12 rows (time spent 1.427ms/2.550591ms)

Thu, 07 Sep 2023 20:51:28 CST

balance in zone reports an error:

[root@nebulazcert-metad-0 nebula]# grep job logs/nebula-metad.INFO
I20230907 12:47:26.193697   121 JobManager.cpp:570] Add job successfully, job id=10, job type=DATA_BALANCE
I20230907 12:47:26.204830   198 JobManager.cpp:311] Trying to end job, spaceId=1, jobId=10, target phase status=FAILED
[root@nebulazcert-metad-0 nebula]# grep "I20230907 12:47:26.193697" -A100 logs/nebula-metad.INFO
I20230907 12:47:26.193697   121 JobManager.cpp:570] Add job successfully, job id=10, job type=DATA_BALANCE
I20230907 12:47:26.196938   110 BalancePlan.cpp:141] bucketSize: 7, concurrency: 7
I20230907 12:47:26.196962   110 BalanceTask.cpp:47] 10, 1:7,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 still in processing
I20230907 12:47:26.196979   110 BalanceTask.cpp:52] 10, 1:7 Start to move part, check the peers firstly!
I20230907 12:47:26.198477   110 BalanceTask.cpp:47] 10, 1:6,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 still in processing
I20230907 12:47:26.198496   110 BalanceTask.cpp:52] 10, 1:6 Start to move part, check the peers firstly!
I20230907 12:47:26.200115   110 BalanceTask.cpp:47] 10, 1:2,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 still in processing
I20230907 12:47:26.200132   110 BalanceTask.cpp:52] 10, 1:2 Start to move part, check the peers firstly!
I20230907 12:47:26.201504   110 BalanceTask.cpp:47] 10, 1:1,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 still in processing
I20230907 12:47:26.201521   110 BalanceTask.cpp:52] 10, 1:1 Start to move part, check the peers firstly!
I20230907 12:47:26.201825   192 BalanceTask.cpp:58] 10, 1:7,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Check the peers failed, status ["nebulazcert-storaged-6.nebulazcert-storaged-headless.nebula.svc.cluster.local":9778] RPC failure in AdminClient: apache::thrift::transport::TTransportException: AsyncSocketException: recv() failed (peer=10.244.0.112:9778, local=10.244.3.65:59568), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
I20230907 12:47:26.201838   194 BalanceTask.cpp:58] 10, 1:6,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Check the peers failed, status ["nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local":9778] RPC failure in AdminClient: apache::thrift::transport::TTransportException: AsyncSocketException: recv() failed (peer=10.244.3.59:9778, local=10.244.3.65:44574), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
I20230907 12:47:26.202085   194 BalanceTask.cpp:42] 10, 1:6,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Task failed, status 1
I20230907 12:47:26.202098   194 BalancePlan.cpp:98] Balance 10 has completed 1 task
I20230907 12:47:26.202136   192 BalanceTask.cpp:42] 10, 1:7,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Task failed, status 1
I20230907 12:47:26.202155   192 BalancePlan.cpp:98] Balance 10 has completed 2 task
I20230907 12:47:26.202227   110 BalanceTask.cpp:47] 10, 1:3,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 still in processing
I20230907 12:47:26.202239   110 BalanceTask.cpp:52] 10, 1:3 Start to move part, check the peers firstly!
I20230907 12:47:26.202576   110 BalanceTask.cpp:47] 10, 1:4,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 still in processing
I20230907 12:47:26.202592   110 BalanceTask.cpp:52] 10, 1:4 Start to move part, check the peers firstly!
I20230907 12:47:26.202883   110 BalanceTask.cpp:47] 10, 1:5,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 still in processing
I20230907 12:47:26.202909   110 BalanceTask.cpp:52] 10, 1:5 Start to move part, check the peers firstly!
I20230907 12:47:26.203729   193 BalanceTask.cpp:58] 10, 1:3,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Check the peers failed, status ["nebulazcert-storaged-6.nebulazcert-storaged-headless.nebula.svc.cluster.local":9778] RPC failure in AdminClient: apache::thrift::transport::TTransportException: AsyncSocketException: recv() failed (peer=10.244.0.112:9778, local=10.244.3.65:59584), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
I20230907 12:47:26.203833   193 BalanceTask.cpp:42] 10, 1:3,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Task failed, status 1
I20230907 12:47:26.203842   193 BalancePlan.cpp:98] Balance 10 has completed 3 task
I20230907 12:47:26.204187   198 BalanceTask.cpp:58] 10, 1:2,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Check the peers failed, status ["nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local":9778] RPC failure in AdminClient: apache::thrift::transport::TTransportException: AsyncSocketException: recv() failed (peer=10.244.3.59:9778, local=10.244.3.65:44582), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
I20230907 12:47:26.204269   198 BalanceTask.cpp:42] 10, 1:2,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Task failed, status 1
I20230907 12:47:26.204277   198 BalancePlan.cpp:98] Balance 10 has completed 4 task
I20230907 12:47:26.204324   195 BalanceTask.cpp:58] 10, 1:4,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Check the peers failed, status ["nebulazcert-storaged-9.nebulazcert-storaged-headless.nebula.svc.cluster.local":9778] RPC failure in AdminClient: apache::thrift::transport::TTransportException: AsyncSocketException: recv() failed (peer=10.244.0.115:9778, local=10.244.3.65:47550), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
I20230907 12:47:26.204412   195 BalanceTask.cpp:42] 10, 1:4,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Task failed, status 1
I20230907 12:47:26.204429   195 BalancePlan.cpp:98] Balance 10 has completed 5 task
I20230907 12:47:26.204517   199 BalanceTask.cpp:58] 10, 1:1,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Check the peers failed, status ["nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local":9778] RPC failure in AdminClient: apache::thrift::transport::TTransportException: AsyncSocketException: recv() failed (peer=10.244.3.54:9778, local=10.244.3.65:50506), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
I20230907 12:47:26.204591   199 BalanceTask.cpp:42] 10, 1:1,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Task failed, status 1
I20230907 12:47:26.204599   199 BalancePlan.cpp:98] Balance 10 has completed 6 task
I20230907 12:47:26.204674   198 BalanceTask.cpp:58] 10, 1:5,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Check the peers failed, status ["nebulazcert-storaged-1.nebulazcert-storaged-headless.nebula.svc.cluster.local":9778] RPC failure in AdminClient: apache::thrift::transport::TTransportException: AsyncSocketException: recv() failed (peer=10.244.0.105:9778, local=10.244.3.65:35712), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
I20230907 12:47:26.204753   198 BalanceTask.cpp:42] 10, 1:5,nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Task failed, status 1
I20230907 12:47:26.204761   198 BalancePlan.cpp:98] Balance 10 has completed 7 task
I20230907 12:47:26.204766   198 BalancePlan.cpp:102] Balance 10 failed!
I20230907 12:47:26.204830   198 JobManager.cpp:311] Trying to end job, spaceId=1, jobId=10, target phase status=FAILED
I20230907 12:47:26.204887   198 JobManager.cpp:295] cleanJob 10
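To see at a glance which peers rejected the check, the "Check the peers failed" lines can be tallied per peer, port, and errno. The following is a minimal sketch, not part of NebulaGraph: the function name `summarize_peer_failures` and the regular expression are assumptions based only on the line format shown in the log above.

```python
import re
from collections import Counter

# Pattern for the "Check the peers failed" lines as they appear in the metad
# log above. The format is inferred from this log only (an assumption).
PEER_FAILURE = re.compile(
    r'(?P<task>\d+, \d+:\d+),\S+ Check the peers failed, '
    r'status \["(?P<peer>[^"]+)":(?P<port>\d+)\].*errno = (?P<errno>\d+)'
)


def summarize_peer_failures(lines):
    """Count failed peer checks per (peer host, port, errno)."""
    counts = Counter()
    for line in lines:
        match = PEER_FAILURE.search(line)
        if match:
            key = (
                match.group("peer"),
                int(match.group("port")),
                int(match.group("errno")),
            )
            counts[key] += 1
    return counts


# Hypothetical usage against a local copy of the log:
#   with open("logs/nebula-metad.INFO") as f:
#       for key, n in summarize_peer_failures(f).most_common():
#           print(n, key)
```

In the log above, all seven failures are resets (errno 104) on port 9778, the port AdminClient connects to, and they are spread across storaged-1, -2, -5, -6, and -9 rather than concentrated on one host.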

Errors in the operator log:

E0907 12:56:23.250556       1 storage_client.go:89] TransLeader failed: read tcp 10.244.1.253:34752->10.244.3.64:9778: read: connection reset by peer
E0907 12:56:23.250668       1 nebula_cluster_control.go:119] reconcile storaged cluster failed: read tcp 10.244.1.253:34752->10.244.3.64:9778: read: connection reset by peer

Your Environments (required)

operator:reg.vesoft-inc.com/cloud-dev/nebula-operator:snap-1.12
nebula-ent-sc: reg.vesoft-inc.com/rc/nebula-metad-ent pushed time 9/7/23, 4:49 PM

How To Reproduce(required)

Steps to reproduce the behavior:

1. With nebula_meta_ssl on: 3 graph, 12 storage, 1 meta.
2. balance in zone remove "nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779,"nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779,"nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779
3. drop hosts "nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779,"nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779,"nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779
4. add hosts "nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779,"nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779,"nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779 into zone `us-east-2c`
5. balance in zone 

Expected behavior

The balance job succeeds.

@jinyingsunny jinyingsunny added the type/bug Type: something is unexpected label Sep 7, 2023
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Sep 7, 2023
@jinyingsunny
Author

jinyingsunny commented Sep 8, 2023

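After discussion: the operator handles scaling out and in automatically, and changing the configuration is all that is needed. Manual operations through the console should no longer be performed, so this test scenario is out of scope. Product owner @MuYiYong, please consider a unified long-term design.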

@github-actions github-actions bot added the process/fixed Process of bug label Sep 8, 2023
@jinyingsunny
Author

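The error above occurred when balance was run very shortly after add hosts. After waiting a while and retrying, it did not reproduce. Retrying today, it did not reproduce either.

When a host is offline and balance is run, and data needs to be moved to that offline node, the same kind of error as above is reported: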

20230908 04:01:27.414680   197 BalanceTask.cpp:103] 21, 13:8,nebulazcert-storaged-9.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779->nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779 Open part failed, status ["nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local":9778] RPC failure in AdminClient: apache::thrift::transport::TTransportException: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
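Since the failure did not reproduce after waiting, and an offline host fails with "Connection refused", one workaround for manual testing is to poll port 9778 on each storaged host before submitting the balance job. The sketch below is an assumption-laden illustration, not an operator or NebulaGraph feature; `port_is_open` and `wait_for_ports` are hypothetical helper names. It only checks that a plain TCP connection can be opened. It does not perform a TLS handshake, so it would catch the errno 111 case but not necessarily the errno 104 resets seen with nebula_meta_ssl on.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a plain TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def wait_for_ports(endpoints, deadline_s=60.0, interval_s=1.0):
    """Poll until every (host, port) accepts TCP connections or time runs out.

    Returns the endpoints that are still unreachable; an empty list means
    all of them accepted a connection.
    """
    pending = list(endpoints)
    stop_at = time.monotonic() + deadline_s
    while True:
        pending = [ep for ep in pending if not port_is_open(*ep)]
        if not pending or time.monotonic() >= stop_at:
            return pending
        time.sleep(interval_s)
```

A caller would pass the storaged hostnames from show hosts storage paired with port 9778, and run balance in zone only if the returned list is empty.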

@jinyingsunny jinyingsunny added process/done Process of bug wontfix Solution: this will not be worked on and removed process/fixed Process of bug labels Oct 10, 2023

2 participants