
[Scaling-sc] Set storage replica=30; only 26 pods came up, and scaling in to 24 then made the operator report errors #267

Closed
jinyingsunny opened this issue Sep 13, 2023 · 2 comments
Assignees
Labels
affects/none PR/issue: this bug affects none version. process/done Process of bug severity/none Severity of bug type/bug Type: something is unexpected

Comments

@jinyingsunny

As the title says, the cluster ended up in an inconsistent state:
👇 storage has 24 pods
(screenshot)

👇 storage still shows the 26 hosts from before the scale-in, two of which are OFFLINE
(screenshot)

Errors from the operator:

E0913 06:36:15.367434       1 pvc.go:64] get PVC [nebula/storaged-log-nebulazcert-storaged-24] failed: persistentvolumeclaims "storaged-log-nebulazcert-storaged-24" not found
E0913 06:36:15.383864       1 pvc.go:64] get PVC [nebula/storaged-data-nebulazcert-storaged-24] failed: persistentvolumeclaims "storaged-data-nebulazcert-storaged-24" not found
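For context, the PVCs the operator complains about can be checked directly. A minimal sketch, assuming the nebula namespace and the PVC names taken verbatim from the log lines above:

```shell
# List all storaged PVCs in the namespace to see which ordinals have claims.
kubectl -n nebula get pvc | grep nebulazcert-storaged

# Check specifically for ordinal 24, which the operator failed to find.
kubectl -n nebula get pvc \
  storaged-log-nebulazcert-storaged-24 \
  storaged-data-nebulazcert-storaged-24
```

These are operational commands against a live cluster, shown only to make the error reproducible to check.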

Another thing that puzzles me: I currently have two spaces, and for both of them balance leader was executed four times in a row; shouldn't balance data have been executed first?

(root@nebula) [baske3s]> show hosts
+----------------------------------------------------------------------------------+------+-----------+--------------+----------------------------+------------------------------+----------------+
| Host                                                                             | Port | Status    | Leader count | Leader distribution        | Partition distribution       | Version        |
+----------------------------------------------------------------------------------+------+-----------+--------------+----------------------------+------------------------------+----------------+
| "nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE"  | 8            | "baske3s:4, baske3s_int:4" | "baske3s:12, baske3s_int:12" | "3.5.0-sc-ent" |
| "nebulazcert-storaged-1.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE"  | 8            | "baske3s:4, baske3s_int:4" | "baske3s:12, baske3s_int:12" | "3.5.0-sc-ent" |
| "nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE"  | 8            | "baske3s:4, baske3s_int:4" | "baske3s:12, baske3s_int:12" | "3.5.0-sc-ent" |
| "nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-4.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-6.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-7.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-9.nebulazcert-storaged-headless.nebula.svc.cluster.local"  | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-10.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-12.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-13.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-14.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-15.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-16.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-17.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-18.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-19.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-20.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-21.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-22.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-23.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "ONLINE"  | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-24.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "OFFLINE" | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
| "nebulazcert-storaged-25.nebulazcert-storaged-headless.nebula.svc.cluster.local" | 9779 | "OFFLINE" | 0            | "No valid partition"       | "No valid partition"         | "3.5.0-sc-ent" |
+----------------------------------------------------------------------------------+------+-----------+--------------+----------------------------+------------------------------+----------------+
Got 26 rows (time spent 2.585ms/4.475831ms)

Wed, 13 Sep 2023 15:14:50 CST

(root@nebula) [baske3s]> show jobs
+--------+----------------------+------------+----------------------------+----------------------------+
| Job Id | Command              | Status     | Start Time                 | Stop Time                  |
+--------+----------------------+------------+----------------------------+----------------------------+
| 18     | "LEADER_BALANCE"     | "FINISHED" | 2023-09-13T06:36:42.000000 | 2023-09-13T06:36:42.000000 |
| 16     | "LEADER_BALANCE"     | "FINISHED" | 2023-09-13T06:36:32.000000 | 2023-09-13T06:36:33.000000 |
| 14     | "LEADER_BALANCE"     | "FINISHED" | 2023-09-13T06:36:21.000000 | 2023-09-13T06:36:23.000000 |
| 12     | "LEADER_BALANCE"     | "FINISHED" | 2023-09-13T06:36:10.000000 | 2023-09-13T06:36:10.000000 |
| 11     | "LEADER_BALANCE"     | "FINISHED" | 2023-09-13T04:33:36.000000 | 2023-09-13T04:33:36.000000 |
| 10     | "LEADER_BALANCE"     | "FINISHED" | 2023-09-13T04:33:28.000000 | 2023-09-13T04:33:33.000000 |
| 8      | "DATA_BALANCE"       | "FINISHED" | 2023-09-13T04:32:57.000000 | 2023-09-13T04:33:02.000000 |
| 7      | "REBUILD_EDGE_INDEX" | "FINISHED" | 2023-09-13T03:02:19.000000 | 2023-09-13T03:02:20.000000 |
| 4      | "LEADER_BALANCE"     | "FINISHED" | 2023-09-13T02:31:44.000000 | 2023-09-13T02:31:49.000000 |
| 3      | "DATA_BALANCE"       | "FINISHED" | 2023-09-13T02:31:19.000000 | 2023-09-13T02:31:34.000000 |
+--------+----------------------+------------+----------------------------+----------------------------+
Got 10 rows (time spent 1.553ms/3.054991ms)

In the nebula-meta log:
the errors from drop hosts:

I20230913 06:36:10.307837   121 DropHostsProcessor.cpp:120] Zone Value: nebulazcert-storaged-2.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-5.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-8.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-10.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-14.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-17.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-20.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-23.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779
I20230913 06:36:10.307907   121 DropHostsProcessor.cpp:120] Zone Value: nebulazcert-storaged-1.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-3.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-7.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-11.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-13.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-16.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-19.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-22.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779
I20230913 06:36:10.307925   121 DropHostsProcessor.cpp:120] Zone Value: nebulazcert-storaged-0.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-4.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-6.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-9.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-12.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-15.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-18.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779, nebulazcert-storaged-21.nebulazcert-storaged-headless.nebula.svc.cluster.local:9779
I20230913 06:36:10.307986   121 DropHostsProcessor.cpp:141] The machine "nebulazcert-storaged-29.nebulazcert-storaged-headless.nebula.svc.cluster.local":9779 not existed!

Your Environments (required)

nebula-operator:reg.vesoft-inc.com/cloud-dev/nebula-operator:snap-1.13
pushed time 9/12/23, 10:20 PM

How To Reproduce (required)

1. Scale storage out to 30 replicas; 4 of them failed, with the pods stuck in Pending state;
2. Then scale in, from 30 storaged replicas down to 24;
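The two steps above can be sketched with kubectl. This is a hedged sketch assuming the NebulaCluster resource is named nebulazcert in the nebula namespace (as in the logs above), that the replica count lives at spec.storaged.replicas, and that the storaged pods carry an app.kubernetes.io/component=storaged label; adjust these to your deployment:

```shell
# Step 1: scale storaged out to 30 replicas (4 pods end up Pending in this report).
kubectl -n nebula patch nc nebulazcert --type merge \
  -p '{"spec":{"storaged":{"replicas":30}}}'

# Watch the pods; here only 26 reached Running.
kubectl -n nebula get pods -l app.kubernetes.io/component=storaged

# Step 2: scale in from 30 to 24, which triggered the operator errors.
kubectl -n nebula patch nc nebulazcert --type merge \
  -p '{"spec":{"storaged":{"replicas":24}}}'
```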

Expected behavior
The scale-in succeeds.

@jinyingsunny jinyingsunny added the type/bug Type: something is unexpected label Sep 13, 2023
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Sep 13, 2023
@jinyingsunny jinyingsunny changed the title [Scaling] Set storage replica=30; only 26 pods came up, and scaling in to 24 then made the operator report errors [Scaling-sc] Set storage replica=30; only 26 pods came up, and scaling in to 24 then made the operator report errors Sep 13, 2023
@MegaByte875
Contributor

#272

@jinyingsunny
Author

jinyingsunny commented Sep 14, 2023

Checked on reg.vesoft-inc.com/cloud-dev/nebula-operator:snap-1.14, rebuilt today.
1. Scaled out to 300 nodes; the first 40 pods came up successfully, while the remaining 260 stayed in Pending state;
2. Both the console and the configmap (kubectl -n nebula get cm nebulazcert-storaged-zone -o yaml) show 40 nodes;
3. Since the scale-out did not succeed, balance data was not executed and the data is unevenly distributed; however, 300 PVCs were kept, with the trailing 260 in Pending state;

Next step:
4. Changed the node count to 36; this time the scale-in succeeded. There were still 40 pods at first; when the status in kubectl -n nebula get nc nebulazcert -oyaml turned ready:

    storaged:
      lastBalanceJob:
        jobID: 8
        spaceID: 2
      phase: ScaleIn
      version: v3.5.0-sc
      workload:
        availableReplicas: 36
        collisionCount: 0
        currentReplicas: 36
        currentRevision: nebulazcert-storaged-58cb5cb744
        observedGeneration: 4
        readyReplicas: 36
        replicas: 36
        updateRevision: nebulazcert-storaged-58cb5cb744
        updatedReplicas: 36

Checking the configmap again, it holds 36 nodes, as expected.
5. The console also shows 36 nodes, all in ONLINE state.
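The checks in steps 2, 4, and 5 above can be repeated with a few commands. A minimal sketch, assuming the same cluster name and namespace as in this report:

```shell
# The zone configmap should list exactly the surviving hosts (36 here).
kubectl -n nebula get cm nebulazcert-storaged-zone -o yaml

# The NebulaCluster status should report 36 ready storaged replicas.
kubectl -n nebula get nc nebulazcert -o yaml

# The pod count should match once the scale-in completes.
kubectl -n nebula get pods | grep nebulazcert-storaged | wc -l
```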

@github-actions github-actions bot added the process/fixed Process of bug label Sep 14, 2023
@jinyingsunny jinyingsunny added process/done Process of bug and removed process/fixed Process of bug labels Sep 14, 2023