
storage: BALANCE DATA REMOVE failed with [ERROR (-8)]: The cluster is balanced! #2731

Closed
veezhang opened this issue May 11, 2021 · 14 comments
Labels
type/question Type: question about the product

Comments

@veezhang
Contributor

Steps:

  1. Create a cluster with 1 graphd, 3 metad, and 3 storaged.
  2. Init some data:

CREATE SPACE IF NOT EXISTS e2e_test(partition_num=15,replica_factor=3);
USE e2e_test;
CREATE TAG IF NOT EXISTS person(name string, age int);
CREATE EDGE IF NOT EXISTS like(likeness double);
INSERT VERTEX person(name, age) VALUES 'Bob':('Bob', 10), 'Lily':('Lily', 9), 'Tom':('Tom', 10), 'Jerry':('Jerry', 13), 'John':('John', 11);

  3. Scale storage out to 4 and then to 5.
  4. Insert some data:

USE e2e_test;
INSERT EDGE like(likeness) VALUES 'Bob'->'Lily':(80.0), 'Bob'->'Tom':(70.0);

  5. Scale storage in to 4 and then to 3 (see the sketch after this list).
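For context, here is a minimal sketch of the balance statements the scaling steps presumably map to; the exact sequence driven by the operator is an assumption, and the hostnames follow the pattern of this cluster:

# after each scale-out step, redistribute existing parts onto the new hosts
BALANCE DATA;
# before each scale-in step, move parts off the host being removed
BALANCE DATA REMOVE "test1-3-3-storaged-4.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local":9779;
BALANCE DATA REMOVE "test1-3-3-storaged-3.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local":9779;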

But when I scale storage in to 3, the command fails with [ERROR (-8)]: No hosts!, as follows:

BALANCE DATA REMOVE "test1-3-3-storaged-3.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local":9779
[ERROR (-8)]: No hosts!

There are also some error logs that may help.

show hosts:

(root@nebula) [e2e_test]> show hosts
+-----------------------------------------------------------------------------------------+------+-----------+--------------+----------------------+------------------------+
| Host                                                                                    | Port | Status    | Leader count | Leader distribution  | Partition distribution |
+-----------------------------------------------------------------------------------------+------+-----------+--------------+----------------------+------------------------+
| "test1-3-3-storaged-0.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local" | 9779 | "ONLINE"  | 4            | "e2e_test:4"         | "e2e_test:11"          |
+-----------------------------------------------------------------------------------------+------+-----------+--------------+----------------------+------------------------+
| "test1-3-3-storaged-1.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local" | 9779 | "ONLINE"  | 4            | "e2e_test:4"         | "e2e_test:12"          |
+-----------------------------------------------------------------------------------------+------+-----------+--------------+----------------------+------------------------+
| "test1-3-3-storaged-2.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local" | 9779 | "ONLINE"  | 3            | "e2e_test:3"         | "e2e_test:12"          |
+-----------------------------------------------------------------------------------------+------+-----------+--------------+----------------------+------------------------+
| "test1-3-3-storaged-3.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local" | 9779 | "ONLINE"  | 4            | "e2e_test:4"         | "e2e_test:11"          |
+-----------------------------------------------------------------------------------------+------+-----------+--------------+----------------------+------------------------+
| "test1-3-3-storaged-4.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local" | 9779 | "OFFLINE" | 0            | "No valid partition" | "No valid partition"   |
+-----------------------------------------------------------------------------------------+------+-----------+--------------+----------------------+------------------------+
| "Total"                                                                                 |      |           | 15           | "e2e_test:15"        | "e2e_test:46"          |
+-----------------------------------------------------------------------------------------+------+-----------+--------------+----------------------+------------------------+

meta logs:

E0510 10:31:45.725432   124 Balancer.cpp:333] Can't find a host which doesn't have part: 6
E0510 10:31:45.725461   124 Balancer.cpp:299] Transfer lost host "test1-3-3-storaged-3.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local":9779 failed
E0510 10:31:45.725469   124 Balancer.cpp:224] Generate tasks on space 1 failed
E0510 10:31:45.725479   124 Balancer.cpp:37] Create balance plan failed
E0510 10:31:45.725487   124 BalanceProcessor.cpp:120] Balance Failed: E_NO_HOSTS

Here is the meta log file metad-stderr.log.
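A hedged reading of these logs against the SHOW HOSTS output above (an editor's inference, not confirmed in the thread):

expected part replicas = partition_num x replica_factor = 15 x 3 = 45
observed (Partition distribution column) = 11 + 12 + 12 + 11 = 46

The extra replica suggests one part (apparently part 6, per the first log line) ended up with a replica on all four online hosts, so when the balancer tries to move its replica off storaged-3 there is no remaining host that doesn't already hold it, and plan generation fails with E_NO_HOSTS.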

@HarrisChu
Contributor

BALANCE DATA REMOVE uses the HTTP port 19779, so it should be BALANCE DATA REMOVE "test1-3-3-storaged-3.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local":19779.

But this will be confusing for users.

@veezhang
Contributor Author

With that port, it cannot remove the host, because the cluster is reported as balanced:

BALANCE DATA REMOVE "test1-3-3-storaged-3.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local":19779
[ERROR (-8)]: The cluster is balanced!

@veezhang veezhang changed the title storage: BALANCE DATA REMOVE failed with [ERROR (-8)]: No hosts! storage: BALANCE DATA REMOVE failed with [ERROR (-8)]: The cluster is balanced! May 11, 2021
@darionyaphet
Contributor

"test1-3-3-storaged-3.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local":19779 not a available host

becase it's port should be 9779

@critical27
Contributor

(screenshot)

@critical27 critical27 reopened this May 12, 2021
@darionyaphet
Contributor

(screenshot)

@critical27
Contributor

critical27 commented May 13, 2021

Remove didn't require the host to be online... How can you shrink the cluster when a node is dead? It did work in 1.0.

@darionyaphet
Contributor

How should we process the tasks related to the lost node, especially REMOVE_PART_ON_SRC?

Did it work in 1.0?

@critical27
Contributor

REMOVE_PART_ON_SRC could be done lazily; the part info in meta is what matters most. It definitely worked previously.
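For context, a balance task in the 2.x balancer moves a single part through a fixed sequence of steps. The step names below are recalled from the nebula-storage BalanceTask code and should be treated as an approximation, not an authoritative list:

ADD_PART_ON_DST -> ADD_LEARNER -> CATCH_UP_DATA -> MEMBER_CHANGE -> UPDATE_PART_META -> REMOVE_PART_ON_SRC

Doing REMOVE_PART_ON_SRC lazily would mean the membership change and meta update complete first, so the plan can finish even if the source host is dead, and the physical cleanup happens whenever that host comes back.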

@veezhang
Contributor Author

The error occurred when removing test1-3-3-storaged-3...., not test1-3-3-storaged-4.....
And test1-3-3-storaged-4.... had already been removed before that.

@darionyaphet
Contributor

This space's partition number is 15 and its replication factor is 3, so there should be 45 data parts. Why 46?

@darionyaphet
Contributor

Can't find a host which doesn't have part: 6

Is your cluster still working?

@veezhang
Contributor Author

veezhang commented Jun 2, 2021

This space's partition number is 15 and its replication factor is 3, so there should be 45 data parts. Why 46?

Sorry... I don't know about this.

Can't find a host which doesn't have part: 6 — is your cluster still working?

It was used for testing, and the cluster has already been destroyed.

@CPWstatic CPWstatic transferred this issue from vesoft-inc/nebula-storage Aug 28, 2021
@CPWstatic CPWstatic added the type/question Type: question about the product label Aug 28, 2021
@critical27
Contributor

Fixed, closed

@HarrisChu
Contributor

"test1-3-3-storaged-3.test1-3-3-storaged-headless.nebulacluster-7650.svc.cluster.local":19779 not a available host

becase it's port should be 9779

@randomJoe211
please recorrect doc, it may be confused. https://docs.nebula-graph.io/2.0/8.service-tuning/load-balance/

(screenshot of the load-balance documentation page)
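For reference, a sketch of the form the corrected doc would presumably show, with the port being the storaged port registered in SHOW HOSTS (9779 by default); the exact wording is the editor's assumption, not the final doc text:

# move all parts off the given storaged instance before decommissioning it
BALANCE DATA REMOVE "<storaged-host>":9779;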
