
Cluster (v3.2.0) becomes unstable when more data is ingested into an existing space #4668

Closed
porscheme opened this issue Sep 22, 2022 · 5 comments
Labels
type/bug Type: something is unexpected wontfix Solution: this will not be worked on recently

Comments

@porscheme

Please check the FAQ documentation before raising an issue

Describe the bug (required)

  • Cluster becomes unstable beyond these statistics
    -- Vertices: 50 million
    -- Edges: 3.3 billion

  • We are seeing this in our TEST cluster

I20220915 16:28:09.498340   102 NebulaSnapshotManager.cpp:67] Space 61 Part 34 start send snapshot of commitLogId 89598562 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20220915 16:28:09.498370   102 SnapshotManager.cpp:51] [Port: 9780, Space: 61, Part: 34] Snapshot send failed, the leader changed?
I20220915 16:28:09.498400   102 Host.cpp:355] [Port: 9780, Space: 61, Part: 34] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Send snapshot failed!
I20220915 16:28:09.498530    53 Host.cpp:337] [Port: 9780, Space: 61, Part: 115] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Can't find log 1 in wal, send the snapshot, logIdToSend = 32549073, firstLogId in wal = 32538198, lastLogId in wal = 32549073
I20220915 16:28:09.498550    66 Host.cpp:337] [Port: 9780, Space: 61, Part: 124] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Can't find log 1 in wal, send the snapshot, logIdToSend = 97588614, firstLogId in wal = 97582907, lastLogId in wal = 97588614

Your Environments (required)

  • OS: 18.04.1-Ubuntu x86_64 GNU/Linux
  • Compiler:
    Using docker images
  • CPU: lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 16
    On-line CPU(s) list: 0-15
    Thread(s) per core: 2
    Core(s) per socket: 8
    Socket(s): 1
    NUMA node(s): 2
    Vendor ID: AuthenticAMD
    CPU family: 23
    Model: 1
    Model name: AMD EPYC 7551 32-Core Processor
    Stepping: 2
    CPU MHz: 1996.300
    BogoMIPS: 3992.60
    Hypervisor vendor: *****
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 64K
    L2 cache: 512K
    L3 cache: 8192K
    NUMA node0 CPU(s): 0-7
    NUMA node1 CPU(s): 8-15
  • Commit id (e.g. a3ffc7d8)
    Not sure how to get this

How To Reproduce (required)

Steps to reproduce the behavior:

  1. Create a Nebula cluster
    -- graphd VM count: 3 (16 vCPU, 128 GB, 2 X 2 TB Premium SSD NVMe Disks)
    -- metad VM count: 3 (16 vCPU, 128 GB, 2 X 2 TB Premium SSD NVMe Disks)
    -- storaged VM count: 9 (16 vCPU, 128 GB, 2 X 2 TB Premium SSD NVMe Disks)
  2. Create a space with Vid Type INT64, Partition_num 200, Replica Factor 2
  3. Initially, ingest data using spark-connector
    -- Vertices: 50 million
    -- Edges: 3.3 billion
  4. Run 'SUBMIT JOB COMPACT'; it executed successfully
  5. Verify the cluster is healthy; the physical size of data and logs on disk is not significant
  6. Ingest more data and observe that the cluster becomes unstable
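
For reference, steps 2 and 4 above correspond roughly to the following nGQL sketch (the space name `cohort_space` is illustrative, not taken from the report):

```ngql
-- Step 2: space with INT64 VIDs, 200 partitions, replica factor 2
CREATE SPACE IF NOT EXISTS cohort_space (
    partition_num = 200,
    replica_factor = 2,
    vid_type = INT64
);
USE cohort_space;

-- Step 4: trigger a full compaction after the initial spark-connector load
SUBMIT JOB COMPACT;
```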

Expected behavior

The cluster should remain stable for at least 1 billion vertices and 50 billion edges

Additional context

  • We did not run BALANCE DATA
  • We deliberately created a large cluster to rule out storage and memory bottlenecks
@porscheme porscheme added the type/bug Type: something is unexpected label Sep 22, 2022
@wey-gu
Contributor

wey-gu commented Sep 22, 2022

cc @Sophie-Xie

@liwenhui-soul
Contributor

liwenhui-soul commented Sep 22, 2022

Is the replica factor 2? We suggest using an odd replica factor; please try again with replica factor 3 or 1.
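
A sketch of the suggested change (the space name is illustrative; to my knowledge the replica factor cannot be altered on an existing space, so the space would need to be recreated and the data re-ingested):

```ngql
-- Recreate the space with an odd replica factor
DROP SPACE IF EXISTS cohort_space;
CREATE SPACE cohort_space (
    partition_num = 200,
    replica_factor = 3,
    vid_type = INT64
);
```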

@porscheme
Author

Thanks for the reply. I will try with replica factor 3 and report back.
Which Vid Type is better, INT64 or STRING?
Also, is Partition_num 200 okay?

@liwenhui-soul
Contributor

INT64 and STRING are both fine, I think.
Partition_num 200 is fine.

@porscheme
Author

After setting the replica factor to 3, the cluster seems more stable.

@jinyingsunny jinyingsunny added the wontfix Solution: this will not be worked on recently label Nov 10, 2022