
Cluster (v3.2.0) becomes unstable when more data is ingested into an existing space #4668

Closed
porscheme opened this issue Sep 22, 2022 · 5 comments
Labels
type/bug Type: something is unexpected wontfix Solution: this will not be worked on recently

Comments

@porscheme

Please check the FAQ documentation before raising an issue

Describe the bug (required)

  • Cluster becomes unstable beyond these statistics
    -- Vertices: 50 million
    -- Edges: 3.3 billion

  • We are seeing this in our TEST cluster

I20220915 16:28:09.498340   102 NebulaSnapshotManager.cpp:67] Space 61 Part 34 start send snapshot of commitLogId 89598562 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20220915 16:28:09.498370   102 SnapshotManager.cpp:51] [Port: 9780, Space: 61, Part: 34] Snapshot send failed, the leader changed?
I20220915 16:28:09.498400   102 Host.cpp:355] [Port: 9780, Space: 61, Part: 34] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Send snapshot failed!
I20220915 16:28:09.498530    53 Host.cpp:337] [Port: 9780, Space: 61, Part: 115] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Can't find log 1 in wal, send the snapshot, logIdToSend = 32549073, firstLogId in wal = 32538198, lastLogId in wal = 32549073
I20220915 16:28:09.498550    66 Host.cpp:337] [Port: 9780, Space: 61, Part: 124] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Can't find log 1 in wal, send the snapshot, logIdToSend = 97588614, firstLogId in wal = 97582907, lastLogId in wal = 97588614

Your Environments (required)

  • OS: 18.04.1-Ubuntu x86_64 GNU/Linux
  • Compiler:
    Using docker images
  • CPU: lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 16
    On-line CPU(s) list: 0-15
    Thread(s) per core: 2
    Core(s) per socket: 8
    Socket(s): 1
    NUMA node(s): 2
    Vendor ID: AuthenticAMD
    CPU family: 23
    Model: 1
    Model name: AMD EPYC 7551 32-Core Processor
    Stepping: 2
    CPU MHz: 1996.300
    BogoMIPS: 3992.60
    Hypervisor vendor: *****
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 64K
    L2 cache: 512K
    L3 cache: 8192K
    NUMA node0 CPU(s): 0-7
    NUMA node1 CPU(s): 8-15
  • Commit id (e.g. a3ffc7d8)
    Not sure how to get this

How To Reproduce (required)

Steps to reproduce the behavior:

  1. Create a Nebula cluster
    -- graphd VM count: 3 (16 vCPU, 128 GB, 2 X 2 TB Premium SSD NVMe Disks)
    -- metad VM count: 3 (16 vCPU, 128 GB, 2 X 2 TB Premium SSD NVMe Disks)
    -- storaged VM count: 9 (16 vCPU, 128 GB, 2 X 2 TB Premium SSD NVMe Disks)
  2. Create a space with Vid Type INT64, Partition_num 200, Replica Factor 2
  3. Initially, ingest data using spark-connector
    -- Vertices: 50 million
    -- Edges: 3.3 billion
  4. Run 'SUBMIT JOB COMPACT'; it executed successfully
  5. Verify the cluster is healthy; the physical size of data and logs on disk is not significant
  6. Ingest more data and observe that the cluster becomes unstable
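
For reference, steps 2 and 4 above correspond roughly to the following nGQL sketch (the space name `cohort_space` is illustrative, not taken from the report):

```ngql
-- Step 2: space with INT64 VIDs, 200 partitions, replica factor 2
CREATE SPACE IF NOT EXISTS cohort_space (
    partition_num = 200,
    replica_factor = 2,
    vid_type = INT64
);
USE cohort_space;

-- Step 4: trigger a full compaction after the initial spark-connector load
SUBMIT JOB COMPACT;
```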

Expected behavior

The cluster should remain stable for at least 1 billion vertices and 50 billion edges

Additional context

  • We did not run BALANCE DATA
  • We deliberately created a large cluster to rule out storage and memory bottlenecks
@porscheme porscheme added the type/bug Type: something is unexpected label Sep 22, 2022
@wey-gu
Contributor

wey-gu commented Sep 22, 2022

cc @Sophie-Xie

@liwenhui-soul
Contributor

liwenhui-soul commented Sep 22, 2022

Is the replica factor 2? We suggest using an odd replica factor; please try again with replica factor 3 or 1.
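
A sketch of the suggested change (the space name is illustrative; to my knowledge the replica factor cannot be altered on an existing space, so the space would need to be recreated and the data re-ingested):

```ngql
-- Recreate the space with an odd replica factor
DROP SPACE IF EXISTS cohort_space;
CREATE SPACE cohort_space (
    partition_num = 200,
    replica_factor = 3,
    vid_type = INT64
);
```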

@porscheme
Author

Thanks for the reply. I will try with replica factor 3 and report back.
Which Vid Type is better, INT64 or STRING?
Also, is Partition_num 200 okay?

@liwenhui-soul
Contributor

INT64 and STRING are both fine, I think.
Partition_num 200 is fine.

@porscheme
Author

After setting the replica factor to 3, the cluster seems more stable.

@jinyingsunny jinyingsunny added the wontfix Solution: this will not be worked on recently label Nov 10, 2022