
graphd crash during DML when node being restarted #5041

Closed
wey-gu opened this issue Dec 12, 2022 · 14 comments
Labels
affects/none - PR/issue: this bug affects none version.
need info - Solution: need more information (ex. can't reproduce)
process/fixed - Process of bug
severity/major - Severity of bug
type/bug - Type: something is unexpected

Comments

@wey-gu
Contributor

wey-gu commented Dec 12, 2022

Please check the FAQ documentation before raising an issue

Describe the bug (required)

When I stop the services on one node of the cluster, the graphd service on some other node usually crashes (or stops)!
This is from the graphd log on the node whose graphd service crashed:

E20221207 12:11:41.647439 2874748 StorageClientBase-inl.h:206] Request to "****.87":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:11:41.648522 2874721 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:11:41.648615 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 127
E20221207 12:11:41.648627 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_RPC_FAILURE, part 73
E20221207 12:11:41.648634 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 71
E20221207 12:11:41.648659 2874721 QueryInstance.cpp:137] Storage Error: Not the leader of 127. Please retry later.
E20221207 12:11:43.679056 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 73
E20221207 12:11:43.679096 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 57
E20221207 12:11:43.679103 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 105
E20221207 12:11:43.679122 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 79
E20221207 12:11:43.679129 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 49
E20221207 12:11:43.679150 2874721 QueryInstance.cpp:137] Storage Error: Not the leader of 73. Please retry later.
E20221207 12:12:11.708324 2874738 StorageClientBase-inl.h:206] Request to "****.87":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:12:11.711513 2874721 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:12:11.711608 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_RPC_FAILURE, part 36
E20221207 12:12:11.711623 2874721 StorageAccessExecutor.h:136] Storage Error: part: 36, error: E_RPC_FAILURE(-3).
E20221207 12:12:11.711642 2874721 QueryInstance.cpp:137] Storage Error: part: 36, error: E_RPC_FAILURE(-3).
E20221207 12:12:11.756379 2874765 StorageClientBase-inl.h:206] Request to "****.87":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:12:11.757710 2874718 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:12:11.757824 2874718 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 92
E20221207 12:12:11.757874 2874718 StorageAccessExecutor.h:136] Storage Error: part: 92, error: E_RPC_FAILURE(-3).
E20221207 12:12:11.757901 2874721 QueryInstance.cpp:137] Storage Error: part: 92, error: E_RPC_FAILURE(-3).
E20221207 12:14:57.237293 2874756 StorageClientBase-inl.h:206] Request to "****.88":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:14:57.238889 2874713 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:14:57.238956 2874724 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_LEADER_CHANGED, part 34
E20221207 12:14:57.238977 2874724 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 82
E20221207 12:14:57.239001 2874724 QueryInstance.cpp:137] Storage Error: Not the leader of 34. Please retry later.

There is also a graphd crash on another node that I didn't touch.
We are executing inserts on the cluster during the service restart, so there is some load on the graphd services.

E20221207 12:05:35.360320 2895357 StorageClientBase-inl.h:206] Request to "****.86":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:05:35.361788 2895282 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:05:35.361891 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 111
E20221207 12:05:35.361937 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 55
E20221207 12:05:35.361946 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 47
E20221207 12:05:35.361954 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 127
E20221207 12:05:35.361961 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_LEADER_CHANGED, part 122
E20221207 12:05:35.361987 2895275 StorageAccessExecutor.h:136] Storage Error: part: 111, error: E_RPC_FAILURE(-3).
E20221207 12:05:35.362056 2895287 QueryInstance.cpp:137] Storage Error: part: 111, error: E_RPC_FAILURE(-3).

Your Environments (required)

3.3.0

How To Reproduce (required)

Steps to reproduce the behavior:

  • Restart one of the hosts while INSERT queries are running.
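For context, a minimal sketch of the kind of insert load running during the restart, assuming the nebula3-python client; the graphd address, credentials, space and tag names below are placeholders, not taken from this report:

from nebula3.Config import Config
from nebula3.gclient.net import ConnectionPool

# Placeholder graphd endpoint and credentials; adjust to the actual cluster.
config = Config()
config.max_connection_pool_size = 4
pool = ConnectionPool()
pool.init([('192.168.8.85', 9669)], config)
session = pool.get_session('root', 'nebula')
session.execute('USE test')

i = 0
while True:
    # Keep inserting while services on another node are stopped and started.
    resp = session.execute('INSERT VERTEX person(name) VALUES "p%d":("p%d")' % (i, i))
    if not resp.is_succeeded():
        # Transient errors such as "Not the leader of ..." or E_RPC_FAILURE
        # are expected here; a graphd crash is not.
        print(resp.error_msg())
    i += 1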

Expected behavior

This looks like a chaos use case: no crash should occur, although some write failures would be acceptable.
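Along those lines, a hedged sketch of how a client could tolerate such write failures by retrying; the retry count and backoff are arbitrary choices, not from this issue:

import time

def execute_with_retry(session, stmt, retries=5, backoff=2.0):
    # Retry a write across the leader changes / RPC failures seen in the logs above.
    resp = session.execute(stmt)
    for _ in range(retries):
        if resp.is_succeeded():
            return resp
        # e.g. "Storage Error: Not the leader of <part>. Please retry later."
        time.sleep(backoff)
        resp = session.execute(stmt)
    return resp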

Additional context

@goranc could help provide more information when needed.

@wey-gu wey-gu added the type/bug Type: something is unexpected label Dec 12, 2022
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Dec 12, 2022
@Sophie-Xie Sophie-Xie added this to the v3.4.0 milestone Dec 12, 2022
@xtcyclist
Contributor

xtcyclist commented Dec 12, 2022

It looks like storaged is involved in this crash. @pengweisong
At the very least, graphd should not crash due to failures in storaged. I'm going to check this.

@xtcyclist xtcyclist added the severity/major Severity of bug label Dec 12, 2022
@github-actions github-actions bot removed the severity/none Severity of bug label Dec 12, 2022
@critical27
Contributor

Does graphd have any coredump? This is a basic chaos scenario that was verified long ago, but I haven't checked it recently.

@HarrisChu @kikimo, do we have this case?

@goranc

goranc commented Dec 12, 2022

No coredump in this case.

@wey-gu
Contributor Author

wey-gu commented Dec 12, 2022

@goranc, about "When I stop services on one node of the cluster": which services are we talking about? Only storaged, or others as well?

@goranc

goranc commented Dec 12, 2022

I restarted all services on one node by invoking 'nebuladb.service stop all'.
After all services had stopped, as seen in the service status (it takes around 90s to stop the storaged service), I started all services on that node again and waited for them to come back online.
Then I continued with the restart on the next node of the cluster.
This procedure is common on many clusters and is usually called a 'rolling restart'.
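For reference, a minimal sketch of such a rolling-restart loop, assuming the nebula3-python client and SSH access to each node; the stop command is the one quoted above, while the start counterpart and everything else (host list, polling interval) are assumptions:

import subprocess
import time

def rolling_restart(session, hosts):
    for host in hosts:
        # Stop and start all services on this node.
        subprocess.run(['ssh', host, 'nebuladb.service stop all'], check=True)
        subprocess.run(['ssh', host, 'nebuladb.service start all'], check=True)
        # Wait until storaged on this node reports ONLINE again before moving on.
        while True:
            resp = session.execute('SHOW HOSTS')
            if resp.is_succeeded():
                rows = zip(resp.column_values('Host'), resp.column_values('Status'))
                if any(h.as_string() == host and s.as_string() == 'ONLINE' for h, s in rows):
                    break
            time.sleep(5)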

@xtcyclist xtcyclist added the need info Solution: need more information (ex. can't reproduce) label Dec 21, 2022
@xtcyclist
Contributor

xtcyclist commented Dec 21, 2022

Hi @goranc, crash issues are very important to us. Thanks for sharing. We are still having trouble reproducing this problem. Would you please share the overall topology of your cluster? How many nodes does the cluster have? How many services does one node have? Does one node run all types of services (metad, graphd, storaged)? More information may help us reproduce this problem.

@kikimo
Contributor

kikimo commented Dec 21, 2022

E20221207 12:05:35.361946 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 47

I don't quite recall

@kikimo
Contributor

kikimo commented Dec 21, 2022

Steps to reproduce the behavior:
restarting one of the hosts when there are INSERT queries.

Do the hosts here mean storaged nodes? @wey-gu @xtcyclist

@xtcyclist
Contributor

I restarted all services on one node by invoking 'nebuladb.service stop all'. After all services had stopped, as seen in the service status (it takes around 90s to stop the storaged service), I started all services on that node again and waited for them to come back online. Then I continued with the restart on the next node of the cluster. This procedure is common on many clusters and is usually called a 'rolling restart'.

According to this detailed description, I think all services within a node are to be restarted, which very likely includes some storaged services. @kikimo

@kikimo
Contributor

kikimo commented Dec 21, 2022

Are there any fatal logs or stderr logs, @wey-gu?

@xtcyclist
Contributor

Are there any fatal logs or stderr logs, @wey-gu?

We don't have more info on this for now.

@xtcyclist
Contributor

Hi @goranc, would you please help us confirm whether any services actually crashed? My colleague tried to reproduce your case, but only found that the connection that keeps inserting data gets disconnected, with no services crashing.

@goranc

goranc commented Dec 22, 2022

Hi,
I have an 8-node cluster with:
metad service on the first 3 nodes
storaged service on all nodes
graphd service on all nodes

When restarting services on a specific node, all services are restarted.

@xtcyclist xtcyclist removed this from the v3.4.0 milestone Jan 12, 2023
@Sophie-Xie
Contributor

Closing this for now; we'll reopen it if it reappears.
