
graphd crash during DML when node being restarted #5041

Closed
wey-gu opened this issue Dec 12, 2022 · 14 comments
Labels
affects/none - PR/issue: this bug affects none version.
need info - Solution: need more information (ex. can't reproduce)
process/fixed - Process of bug
severity/major - Severity of bug
type/bug - Type: something is unexpected

Comments

@wey-gu
Contributor

wey-gu commented Dec 12, 2022

Please check the FAQ documentation before raising an issue

Describe the bug (required)

When I stop the services on one node of the cluster, the graphd service on some other node usually crashes (or stops)!
This is from the graphd log on the node whose graphd service crashed:

E20221207 12:11:41.647439 2874748 StorageClientBase-inl.h:206] Request to "****.87":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:11:41.648522 2874721 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:11:41.648615 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 127
E20221207 12:11:41.648627 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_RPC_FAILURE, part 73
E20221207 12:11:41.648634 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 71
E20221207 12:11:41.648659 2874721 QueryInstance.cpp:137] Storage Error: Not the leader of 127. Please retry later.
E20221207 12:11:43.679056 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 73
E20221207 12:11:43.679096 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 57
E20221207 12:11:43.679103 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 105
E20221207 12:11:43.679122 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 79
E20221207 12:11:43.679129 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_LEADER_CHANGED, part 49
E20221207 12:11:43.679150 2874721 QueryInstance.cpp:137] Storage Error: Not the leader of 73. Please retry later.
E20221207 12:12:11.708324 2874738 StorageClientBase-inl.h:206] Request to "****.87":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:12:11.711513 2874721 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:12:11.711608 2874721 StorageAccessExecutor.h:39] InsertEdgesExecutor failed, error E_RPC_FAILURE, part 36
E20221207 12:12:11.711623 2874721 StorageAccessExecutor.h:136] Storage Error: part: 36, error: E_RPC_FAILURE(-3).
E20221207 12:12:11.711642 2874721 QueryInstance.cpp:137] Storage Error: part: 36, error: E_RPC_FAILURE(-3).
E20221207 12:12:11.756379 2874765 StorageClientBase-inl.h:206] Request to "****.87":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:12:11.757710 2874718 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:12:11.757824 2874718 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 92
E20221207 12:12:11.757874 2874718 StorageAccessExecutor.h:136] Storage Error: part: 92, error: E_RPC_FAILURE(-3).
E20221207 12:12:11.757901 2874721 QueryInstance.cpp:137] Storage Error: part: 92, error: E_RPC_FAILURE(-3).
E20221207 12:14:57.237293 2874756 StorageClientBase-inl.h:206] Request to "****.88":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:14:57.238889 2874713 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:14:57.238956 2874724 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_LEADER_CHANGED, part 34
E20221207 12:14:57.238977 2874724 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 82
E20221207 12:14:57.239001 2874724 QueryInstance.cpp:137] Storage Error: Not the leader of 34. Please retry later.

There is also a graphd crash on another node that I didn't touch.
We are executing inserts on the cluster during the service restart, so there is some load on the graphd services.

E20221207 12:05:35.360320 2895357 StorageClientBase-inl.h:206] Request to "****.86":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:05:35.361788 2895282 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221207 12:05:35.361891 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 111
E20221207 12:05:35.361937 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 55
E20221207 12:05:35.361946 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 47
E20221207 12:05:35.361954 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 127
E20221207 12:05:35.361961 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_LEADER_CHANGED, part 122
E20221207 12:05:35.361987 2895275 StorageAccessExecutor.h:136] Storage Error: part: 111, error: E_RPC_FAILURE(-3).
E20221207 12:05:35.362056 2895287 QueryInstance.cpp:137] Storage Error: part: 111, error: E_RPC_FAILURE(-3).

Your Environments (required)

3.3.0

How To Reproduce (required)

Steps to reproduce the behavior:

  • Restart one of the hosts while INSERT queries are running.
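For context, a minimal sketch of the kind of insert load running during the restart, assuming the nebula3-python client; the graphd address, credentials, space and tag names below are placeholders, not taken from this report:

from nebula3.Config import Config
from nebula3.gclient.net import ConnectionPool

# Placeholder graphd endpoint and credentials; adjust to the actual cluster.
config = Config()
config.max_connection_pool_size = 4
pool = ConnectionPool()
pool.init([('192.168.8.85', 9669)], config)
session = pool.get_session('root', 'nebula')
session.execute('USE test')

i = 0
while True:
    # Keep inserting while services on another node are stopped and started.
    resp = session.execute('INSERT VERTEX person(name) VALUES "p%d":("p%d")' % (i, i))
    if not resp.is_succeeded():
        # Transient errors such as "Not the leader of ..." or E_RPC_FAILURE
        # are expected here; a graphd crash is not.
        print(resp.error_msg())
    i += 1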

Expected behavior

This looks like a chaos use case: no crash should occur, although some write failures would be acceptable.
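Along those lines, a hedged sketch of how a client could tolerate such write failures by retrying; the retry count and backoff are arbitrary choices, not from this issue:

import time

def execute_with_retry(session, stmt, retries=5, backoff=2.0):
    # Retry a write across the leader changes / RPC failures seen in the logs above.
    resp = session.execute(stmt)
    for _ in range(retries):
        if resp.is_succeeded():
            return resp
        # e.g. "Storage Error: Not the leader of <part>. Please retry later."
        time.sleep(backoff)
        resp = session.execute(stmt)
    return resp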

Additional context

@goranc could help provide more information when needed.

@wey-gu wey-gu added the type/bug Type: something is unexpected label Dec 12, 2022
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Dec 12, 2022
@Sophie-Xie Sophie-Xie added this to the v3.4.0 milestone Dec 12, 2022
@xtcyclist
Contributor

xtcyclist commented Dec 12, 2022

It looks like storaged is involved in this crash. @pengweisong
At the very least, graphd should not crash due to failures in storaged. I'm going to check this.

@xtcyclist xtcyclist added the severity/major Severity of bug label Dec 12, 2022
@github-actions github-actions bot removed the severity/none Severity of bug label Dec 12, 2022
@critical27
Contributor

Does graphd have any coredump? This is a basic chaos scenario that was verified long ago, but I haven't checked it recently.

@HarrisChu @kikimo, do we have this case?

@goranc

goranc commented Dec 12, 2022

No coredump in this case.

@wey-gu
Contributor Author

wey-gu commented Dec 12, 2022

@goranc, about "When I stop services on one node of the cluster": which services are we talking about? Only storaged, or others as well?

@goranc

goranc commented Dec 12, 2022

I restarted all services on one node by invoking 'nebuladb.service stop all'.
After all services had stopped, as seen in the service status (it takes around 90s to stop the storaged service), I started all services on that node again and waited for them to come back online.
Then I continued with the restart on the next node of the cluster.
This procedure is common on many clusters and is usually called a 'rolling restart'.
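For reference, a minimal sketch of such a rolling-restart loop, assuming the nebula3-python client and SSH access to each node; the stop command is the one quoted above, while the start counterpart and everything else (host list, polling interval) are assumptions:

import subprocess
import time

def rolling_restart(session, hosts):
    for host in hosts:
        # Stop and start all services on this node.
        subprocess.run(['ssh', host, 'nebuladb.service stop all'], check=True)
        subprocess.run(['ssh', host, 'nebuladb.service start all'], check=True)
        # Wait until storaged on this node reports ONLINE again before moving on.
        while True:
            resp = session.execute('SHOW HOSTS')
            if resp.is_succeeded():
                rows = zip(resp.column_values('Host'), resp.column_values('Status'))
                if any(h.as_string() == host and s.as_string() == 'ONLINE' for h, s in rows):
                    break
            time.sleep(5)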

@xtcyclist xtcyclist added the need info Solution: need more information (ex. can't reproduce) label Dec 21, 2022
@xtcyclist
Contributor

xtcyclist commented Dec 21, 2022

Hi @goranc, crash issues are very important to us. Thanks for sharing. We are still having trouble reproducing this problem. Would you please share the overall topology of your cluster? How many nodes does the cluster have? How many services does one node have? Does one node run all types of services (metad, graphd, storaged)? More information may help us reproduce this problem.

@kikimo
Contributor

kikimo commented Dec 21, 2022

E20221207 12:05:35.361946 2895275 StorageAccessExecutor.h:39] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 47

I don't quite recall

@kikimo
Contributor

kikimo commented Dec 21, 2022

Steps to reproduce the behavior:
restarting one of the hosts when there are INSERT queries.

Do the hosts here mean storaged nodes? @wey-gu @xtcyclist

@xtcyclist
Contributor

I restarted all services on one node by invoking 'nebuladb.service stop all'. After all services had stopped, as seen in the service status (it takes around 90s to stop the storaged service), I started all services on that node again and waited for them to come back online. Then I continued with the restart on the next node of the cluster. This procedure is common on many clusters and is usually called a 'rolling restart'.

According to this detailed description, I think all services within a node are to be restarted, which very likely includes some storaged services. @kikimo

@kikimo
Contributor

kikimo commented Dec 21, 2022

Are there any fatal logs or stderr logs, @wey-gu?

@xtcyclist
Contributor

Are there any fatal logs or stderr logs, @wey-gu?

We don't have more info on this for now.

@xtcyclist
Contributor

Hi @goranc, would you please help us confirm whether any services actually crashed? My colleague tried to reproduce your case, but only found that the connection that keeps inserting data gets disconnected, with no services crashing.

@goranc

goranc commented Dec 22, 2022

Hi,
I have an 8-node cluster with:
metad service on the first 3 nodes
storaged service on all nodes
graphd service on all nodes

When restarting services on a specific node, all services are restarted.

@xtcyclist xtcyclist removed this from the v3.4.0 milestone Jan 12, 2023
@Sophie-Xie
Contributor

Closing this for now; we'll reopen it if it reappears.
