Updating Node Metadata causes cluster failure #16517
Yikes! Thanks for the report @beninghton. I'll look into it immediately.
nomad 1.5.1
Thanks @blinkinglight. I reproduced it along with @shoenig, and I think I have a fix. Hoping to have a PR up tomorrow and a 1.5.2 going out soon after. In the meantime I pinned the issue; users should only update client metadata directly against the Node they're updating. Put another way: a bug in Server agents can cause crashes when they receive and forward Node Metadata requests.
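As a sketch of that interim workaround: the snippet below posts the metadata update directly to the HTTP API of the agent on the node being updated, rather than to an arbitrary server that may forward the RPC. The `/v1/client/metadata` path and `node_id` parameter come from the error log later in this issue; the address is a placeholder and the request-body shape (`{"Meta": {...}}`) is an assumption, so check the Nomad API docs for your version before relying on it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Placeholder address: the HTTP API of the agent running on the node
	// whose metadata is being updated (default Nomad HTTP port is 4646).
	nodeAddr := "http://client-node.example:4646"
	// Node ID taken from the report, purely for illustration.
	nodeID := "48bf2ad5-c4b4-cd5c-d470-aadc6b7938a1"

	// Assumed body shape: a "Meta" map of keys to values.
	body, err := json.Marshal(map[string]any{
		"Meta": map[string]string{"example-key": "example-value"},
	})
	if err != nil {
		panic(err)
	}

	url := fmt.Sprintf("%s/v1/client/metadata?node_id=%s", nodeAddr, nodeID)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```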
Fixes #16517 Given a 3 Server cluster with at least 1 Client connected to Follower 1: If a NodeMeta.{Apply,Read} request for the Client is received by Follower 1 with `AllowStale = false`, the Follower will forward the request to the Leader. The Leader, not being connected to the target Client, will forward the RPC to Follower 1. Follower 1, seeing AllowStale=false, will forward the request to the Leader. The Leader, not being connected to... well, hopefully you get the picture: an infinite loop occurs.
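For readers trying to follow the loop, here is a minimal toy model of that forwarding behavior. It is not the Nomad source; the types and field names (other than `AllowStale`) are invented for illustration, and it only shows why a stale-unaware forwarding rule can ping-pong a request between a Follower and the Leader forever.

```go
package main

import "fmt"

// Server is a toy model of a Nomad server agent. Real servers forward RPCs
// over the network; here we just call each other's methods directly.
type Server struct {
	name        string
	isLeader    bool
	leader      *Server // where a follower forwards non-stale requests
	hasClient   bool    // whether the target client is connected to this server
	clientOwner *Server // the server the target client is connected to
}

// NodeMetaApply models the buggy forwarding logic described above: a follower
// always forwards to the leader when AllowStale is false, and the leader
// always forwards to the server that holds the client connection. Neither hop
// clears the condition that caused the forward, so the request bounces back
// and forth until resources are exhausted.
func (s *Server) NodeMetaApply(allowStale bool, hops int) {
	if hops > 5 { // the real bug has no such guard; we stop early for the demo
		fmt.Println("... and so on: infinite forwarding loop")
		return
	}
	fmt.Printf("hop %d: %s handling request\n", hops, s.name)

	if !s.isLeader && !allowStale {
		// Follower 1: not allowed to serve a stale view, forward to the Leader.
		s.leader.NodeMetaApply(allowStale, hops+1)
		return
	}
	if s.isLeader && !s.hasClient {
		// Leader: the client isn't connected here, forward to its server,
		// which is Follower 1 again, still with AllowStale=false.
		s.clientOwner.NodeMetaApply(allowStale, hops+1)
		return
	}
	fmt.Printf("%s applies the metadata update\n", s.name)
}

func main() {
	leader := &Server{name: "leader", isLeader: true}
	follower1 := &Server{name: "follower-1", leader: leader, hasClient: true}
	leader.clientOwner = follower1

	// AllowStale=false triggers the loop: follower-1 -> leader -> follower-1 -> ...
	follower1.NodeMetaApply(false, 0)
}
```

The linked fix presumably breaks this cycle, for example by letting the request be served by the server that actually holds the client connection instead of re-forwarding it; see the PR itself for the exact change.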
Nomad version
Nomad v1.5.0
BuildDate 2023-03-01T10:11:42Z
Revision fc40c49
Operating system and Environment details
NAME="Fedora Linux"
VERSION="35.20220313.3.1 (CoreOS)"
Issue
Updating node metadata via the UI causes a cluster Raft failure.
Reproduction steps
The cluster has 3 nodes.
Updating metadata usually works, but sometimes we hit the following:
We update metadata via the UI (click Edit, then click Update), and nothing happens.
Then one or two nodes of the cluster become unavailable.
Expected Result
Metadata is updated.
Actual Result
Cluster fails.
Also, while this is happening, all RAM on the failed node is consumed (4 GB total; normally ~3 GB is free when the cluster is healthy).
Restarting the nomad agent on the server helps, but sometimes we have to reboot the server because we cannot connect to it at all: its CPU and RAM are 100% utilized.
Job file (if appropriate)
Nomad Server logs (if appropriate)
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.272Z [ERROR] http: request failed: method=POST path=/v1/client/metadata?node_id=48bf2ad5-c4b4-cd5c-d470-aadc6b7938a1 error="rpc error: EOF" code=500
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.294Z [WARN] nomad.stats_fetcher: error getting server health: server=nomad-server-3.vsl.wsoft.live.global error="rpc error: EOF"
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.963Z [WARN] nomad.raft: failed to contact: server-id=9b389def-c056-2f59-df08-6cc65922b376 time=634.015909ms
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.970Z [WARN] nomad.raft: failed to contact quorum of nodes, stepping down
Mar 16 10:56:26 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:26.010Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 94a0b9b7-0980-2222-d912-1bf88ab8832e 10.1.13.203:4647}" error="msgpack decode error [pos 0]: read tcp 10.1.13.202:49134->10.1.13.203:4647: i/o timeout"
Mar 16 10:56:26 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:26.216Z [WARN] nomad.stats_fetcher: failed retrieving server health: server=nomad-server-2.vsl.wsoft.live.global error="context deadline exceeded"
Mar 16 10:56:26 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:26.221Z [WARN] nomad.stats_fetcher: failed retrieving server health: server=nomad-server-1.vsl.wsoft.live.global error="context deadline exceeded"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.285Z [INFO] nomad.raft: entering follower state: follower="Node at 10.1.13.202:4647 [Follower]" leader-address= leader-id=
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.337Z [INFO] nomad.raft: aborting pipeline replication: peer="{Voter 9b389def-c056-2f59-df08-6cc65922b376 10.1.13.201:4647}"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.350Z [INFO] nomad: cluster leadership lost
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.350Z [WARN] nomad.stats_fetcher: failed retrieving server health: server=nomad-server-2.vsl.wsoft.live.global error="context canceled"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.939Z [ERROR] nomad.autopilot: Error when computing next state: error="context canceled"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.962Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.1.13.203:4647 backoff time=40ms error="msgpack decode error [pos 0]: read tcp 10.1.13.202:49136->10.1.13.203:4647: i/o timeout"
Mar 16 10:56:33 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:31.103Z [ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
Mar 16 10:56:53 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:34.933Z [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"
Nomad Client logs (if appropriate)