
Updating Node Metadata causes cluster failure #16517

Closed
beninghton opened this issue Mar 16, 2023 · 3 comments · Fixed by #16549

Comments


beninghton commented Mar 16, 2023

Nomad version

Nomad v1.5.0
BuildDate 2023-03-01T10:11:42Z
Revision fc40c49

Operating system and Environment details

NAME="Fedora Linux"
VERSION="35.20220313.3.1 (CoreOS)"

Issue

Updating metadata via UI causes cluster raft failure

Reproduction steps

The cluster has 3 nodes.
Updating metadata usually works fine, but sometimes we see the following:
we update metadata via the UI (click Edit, then Update) and nothing happens.
Then one or two nodes of the cluster become unavailable.

Expected Result

Metadata is updated.

Actual Result

Cluster fails.
Also, during this issue all RAM on the failed node is consumed (4 GB total; ~3 GB is usually free when the cluster operates normally).
Restarting the Nomad agent on the server helps, but sometimes we have to reboot the server because we cannot connect to it at all: its CPU and RAM are 100% utilized.

Job file (if appropriate)

Nomad Server logs (if appropriate)

Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.272Z [ERROR] http: request failed: method=POST path=/v1/client/metadata?node_id=48bf2ad5-c4b4-cd5c-d470-aadc6b7938a1 error="rpc error: EOF" code=500
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.294Z [WARN] nomad.stats_fetcher: error getting server health: server=nomad-server-3.vsl.wsoft.live.global error="rpc error: EOF"
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.963Z [WARN] nomad.raft: failed to contact: server-id=9b389def-c056-2f59-df08-6cc65922b376 time=634.015909ms
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.970Z [WARN] nomad.raft: failed to contact quorum of nodes, stepping down
Mar 16 10:56:26 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:26.010Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 94a0b9b7-0980-2222-d912-1bf88ab8832e 10.1.13.203:4647}" error="msgpack decode error [pos 0]: read tcp 10.1.13.202:49134->10.1.13.203:4647: i/o timeout"
Mar 16 10:56:26 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:26.216Z [WARN] nomad.stats_fetcher: failed retrieving server health: server=nomad-server-2.vsl.wsoft.live.global error="context deadline exceeded"
Mar 16 10:56:26 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:26.221Z [WARN] nomad.stats_fetcher: failed retrieving server health: server=nomad-server-1.vsl.wsoft.live.global error="context deadline exceeded"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.285Z [INFO] nomad.raft: entering follower state: follower="Node at 10.1.13.202:4647 [Follower]" leader-address= leader-id=
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.337Z [INFO] nomad.raft: aborting pipeline replication: peer="{Voter 9b389def-c056-2f59-df08-6cc65922b376 10.1.13.201:4647}"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.350Z [INFO] nomad: cluster leadership lost
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.350Z [WARN] nomad.stats_fetcher: failed retrieving server health: server=nomad-server-2.vsl.wsoft.live.global error="context canceled"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.939Z [ERROR] nomad.autopilot: Error when computing next state: error="context canceled"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.962Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.1.13.203:4647 backoff time=40ms error="msgpack decode error [pos 0]: read tcp 10.1.13.202:49136->10.1.13.203:4647: i/o timeout"
Mar 16 10:56:33 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:31.103Z [ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
Mar 16 10:56:53 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:34.933Z [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"

Nomad Client logs (if appropriate)

@schmichael
Member

Yikes! Thanks for the report @beninghton. I'll look into it immediately.

@schmichael schmichael added this to the 1.5.2 milestone Mar 16, 2023

blinkinglight commented Mar 16, 2023

Nomad 1.5.1
Can confirm. 3-node cluster: 2 nodes run both the client and server roles, and 1 node is client-only.
If I add metadata to the server nodes, the cluster fails; but if I add it to the client-only node, it works in the cluster too, without any crash.
By the way, the cluster nodes took 50 GB of RAM and it kept growing...

@schmichael schmichael changed the title Updating Metadata causes cluster failure Updating Node Metadata causes cluster failure Mar 16, 2023
@schmichael schmichael pinned this issue Mar 16, 2023
@schmichael
Member

Thanks @blinkinglight. I reproduced it along with @shoenig, and I think I have a fix. Hoping to have a PR up tomorrow and a 1.5.2 going out soon after.

In the meantime I have pinned the issue; users should only update client metadata directly against the Node they're updating. Put another way: a bug in Server agents can cause crashes when they receive and forward Node Metadata requests.
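For illustration only, here is a hedged sketch of that workaround: send the update straight to the HTTP API of the client agent that owns the node, so no Server has to forward the RPC. The `/v1/client/metadata` path, the POST method, and the `node_id` parameter are taken from the server logs above; the client HTTP port 4646 and the `{"Meta": {...}}` request body shape are assumptions, not verified against the API docs.

```go
// Hedged sketch: apply dynamic node metadata by talking directly to the
// client agent that owns the node, avoiding the Server-side forwarding path.
// The /v1/client/metadata path, POST method, and node_id parameter appear in
// the logs above; the 4646 port and {"Meta": ...} body shape are assumptions.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func strPtr(s string) *string { return &s }

func main() {
	clientAddr := "http://127.0.0.1:4646"            // placeholder: the client agent's own HTTP address
	nodeID := "48bf2ad5-c4b4-cd5c-d470-aadc6b7938a1" // node ID taken from the logs above

	// Example key/value; a nil value would presumably unset a key.
	payload := map[string]map[string]*string{
		"Meta": {"rack": strPtr("r1")},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		panic(err)
	}

	url := fmt.Sprintf("%s/v1/client/metadata?node_id=%s", clientAddr, nodeID)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

Going through the client agent directly keeps the request off the Server RPC forwarding path that triggers the crash.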

schmichael added a commit that referenced this issue Mar 17, 2023
Fixes #16517

Given a 3 Server cluster with at least 1 Client connected to Follower 1:

If a NodeMeta.{Apply,Read} request for the Client is received by
Follower 1 with `AllowStale = false`, the Follower will forward the
request to the Leader.

The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.

Follower 1, seeing AllowStale=false, will forward the request to the
Leader.

The Leader, not being connected to... well, hopefully you get the
picture: an infinite loop occurs.
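A minimal sketch of the loop described in the commit message, under the assumption of heavily simplified names (the struct and handler below are illustrative, not Nomad's actual source):

```go
// Simplified model of the forwarding loop: with AllowStale=false the
// follower always forwards to the leader, and the leader, not being
// connected to the target client, forwards back to the follower that is.
package main

import "fmt"

type server struct {
	name         string
	isLeader     bool
	hasClient    bool    // is the target client connected to this server?
	leader       *server // current cluster leader
	clientServer *server // server the target client is connected to
}

// handleNodeMetaApply mimics the buggy RPC path: the stale flag is never
// honored or cleared when forwarding, so the request ping-pongs forever.
func (s *server) handleNodeMetaApply(allowStale bool, hops int) {
	fmt.Printf("hop %d: %s\n", hops, s.name)
	if hops > 5 { // guard so this demo terminates; the real bug had no such guard
		fmt.Println("... infinite loop (each hop allocates, so memory grows)")
		return
	}
	if !allowStale && !s.isLeader {
		// Follower 1 forwards to the Leader because AllowStale = false.
		s.leader.handleNodeMetaApply(allowStale, hops+1)
		return
	}
	if !s.hasClient {
		// The Leader is not connected to the client, so it forwards to the
		// server that is (Follower 1), and the cycle repeats.
		s.clientServer.handleNodeMetaApply(allowStale, hops+1)
		return
	}
	fmt.Println("would deliver the RPC to the client here")
}

func main() {
	leader := &server{name: "Leader", isLeader: true}
	follower1 := &server{name: "Follower 1", hasClient: true, leader: leader}
	leader.clientServer = follower1
	follower1.handleNodeMetaApply(false /* AllowStale */, 0)
}
```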
schmichael added a commit that referenced this issue Mar 20, 2023
@shoenig shoenig unpinned this issue Apr 4, 2023