Updating Node Metadata causes cluster failure #16517
Yikes! Thanks for the report @beninghton. I'll look into it immediately.
nomad 1.5.1
Thanks @blinkinglight. I reproduced it along with @shoenig, and I think I have a fix. Hoping to have a PR up tomorrow and a 1.5.2 going out soon after. In the meantime I pinned the issue; users should only update client metadata directly against the Node they're updating. Put another way: a bug in Server agents can cause crashes when they receive and forward Node Metadata requests.
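As a sketch of that interim workaround: the snippet below posts the metadata update directly to the HTTP API of the agent on the node being updated, rather than to an arbitrary server that may forward the RPC. The `/v1/client/metadata` path and `node_id` parameter come from the error log later in this issue; the address is a placeholder and the request-body shape (`{"Meta": {...}}`) is an assumption, so check the Nomad API docs for your version before relying on it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Placeholder address: the HTTP API of the agent running on the node
	// whose metadata is being updated (default Nomad HTTP port is 4646).
	nodeAddr := "http://client-node.example:4646"
	// Node ID taken from the report, purely for illustration.
	nodeID := "48bf2ad5-c4b4-cd5c-d470-aadc6b7938a1"

	// Assumed body shape: a "Meta" map of keys to values.
	body, err := json.Marshal(map[string]any{
		"Meta": map[string]string{"example-key": "example-value"},
	})
	if err != nil {
		panic(err)
	}

	url := fmt.Sprintf("%s/v1/client/metadata?node_id=%s", nodeAddr, nodeID)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```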
Fixes #16517 Given a 3 Server cluster with at least 1 Client connected to Follower 1: If a NodeMeta.{Apply,Read} request for the Client is received by Follower 1 with `AllowStale = false`, the Follower will forward the request to the Leader. The Leader, not being connected to the target Client, will forward the RPC to Follower 1. Follower 1, seeing AllowStale=false, will forward the request to the Leader. The Leader, not being connected to... well, hopefully you get the picture: an infinite loop occurs.
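For readers trying to follow the loop, here is a minimal toy model of that forwarding behavior. It is not the Nomad source; the types and field names (other than `AllowStale`) are invented for illustration, and it only shows why a stale-unaware forwarding rule can ping-pong a request between a Follower and the Leader forever.

```go
package main

import "fmt"

// Server is a toy model of a Nomad server agent. Real servers forward RPCs
// over the network; here we just call each other's methods directly.
type Server struct {
	name        string
	isLeader    bool
	leader      *Server // where a follower forwards non-stale requests
	hasClient   bool    // whether the target client is connected to this server
	clientOwner *Server // the server the target client is connected to
}

// NodeMetaApply models the buggy forwarding logic described above: a follower
// always forwards to the leader when AllowStale is false, and the leader
// always forwards to the server that holds the client connection. Neither hop
// clears the condition that caused the forward, so the request bounces back
// and forth until resources are exhausted.
func (s *Server) NodeMetaApply(allowStale bool, hops int) {
	if hops > 5 { // the real bug has no such guard; we stop early for the demo
		fmt.Println("... and so on: infinite forwarding loop")
		return
	}
	fmt.Printf("hop %d: %s handling request\n", hops, s.name)

	if !s.isLeader && !allowStale {
		// Follower 1: not allowed to serve a stale view, forward to the Leader.
		s.leader.NodeMetaApply(allowStale, hops+1)
		return
	}
	if s.isLeader && !s.hasClient {
		// Leader: the client isn't connected here, forward to its server,
		// which is Follower 1 again, still with AllowStale=false.
		s.clientOwner.NodeMetaApply(allowStale, hops+1)
		return
	}
	fmt.Printf("%s applies the metadata update\n", s.name)
}

func main() {
	leader := &Server{name: "leader", isLeader: true}
	follower1 := &Server{name: "follower-1", leader: leader, hasClient: true}
	leader.clientOwner = follower1

	// AllowStale=false triggers the loop: follower-1 -> leader -> follower-1 -> ...
	follower1.NodeMetaApply(false, 0)
}
```

The linked fix presumably breaks this cycle, for example by letting the request be served by the server that actually holds the client connection instead of re-forwarding it; see the PR itself for the exact change.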
Nomad version
Nomad v1.5.0
BuildDate 2023-03-01T10:11:42Z
Revision fc40c49
Operating system and Environment details
NAME="Fedora Linux"
VERSION="35.20220313.3.1 (CoreOS)"
Issue
Updating node metadata via the UI causes a cluster Raft failure.
Reproduction steps
The cluster has 3 nodes.
Updating metadata usually works, but sometimes we hit the following:
We update metadata via the UI (click Edit, then click Update), and nothing happens.
Then one or two nodes of the cluster become unavailable.
Expected Result
Metadata is updated.
Actual Result
Cluster fails.
Also, while this is happening, all RAM on the failed node is consumed (4 GB total; normally ~3 GB is free when the cluster is healthy).
Restarting the nomad agent on the server helps, but sometimes we have to reboot the server because we cannot connect to it at all: its CPU and RAM are 100% utilized.
Job file (if appropriate)
Nomad Server logs (if appropriate)
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.272Z [ERROR] http: request failed: method=POST path=/v1/client/metadata?node_id=48bf2ad5-c4b4-cd5c-d470-aadc6b7938a1 error="rpc error: EOF" code=500
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.294Z [WARN] nomad.stats_fetcher: error getting server health: server=nomad-server-3.vsl.wsoft.live.global error="rpc error: EOF"
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.963Z [WARN] nomad.raft: failed to contact: server-id=9b389def-c056-2f59-df08-6cc65922b376 time=634.015909ms
Mar 16 10:56:25 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:25.970Z [WARN] nomad.raft: failed to contact quorum of nodes, stepping down
Mar 16 10:56:26 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:26.010Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 94a0b9b7-0980-2222-d912-1bf88ab8832e 10.1.13.203:4647}" error="msgpack decode error [pos 0]: read tcp 10.1.13.202:49134->10.1.13.203:4647: i/o timeout"
Mar 16 10:56:26 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:26.216Z [WARN] nomad.stats_fetcher: failed retrieving server health: server=nomad-server-2.vsl.wsoft.live.global error="context deadline exceeded"
Mar 16 10:56:26 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:26.221Z [WARN] nomad.stats_fetcher: failed retrieving server health: server=nomad-server-1.vsl.wsoft.live.global error="context deadline exceeded"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.285Z [INFO] nomad.raft: entering follower state: follower="Node at 10.1.13.202:4647 [Follower]" leader-address= leader-id=
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.337Z [INFO] nomad.raft: aborting pipeline replication: peer="{Voter 9b389def-c056-2f59-df08-6cc65922b376 10.1.13.201:4647}"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.350Z [INFO] nomad: cluster leadership lost
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.350Z [WARN] nomad.stats_fetcher: failed retrieving server health: server=nomad-server-2.vsl.wsoft.live.global error="context canceled"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.939Z [ERROR] nomad.autopilot: Error when computing next state: error="context canceled"
Mar 16 10:56:27 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:27.962Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.1.13.203:4647 backoff time=40ms error="msgpack decode error [pos 0]: read tcp 10.1.13.202:49136->10.1.13.203:4647: i/o timeout"
Mar 16 10:56:33 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:31.103Z [ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
Mar 16 10:56:53 nomad-server-2.vsl.wsoft.live nomad[7667]: 2023-03-16T10:56:34.933Z [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"
Nomad Client logs (if appropriate)