(1.9.0) missed heartbeats don't mark nodes down #24231

dxdc · 2024-10-16T20:36:02Z

After upgrading to Nomad v1.9.0 from v1.8.4, all client nodes report as "ready", regardless of their actual "ready" status.

Nomad version

Nomad v1.9.0
BuildDate 2024-10-10T07:13:43Z
Revision 7ad3685

Operating system and Environment details

Ubuntu 24.04 LTS

Issue

After upgrading servers from 1.8.4 to 1.9.0, all clients are reporting as "ready" even when they are down.

Reproduction steps

Not sure if it's needed, but this is the client ACL policy:

agent {
  policy = "read"
}

node {
  policy = "write"
}

namespace "*" {
  policy       = "read"
  capabilities = ["submit-job", "dispatch-job", "read-logs", "read-fs", "alloc-exec", "alloc-node-exec"]
}

Expected Result

Clients should report as down

Actual Result

Clients always report as ready

Nomad Server logs (if appropriate)

Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [WARN]  nomad.heartbeat: node TTL expired: node_id=5de1e8af-e595-6bdf-4221-9f01dea50822
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [ERROR] nomad.heartbeat: update node status failed: error="Permission denied"
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [WARN]  nomad.heartbeat: node TTL expired: node_id=9cb8c3cf-c47c-1a81-7182-c6e94f8f6ab9
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [ERROR] nomad.heartbeat: update node status failed: error="Permission denied"
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [WARN]  nomad.heartbeat: node TTL expired: node_id=afdec3e7-dab0-e5ec-adc6-61eb98456bbf
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [ERROR] nomad.heartbeat: update node status failed: error="Permission denied"
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [WARN]  nomad.heartbeat: node TTL expired: node_id=c347178d-69a2-75a9-c746-d9a546a538fc
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [ERROR] nomad.heartbeat: update node status failed: error="Permission denied"
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [WARN]  nomad.heartbeat: node TTL expired: node_id=f5a09534-7a1c-7a5f-9836-978e375800e7
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [ERROR] nomad.heartbeat: update node status failed: error="Permission denied"

Nomad Client logs (if appropriate)

N/A. Client logs don't indicate any issues.

The text was updated successfully, but these errors were encountered:

dxdc · 2024-10-16T20:36:36Z

Btw, this also happens with brand new clients that weren't previously registered.

tgross · 2024-10-16T20:51:20Z

Hi @dxdc! Sorry to hear you're seeing trouble. #23838 feels like the obvious culprit here but I'm not sure how off the top of my head. Can you confirm that all your clients are also 1.9? Or at least >1.6.0 (see https://developer.hashicorp.com/nomad/docs/upgrade/upgrade-specific#dropped-support-for-older-clients)?

dxdc · 2024-10-16T20:53:17Z

thanks @tgross! yes. Clients with versions 1.8.4 and 1.9.0 have the same issue.

Is downgrading the server to 1.8.4 an option (and, as simple as replacing the binary)? I'd be happy to test that.

tgross · 2024-10-17T14:49:13Z

Ok @dxdc I've reproduced this and tracked down the cause. In #23838 we updated the RPC handler we use for heartbeats (Node.Update) to be more strict about requiring node secrets. This works fine for requests from the nodes. But when a node goes down, the server that the node was connected to sends that same RPC message. The server isn't properly authenticating that request. I've got a couple different ideas for how to fix this... I'll have a PR up as soon as possible.

Is downgrading the server to 1.8.4 an option (and, as simple as replacing the binary)?

Once the state store has written Raft logs that an older version doesn't understand, you can't rollback (see Upgrading) without replacing the state. You can get away with that with some patch-version updates but 1.9.0 won't be one of them because we added new Raft keyring entries.

dxdc · 2024-10-17T15:02:16Z

Thx @tgross. I ended up just rebuilding the entire cluster.

Shouldn't release 1.9.0 be yanked while a patch is made? This seems like a serious defect.

tgross · 2024-10-17T15:11:24Z

Shouldn't release 1.9.0 be yanked while a patch is made? This seems like a serious defect.

Agreed, and there's another set of fairly serious defects around the docker driver as well. Unfortunately we don't have a facility for "yanking" releases short of pinning issues warning folks about upgrades. Patch should ship fairly soon though.

In #23838 we updated the `Node.Update` RPC handler we use for heartbeats to be more strict about requiring node secrets. But when a node goes down, it's the leader that sends the request to mark the node down via `Node.Update` (to itself), and this request was missing the leader ACL needed to authenticate to itself. Add the leader ACL to the request and update the RPC handler test for disconnected-clients to use ACLs, which would have detected this bug. Also added a note to the `Authenticate` comment about how that authentication path requires the leader ACL. Fixes: #24231 Ref: https://hashicorp.atlassian.net/browse/NET-11384

tgross · 2024-10-17T17:46:38Z

When #24241 merges this issue will get closed. I've retitled and pinned it to make sure it's obvious to passerby this is an important 1.9.0 regression. We're having an internal discussion about when we can ship 1.9.1 but expect that'll be soon.

In #23838 we updated the `Node.Update` RPC handler we use for heartbeats to be more strict about requiring node secrets. But when a node goes down, it's the leader that sends the request to mark the node down via `Node.Update` (to itself), and this request was missing the leader ACL needed to authenticate to itself. Add the leader ACL to the request and update the RPC handler test for disconnected-clients to use ACLs, which would have detected this bug. Also added a note to the `Authenticate` comment about how that authentication path requires the leader ACL. Fixes: #24231 Ref: https://hashicorp.atlassian.net/browse/NET-11384

dxdc added the type/bug label Oct 16, 2024

tgross added the theme/heartbeating label Oct 16, 2024

tgross added the hcc/jira label Oct 17, 2024

tgross self-assigned this Oct 17, 2024

tgross mentioned this issue Oct 17, 2024

heartbeat: use leader's ACL token when failing heartbeat #24241

Merged

tgross added this to the 1.9.1 milestone Oct 17, 2024

tgross changed the title ~~All clients report as ready, nomad.heartbeat is showing permission denied~~ (1.9.0) missed heartbeats don't mark nodes down Oct 17, 2024

tgross pinned this issue Oct 17, 2024

tgross closed this as completed in #24241 Oct 17, 2024

hc-github-team-nomad-core mentioned this issue Oct 17, 2024

Backport of heartbeat: use leader's ACL token when failing heartbeat into release/1.9.x #24248

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(1.9.0) missed heartbeats don't mark nodes down #24231

(1.9.0) missed heartbeats don't mark nodes down #24231

dxdc commented Oct 16, 2024

dxdc commented Oct 16, 2024

tgross commented Oct 16, 2024

dxdc commented Oct 16, 2024

tgross commented Oct 17, 2024

dxdc commented Oct 17, 2024

tgross commented Oct 17, 2024

tgross commented Oct 17, 2024

(1.9.0) missed heartbeats don't mark nodes down #24231

(1.9.0) missed heartbeats don't mark nodes down #24231

Comments

dxdc commented Oct 16, 2024

Nomad version

Operating system and Environment details

Issue

Reproduction steps

Expected Result

Actual Result

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

dxdc commented Oct 16, 2024

tgross commented Oct 16, 2024

dxdc commented Oct 16, 2024

tgross commented Oct 17, 2024

dxdc commented Oct 17, 2024

tgross commented Oct 17, 2024

tgross commented Oct 17, 2024