Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(1.9.0) missed heartbeats don't mark nodes down #24231

Closed
dxdc opened this issue Oct 16, 2024 · 7 comments · Fixed by #24241
Closed

(1.9.0) missed heartbeats don't mark nodes down #24231

dxdc opened this issue Oct 16, 2024 · 7 comments · Fixed by #24241

Comments

@dxdc
Copy link

dxdc commented Oct 16, 2024

After upgrading to Nomad v1.9.0 from v1.8.4, all client nodes report as "ready", regardless of their actual "ready" status.

Nomad version

Nomad v1.9.0
BuildDate 2024-10-10T07:13:43Z
Revision 7ad3685

Operating system and Environment details

Ubuntu 24.04 LTS

Issue

After upgrading servers from 1.8.4 to 1.9.0, all clients are reporting as "ready" even when they are down.

Reproduction steps

Not sure if it's needed, but this is the client ACL policy:

agent {
  policy = "read"
}

node {
  policy = "write"
}

namespace "*" {
  policy       = "read"
  capabilities = ["submit-job", "dispatch-job", "read-logs", "read-fs", "alloc-exec", "alloc-node-exec"]
}

Expected Result

  • Clients should report as down

Actual Result

  • Clients always report as ready

Nomad Server logs (if appropriate)

Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [WARN]  nomad.heartbeat: node TTL expired: node_id=5de1e8af-e595-6bdf-4221-9f01dea50822
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [ERROR] nomad.heartbeat: update node status failed: error="Permission denied"
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [WARN]  nomad.heartbeat: node TTL expired: node_id=9cb8c3cf-c47c-1a81-7182-c6e94f8f6ab9
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [ERROR] nomad.heartbeat: update node status failed: error="Permission denied"
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [WARN]  nomad.heartbeat: node TTL expired: node_id=afdec3e7-dab0-e5ec-adc6-61eb98456bbf
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [ERROR] nomad.heartbeat: update node status failed: error="Permission denied"
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [WARN]  nomad.heartbeat: node TTL expired: node_id=c347178d-69a2-75a9-c746-d9a546a538fc
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [ERROR] nomad.heartbeat: update node status failed: error="Permission denied"
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [WARN]  nomad.heartbeat: node TTL expired: node_id=f5a09534-7a1c-7a5f-9836-978e375800e7
Oct 16 18:51:53 ip-172-31-35-53 nomad[48126]:     2024-10-16T18:51:53.070Z [ERROR] nomad.heartbeat: update node status failed: error="Permission denied"

Nomad Client logs (if appropriate)

  • N/A. Client logs don't indicate any issues.
@dxdc dxdc added the type/bug label Oct 16, 2024
@dxdc
Copy link
Author

dxdc commented Oct 16, 2024

Btw, this also happens with brand new clients that weren't previously registered.

@tgross
Copy link
Member

tgross commented Oct 16, 2024

Hi @dxdc! Sorry to hear you're seeing trouble. #23838 feels like the obvious culprit here but I'm not sure how off the top of my head. Can you confirm that all your clients are also 1.9? Or at least >1.6.0 (see https://developer.hashicorp.com/nomad/docs/upgrade/upgrade-specific#dropped-support-for-older-clients)?

@dxdc
Copy link
Author

dxdc commented Oct 16, 2024

thanks @tgross! yes. Clients with versions 1.8.4 and 1.9.0 have the same issue.

Is downgrading the server to 1.8.4 an option (and, as simple as replacing the binary)? I'd be happy to test that.

@tgross tgross self-assigned this Oct 17, 2024
@tgross
Copy link
Member

tgross commented Oct 17, 2024

Ok @dxdc I've reproduced this and tracked down the cause. In #23838 we updated the RPC handler we use for heartbeats (Node.Update) to be more strict about requiring node secrets. This works fine for requests from the nodes. But when a node goes down, the server that the node was connected to sends that same RPC message. The server isn't properly authenticating that request. I've got a couple different ideas for how to fix this... I'll have a PR up as soon as possible.

Is downgrading the server to 1.8.4 an option (and, as simple as replacing the binary)?

Once the state store has written Raft logs that an older version doesn't understand, you can't rollback (see Upgrading) without replacing the state. You can get away with that with some patch-version updates but 1.9.0 won't be one of them because we added new Raft keyring entries.

@dxdc
Copy link
Author

dxdc commented Oct 17, 2024

Thx @tgross. I ended up just rebuilding the entire cluster.

Shouldn't release 1.9.0 be yanked while a patch is made? This seems like a serious defect.

@tgross
Copy link
Member

tgross commented Oct 17, 2024

Shouldn't release 1.9.0 be yanked while a patch is made? This seems like a serious defect.

Agreed, and there's another set of fairly serious defects around the docker driver as well. Unfortunately we don't have a facility for "yanking" releases short of pinning issues warning folks about upgrades. Patch should ship fairly soon though.

tgross added a commit that referenced this issue Oct 17, 2024
In #23838 we updated the `Node.Update` RPC handler we use for heartbeats to be
more strict about requiring node secrets. But when a node goes down, it's the
leader that sends the request to mark the node down via `Node.Update` (to
itself), and this request was missing the leader ACL needed to authenticate to
itself.

Add the leader ACL to the request and update the RPC handler test for
disconnected-clients to use ACLs, which would have detected this bug. Also added
a note to the `Authenticate` comment about how that authentication path requires
the leader ACL.

Fixes: #24231
Ref: https://hashicorp.atlassian.net/browse/NET-11384
tgross added a commit that referenced this issue Oct 17, 2024
In #23838 we updated the `Node.Update` RPC handler we use for heartbeats to be
more strict about requiring node secrets. But when a node goes down, it's the
leader that sends the request to mark the node down via `Node.Update` (to
itself), and this request was missing the leader ACL needed to authenticate to
itself.

Add the leader ACL to the request and update the RPC handler test for
disconnected-clients to use ACLs, which would have detected this bug. Also added
a note to the `Authenticate` comment about how that authentication path requires
the leader ACL.

Fixes: #24231
Ref: https://hashicorp.atlassian.net/browse/NET-11384
@tgross tgross added this to the 1.9.1 milestone Oct 17, 2024
@tgross tgross changed the title All clients report as ready, nomad.heartbeat is showing permission denied (1.9.0) missed heartbeats don't mark nodes down Oct 17, 2024
@tgross tgross pinned this issue Oct 17, 2024
@tgross
Copy link
Member

tgross commented Oct 17, 2024

When #24241 merges this issue will get closed. I've retitled and pinned it to make sure it's obvious to passerby this is an important 1.9.0 regression. We're having an internal discussion about when we can ship 1.9.1 but expect that'll be soon.

tgross added a commit that referenced this issue Oct 17, 2024
In #23838 we updated the `Node.Update` RPC handler we use for heartbeats to be
more strict about requiring node secrets. But when a node goes down, it's the
leader that sends the request to mark the node down via `Node.Update` (to
itself), and this request was missing the leader ACL needed to authenticate to
itself.

Add the leader ACL to the request and update the RPC handler test for
disconnected-clients to use ACLs, which would have detected this bug. Also added
a note to the `Authenticate` comment about how that authentication path requires
the leader ACL.

Fixes: #24231
Ref: https://hashicorp.atlassian.net/browse/NET-11384
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging a pull request may close this issue.

2 participants