scaling: fix state store corruption bug for job scaling events #23673

tgross · 2024-07-23T19:41:56Z

When updating a `JobScalingEvent`, the state store function did not copy the existing object before mutating it. This corrupts the state store because it modifies the leaf node without committing it in a transaction. It can also cause the Nomad server to crash with a "fatal error: concurrent map read and map write" if its `ScalingEvents` map is read via the `ScaleStatus` RPC at the same time as it's being written.

This changeset also removes some mostly-unused public methods on the struct that dangerously encourage you to mutate it outside of a copy.

Ref: https://hashicorp.atlassian.net/browse/NET-10529
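For readers unfamiliar with the failure mode, the copy-before-mutate pattern described above can be sketched as follows. This is a minimal illustration, not the code from this changeset: the `store` type, the `upsertScalingEvent` helper, and the simplified `JobScalingEvents` fields are all assumptions standing in for the real state store and structs, and the sketch runs on a single goroutine rather than inside a memdb transaction.

```go
package main

import "fmt"

// JobScalingEvents is a simplified, hypothetical stand-in for the object kept
// in the state store. The field names and types are illustrative assumptions.
type JobScalingEvents struct {
	JobID         string
	ScalingEvents map[string][]string // task group name -> event messages
	ModifyIndex   uint64
}

// Copy returns a deep copy, so the caller can mutate the result without
// touching the object that readers may already hold.
func (j *JobScalingEvents) Copy() *JobScalingEvents {
	if j == nil {
		return nil
	}
	c := &JobScalingEvents{
		JobID:         j.JobID,
		ModifyIndex:   j.ModifyIndex,
		ScalingEvents: make(map[string][]string, len(j.ScalingEvents)),
	}
	for group, events := range j.ScalingEvents {
		c.ScalingEvents[group] = append([]string(nil), events...)
	}
	return c
}

// store is a hypothetical stand-in for a state store table. Readers receive
// the stored pointer directly, which is why stored objects must never be
// mutated in place.
type store struct {
	events map[string]*JobScalingEvents
}

// upsertScalingEvent copies the existing object, mutates only the copy, and
// then swaps the copy in. The previously stored object is left untouched.
func (s *store) upsertScalingEvent(jobID, group, msg string, index uint64) {
	var updated *JobScalingEvents
	if existing, ok := s.events[jobID]; ok {
		updated = existing.Copy()
	} else {
		updated = &JobScalingEvents{
			JobID:         jobID,
			ScalingEvents: map[string][]string{},
		}
	}
	updated.ScalingEvents[group] = append(updated.ScalingEvents[group], msg)
	updated.ModifyIndex = index
	s.events[jobID] = updated
}

func main() {
	s := &store{events: map[string]*JobScalingEvents{}}
	s.upsertScalingEvent("web", "frontend", "scaled to 3", 10)

	// snapshot plays the role of an object handed to a reader such as an RPC.
	snapshot := s.events["web"]
	s.upsertScalingEvent("web", "frontend", "scaled to 5", 11)

	fmt.Println(len(snapshot.ScalingEvents["frontend"]),
		len(s.events["web"].ScalingEvents["frontend"])) // prints "1 2"
}
```

Because the second update writes to a copy, the reader's snapshot still sees one event and its map is never written to after it was published, which is the property the in-place mutation violated.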

jrasell

LGTM, thanks @tgross!

nomad/structs/structs.go

tgross force-pushed the b-scaling-event-state-store branch from 67b88cd to b112548 July 23, 2024 19:42

tgross added the theme/crash, type/bug, backport/ent/1.6.x+ent, backport/ent/1.7.x+ent, and backport/1.8.x labels Jul 23, 2024

tgross added this to the 1.8.3 milestone Jul 23, 2024

tgross added the theme/autoscaling label Jul 23, 2024

vercel bot deployed to Preview – nomad-ui July 23, 2024 19:44

tgross marked this pull request as ready for review July 23, 2024 20:00

tgross requested review from jrasell and shoenig July 23, 2024 20:00

schmichael approved these changes Jul 23, 2024

jrasell approved these changes Jul 24, 2024

nomad/structs/structs.go (outdated, resolved)

address comments on code review

4d0ec01

vercel bot deployed to Preview – nomad-ui July 24, 2024 13:01

tgross merged commit 92d216f into main Jul 24, 2024
19 checks passed

tgross deleted the b-scaling-event-state-store branch July 24, 2024 13:18

hc-github-team-nomad-core mentioned this pull request Jul 24, 2024

Backport of scaling: fix state store corruption bug for job scaling events into release/1.8.x #23678

Merged

tgross mentioned this pull request Jul 24, 2024

wrap memdb methods to return Copy instead of any #23682

Open

tgross mentioned this pull request Aug 7, 2024

SignClaims : invalid memory address or nil pointer dereference #23758

Closed
