Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scaling: fix state store corruption bug for job scaling events #23673

Merged
merged 2 commits into from
Jul 24, 2024

Conversation

tgross
Copy link
Member

@tgross tgross commented Jul 23, 2024

When updating a JobScalingEvent, the state store function did not copy the existing object before mutating it. This corrupts the state store because it modifies the leaf node without committing it in a transaction. It can also cause the Nomad server to crash with a "fatal error: concurrent map read and map write" if its ScalingEvents map is read via the ScaleStatus RPC at the same time as it's being written.

This changeset also removes some mostly-unused public methods on the struct that dangerously encourage you to mutate it outside of a copy.

Ref: https://hashicorp.atlassian.net/browse/NET-10529

When updating a `JobScalingEvent`, the state store function did not copy the
existing object before mutating it. This corrupts the state store because it
modifies the leaf node without committing it in a transaction. It can also cause
the Nomad server to crash with a "fatal error: concurrent map read and map
write" if its `ScalingEvents` map is read via the `ScaleStatus` RPC at the same
time as it's being written.

This changeset also removes some mostly-unused public methods on the struct that
dangerously encourage you to mutate it outside of a copy.

Ref: https://hashicorp.atlassian.net/browse/NET-10529
@tgross tgross force-pushed the b-scaling-event-state-store branch from 67b88cd to b112548 Compare July 23, 2024 19:42
@tgross tgross added theme/crash type/bug backport/ent/1.6.x+ent Changes are backported to 1.6.x+ent backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/1.8.x backport to 1.8.x release line labels Jul 23, 2024
@tgross tgross added this to the 1.8.3 milestone Jul 23, 2024
@tgross tgross added the theme/autoscaling Issues related to supporting autoscaling label Jul 23, 2024
@tgross tgross marked this pull request as ready for review July 23, 2024 20:00
@tgross tgross requested review from jrasell and shoenig July 23, 2024 20:00
Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @tgross!

nomad/structs/structs.go Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/ent/1.6.x+ent Changes are backported to 1.6.x+ent backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/1.8.x backport to 1.8.x release line theme/autoscaling Issues related to supporting autoscaling theme/crash type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants