Fix nil pointer dereference if alloc has nil Job #19972

Merged (3 commits) on Feb 14, 2024

cleroux commented Feb 14, 2024

We encountered the following on one of our production hosts:

Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc
Feb 13 22:11:31 s143 nomad-client[52792]: panic: runtime error: invalid memory address or nil pointer dereference
Feb 13 22:11:31 s143 nomad-client[52792]: [signal SIGSEGV: segmentation violation code=0x1 addr=0xe8 pc=0x1c3a9be]
Feb 13 22:11:31 s143 nomad-client[52792]: goroutine 1 [running]:
Feb 13 22:11:31 s143 nomad-client[52792]: github.com/hashicorp/nomad/nomad/structs.(*Job).LookupTaskGroup(...)
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/hashicorp/nomad/nomad/structs/structs.go:4805
Feb 13 22:11:31 s143 nomad-client[52792]: github.com/hashicorp/nomad/client.(*Client).hasLocalState(0xc000004c00, 0xc001000200)
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/hashicorp/nomad/client/client.go:1309 +0x3e
Feb 13 22:11:31 s143 nomad-client[52792]: github.com/hashicorp/nomad/client.(*Client).restoreState(0xc000004c00)
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/hashicorp/nomad/client/client.go:1202 +0x25e
Feb 13 22:11:31 s143 nomad-client[52792]: github.com/hashicorp/nomad/client.NewClient(0xc000251b80, {0x3536a48?, 0xc0006aa020}, {0x352c420?, 0xc000274a50}, {0x354b660?, 0xc00084ca50}, 0xc?)
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/hashicorp/nomad/client/client.go:560 +0x21be
Feb 13 22:11:31 s143 nomad-client[52792]: github.com/hashicorp/nomad/command/agent.(*Agent).setupClient(0xc000328360)
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/hashicorp/nomad/command/agent/agent.go:1082 +0x2e5
Feb 13 22:11:31 s143 nomad-client[52792]: github.com/hashicorp/nomad/command/agent.NewAgent(0xc001000800, {0x356fa48?, 0xc00061a1e0}, {0x3531800?, 0xc00100c1f8}, 0xc001070ff0)
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/hashicorp/nomad/command/agent/agent.go:152 +0x208
Feb 13 22:11:31 s143 nomad-client[52792]: github.com/hashicorp/nomad/command/agent.(*Command).setupAgent(0xc000ef8c00, 0xc001000800, {0x356fa48, 0xc00061a1e0}, {0x3531800, 0xc00100c1f8}, 0x0?)
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/hashicorp/nomad/command/agent/command.go:568 +0xaa
Feb 13 22:11:31 s143 nomad-client[52792]: github.com/hashicorp/nomad/command/agent.(*Command).Run(0xc000ef8c00, {0xc0001a61a0, 0x4, 0x4})
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/hashicorp/nomad/command/agent/command.go:774 +0x631
Feb 13 22:11:31 s143 nomad-client[52792]: github.com/mitchellh/cli.(*CLI).Run(0xc000e67e00)
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/mitchellh/[email protected]/cli.go:262 +0x5f8
Feb 13 22:11:31 s143 nomad-client[52792]: main.Run({0xc0001a6190, 0x5, 0x5})
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/hashicorp/nomad/main.go:110 +0x28a
Feb 13 22:11:31 s143 nomad-client[52792]: main.main()
Feb 13 22:11:31 s143 nomad-client[52792]: #011github.com/hashicorp/nomad/main.go:80 +0x4e

We were able to resolve the issue by deleting state.db and state.db.backup on that host.

I believe there must have been some corrupt state stored in the DB that somehow decoded to an alloc with a nil Job.
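
To illustrate the failure mode, here is a minimal, self-contained sketch of a nil guard in front of a task group lookup. The types below are simplified stand-ins, not Nomad's real `structs.Job` or `structs.Allocation`, and `hasUsableJob` is a hypothetical helper name; this is not the actual change in this PR, only an example of the kind of check that avoids the panic shown in the trace above.

```go
package main

import "fmt"

// TaskGroup and Job are simplified stand-ins for illustration only.
type TaskGroup struct{ Name string }

type Job struct{ TaskGroups []*TaskGroup }

// LookupTaskGroup reads a field of its receiver, so calling it on a
// nil *Job panics with a nil pointer dereference.
func (j *Job) LookupTaskGroup(name string) *TaskGroup {
	for _, tg := range j.TaskGroups {
		if tg.Name == name {
			return tg
		}
	}
	return nil
}

// Allocation is a simplified stand-in; Job may be nil if the stored
// state was corrupt.
type Allocation struct {
	Job       *Job
	TaskGroup string
}

// hasUsableJob (hypothetical helper) checks for a nil alloc or nil Job
// before dereferencing, and reports whether the task group can be found.
func hasUsableJob(alloc *Allocation) bool {
	if alloc == nil || alloc.Job == nil {
		return false
	}
	return alloc.Job.LookupTaskGroup(alloc.TaskGroup) != nil
}

func main() {
	corrupt := &Allocation{TaskGroup: "web"} // Job is nil
	healthy := &Allocation{
		Job:       &Job{TaskGroups: []*TaskGroup{{Name: "web"}}},
		TaskGroup: "web",
	}
	fmt.Println(hasUsableJob(corrupt)) // false
	fmt.Println(hasUsableJob(healthy)) // true
}
```

With a guard like this, an allocation restored with a nil Job is treated as having no usable state instead of crashing the process during startup.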

hashicorp-cla commented Feb 14, 2024

CLA assistant check
All committers have signed the CLA.

tgross left a comment

Hi @cleroux!

This looks good. Can you run make cl to add a changelog entry for this bug? Something like "client: Fixed a bug where corrupt client state could panic the client"

Review thread on client/client_test.go (outdated, resolved).

cleroux commented Feb 14, 2024

Thanks, @tgross! Requested changes have been made.

tgross left a comment

LGTM!

Once CI is done I'll merge this and get it backported. Thanks @cleroux!

tgross added the backport/1.5.x, backport/1.6.x, and backport/1.7.x labels Feb 14, 2024

tgross commented Feb 14, 2024

The failing storybook test is unrelated and is being worked on. The flaky leadership test in tests-groups (nomad) is known, being tracked, and unrelated to this work.

Labels: backport/1.5.x, backport/1.6.x, backport/1.7.x, theme/client, theme/crash