SIGSEGV on startup of nomad client since 0.5.3 #2256
Comments
We hit the same thing after upgrading Nomad to 0.5.3, without a node drain:
after we did some cleanup (stopped the jobs that were placed on the upgraded node, and also made …
Hey, sorry this happened 👎. To recover, can you delete the client's data_dir and bring it back up?
We will make sure 0.5.4 allows an in-place upgrade path for those who would like to wait!
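A minimal sketch of that recovery path, assuming a systemd-managed agent named `nomad` and a `data_dir` of `/var/lib/nomad` (both are assumptions; use whatever your agent config actually sets):

```sh
# Assumptions: the agent runs under systemd as "nomad" and data_dir = /var/lib/nomad.
sudo systemctl stop nomad        # stop the crashing client
sudo rm -rf /var/lib/nomad/*     # wipe the client's data_dir (local client state is lost)
sudo systemctl start nomad       # client comes back with clean state and re-registers
```

Wiping the data_dir throws away the node's local allocation state, so expect the servers to reschedule whatever was running there.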
Alex, may I conclude that making …
Potentially not. The client keeps some state files in the data_dir that it tries to restore from. In 0.5.3 we introduced new fields in that state file, and it seems the upgrade isn't being handled properly. So I suggest you …
Repro'd in like 30s using Nomad 0.5.2 and 0.5.3 binaries with the example.nomad Redis job. Very embarrassed I let this slip in. Fix coming.
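A rough reconstruction of that repro, not the exact commands used — the binary names, config file, and single-node server+client setup are assumptions:

```sh
# Assumed agent.hcl enables both server and client and sets a persistent data_dir.
./nomad-0.5.2 agent -config agent.hcl &
./nomad-0.5.2 init                        # writes example.nomad (the Redis example job)
./nomad-0.5.2 run example.nomad           # place an allocation on the client
kill %1                                   # stop the 0.5.2 agent, leaving client state on disk
./nomad-0.5.3 agent -config agent.hcl     # 0.5.3 crashes while restoring that state
```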
Hey @schmichael, any idea when the fixed release will be up? EDIT: it's there now!
@schmichael we love you anyway!
@holtwilkins It would have been up sooner, but this was only the second time I've driven a release and was pretty slow at it. Thanks for your patience!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Output from `nomad version`:
Nomad v0.5.3
64-bit, tried both the LXC and non-LXC builds.
Operating system and Environment details
Host: Ubuntu Xenial, running in an LXC container.
Docker for Jobs:
Issue
After updating to 0.5.3, the Nomad agent crashes on startup when started in client mode.
There seems to be some correlation with the presence or absence of jobs on the node (I just upgraded without draining and ended up with a useless cluster). I've attached both types of crashes.
Currently I’m unable to get one of my nodes back up. :( The others for some reason are working again.
Let me know if you need any more intel or if you have any hints on how to resolve this…
Reproduction steps
Start the Nomad agent in client mode on a node that was upgraded to 0.5.3.
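For concreteness, a hedged version of that startup on the upgraded node (the config path is an example):

```sh
# Start the 0.5.3 agent in client mode against the data_dir left behind by the previous version.
nomad agent -config /etc/nomad.d/client.hcl
```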
Nomad Server logs (if appropriate)
n/a, server works fine.
Nomad Client logs (if appropriate)
No jobs, clean Docker:
Jobs present, still running inside Docker