
panic: unexpected fault address 0x0 #1424

Closed
zarnovican opened this issue Jun 29, 2016 · 18 comments
Labels
bug (unexpected problem or unintended behavior) · panic (issue that results in panics from Telegraf)

Comments

@zarnovican
Contributor

Bug report

Telegraf crashed on me with "unexpected fault address 0x0" without any obvious outside factor.

Relevant telegraf.conf:

Output goes to InfluxDB (0.13). There are a handful of standard telegraf plugins, plus one exec plugin in multiple copies.

2016/06/29 07:25:37 Starting Telegraf (version 0.13.1)
2016/06/29 07:25:37 Loaded outputs: influxdb
2016/06/29 07:25:37 Loaded inputs: cpu diskio net system disk kernel mem netstat processes exec swap exec exec

System info:

Telegraf: 0.13.1
OS: Ubuntu Trusty 14.04.4

Steps to reproduce:

Unknown. :(

Additional info:

2016/06/29 07:18:59 Output [influxdb] buffer fullness: 18 / 10000 metrics. Total gathered metrics: 202723. Total dropped metrics: 0.
2016/06/29 07:18:59 Output [influxdb] wrote batch of 18 metrics in 15.790325ms
2016/06/29 07:20:00 Output [influxdb] buffer fullness: 33 / 10000 metrics. Total gathered metrics: 202756. Total dropped metrics: 0.
2016/06/29 07:20:00 Output [influxdb] wrote batch of 33 metrics in 43.972242ms
unexpected fault address 0x0
fatal error: fault
[signal 0xb code=0x80 addr=0x0 pc=0x7f788f]

goroutine 337678 [running]:
runtime.throw(0x12343f8, 0x5)
        /usr/local/go/src/runtime/panic.go:547 +0x90 fp=0xc820370f90 sp=0xc820370f78
runtime.sigpanic()
        /usr/local/go/src/runtime/sigpanic_unix.go:27 +0x2ab fp=0xc820370fe0 sp=0xc820370f90
encoding/json.(*encodeState).string(0xc8206fa370, 0xffffff00ff00, 0xffffff00, 0xffffff00ff00)
        /usr/local/go/src/encoding/json/encode.go:793 +0x7f fp=0xc820371050 sp=0xc820370fe0
encoding/json.stringEncoder(0xc8206fa370, 0xe84ec0, 0xc82075de88, 0x98, 0x0)
        /usr/local/go/src/encoding/json/encode.go:551 +0x44b fp=0xc820371160 sp=0xc820371050
encoding/json.(*structEncoder).encode(0xc82027fb60, 0xc8206fa370, 0x10f2460, 0xc82075de88, 0x99, 0x0)
        /usr/local/go/src/encoding/json/encode.go:587 +0x2c4 fp=0xc820371308 sp=0xc820371160
encoding/json.(*structEncoder).(encoding/json.encode)-fm(0xc8206fa370, 0x10f2460, 0xc82075de88, 0x99, 0x0)
        /usr/local/go/src/encoding/json/encode.go:601 +0x51 fp=0xc820371340 sp=0xc820371308
encoding/json.(*structEncoder).encode(0xc82027fa70, 0xc8206fa370, 0x118fe80, 0xc82075de60, 0x99, 0x0)
        /usr/local/go/src/encoding/json/encode.go:587 +0x2c4 fp=0xc8203714e8 sp=0xc820371340
encoding/json.(*structEncoder).(encoding/json.encode)-fm(0xc8206fa370, 0x118fe80, 0xc82075de60, 0x99, 0xc82075de00)
        /usr/local/go/src/encoding/json/encode.go:601 +0x51 fp=0xc820371520 sp=0xc8203714e8
encoding/json.(*encodeState).reflectValue(0xc8206fa370, 0x118fe80, 0xc82075de60, 0x99)
        /usr/local/go/src/encoding/json/encode.go:301 +0x6b fp=0xc820371550 sp=0xc820371520
encoding/json.(*encodeState).marshal(0xc8206fa370, 0x118fe80, 0xc82075de60, 0x0, 0x0)
        /usr/local/go/src/encoding/json/encode.go:274 +0xa9 fp=0xc820371598 sp=0xc820371550
encoding/json.Marshal(0x118fe80, 0xc82075de60, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/encoding/json/encode.go:139 +0x84 fp=0xc8203715d0 sp=0xc820371598
github.com/shirou/gopsutil/net.ConnectionStat.String(0x200000000, 0x1, 0xc820744540, 0x9, 0x50, 0xffffff00ff00, 0xffffff00, 0x8b87, 0x124e890, 0x9, ...)
        /home/ubuntu/telegraf-build/src/github.com/shirou/gopsutil/net/net.go:85 +0x82 fp=0xc820371680 sp=0xc8203715d0
github.com/shirou/gopsutil/net.ConnectionsPid(0x12323b8, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/shirou/gopsutil/net/net_linux.go:340 +0x987 fp=0xc820371a88 sp=0xc820371680
github.com/shirou/gopsutil/net.Connections(0x12323b8, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/shirou/gopsutil/net/net_linux.go:281 +0x50 fp=0xc820371ad0 sp=0xc820371a88
github.com/influxdata/telegraf/plugins/inputs/system.(*systemPS).NetConnections(0x1c78550, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/system/ps.go:114 +0x49 fp=0xc820371b10 sp=0xc820371ad0
github.com/influxdata/telegraf/plugins/inputs/system.(*NetStats).Gather(0xc8201ca410, 0x7f953014c228, 0xc8205fdc20, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/system/netstat.go:26 +0x7b fp=0xc820371f20 sp=0xc820371b10
github.com/influxdata/telegraf/agent.gatherWithTimeout.func1(0xc8205615c0, 0xc8201b1f80, 0xc8205fdc20)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:174 +0x73 fp=0xc820371f88 sp=0xc820371f20
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc820371f90 sp=0xc820371f88
created by github.com/influxdata/telegraf/agent.gatherWithTimeout
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:175 +0xe0

[omitting remaining 30+ goroutines]

We are running a mix of 0.13.0 and 0.13.1 on multiple AWS instances with very similar configurations. This is the first such case of a Telegraf panic.
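
For context on the trace above: the top frames show the netstat input calling gopsutil's net.ConnectionsPid, whose ConnectionStat.String method marshals the stat struct with encoding/json, and the fault hits inside the JSON string encoder. A rough, self-contained sketch of that call path (based only on the frames above, not a verbatim copy of the gopsutil source):

package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed-down stand-in for gopsutil's net.ConnectionStat; the real struct
// has more fields. Status is a string field like the one being encoded when
// the fault above hit encoding/json's string encoder.
type ConnectionStat struct {
	Fd     uint32 `json:"fd"`
	Status string `json:"status"`
}

// String marshals the struct to JSON, matching the
// net.ConnectionStat.String -> json.Marshal frames in the trace.
func (c ConnectionStat) String() string {
	s, _ := json.Marshal(c)
	return string(s)
}

func main() {
	fmt.Println(ConnectionStat{Fd: 3, Status: "ESTABLISHED"})
}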

sparrc added the bug label on Jun 29, 2016
@zarnovican
Contributor Author

Hi,

We had two more cases of this "unexpected fault address 0x0" crash:

2016/07/17 09:34:57 Output [influxdb] buffer fullness: 18 / 10000 metrics. Total gathered metrics: 468654. Total dropped metrics: 0.
2016/07/17 09:34:57 Output [influxdb] wrote batch of 18 metrics in 7.272336ms
unexpected fault address 0x0
fatal error: fault
[signal 0xb code=0x80 addr=0x0 pc=0x4aa5de]

goroutine 776169 [running]:
runtime.throw(0x12343f8, 0x5)
        /usr/local/go/src/runtime/panic.go:547 +0x90 fp=0xc82059d6f0 sp=0xc82059d6d8
runtime.sigpanic()
        /usr/local/go/src/runtime/sigpanic_unix.go:27 +0x2ab fp=0xc82059d740 sp=0xc82059d6f0
github.com/influxdata/telegraf/agent.gatherWithTimeout.func1(0x0, 0x6f74732d74726174, 0x6e6f6d6561642d70)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:174 +0x3e fp=0xc82059d7a8 sp=0xc82059d740
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc82059d7b0 sp=0xc82059d7a8
created by github.com/influxdata/telegraf/agent.gatherWithTimeout
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:175 +0xe0

goroutine 1 [semacquire, 26050 minutes]:
sync.runtime_Semacquire(0xc82000dffc)
        /usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*WaitGroup).Wait(0xc82000dff0)
        /usr/local/go/src/sync/waitgroup.go:127 +0xb4
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc82001e0e8, 0xc82004cde0, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:369 +0xa36
main.main()
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:274 +0x2632

goroutine 17 [syscall, 26050 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1
...
2016/07/16 02:13:09 Output [influxdb] buffer fullness: 328 / 10000 metrics. Total gathered metrics: 7439293. Total dropped metrics: 0.
2016/07/16 02:13:09 Output [influxdb] wrote batch of 328 metrics in 21.538751ms
unexpected fault address 0x0
fatal error: fault
[signal 0xb code=0x80 addr=0x0 pc=0x4aa25e]

goroutine 1368905 [running]:
runtime.throw(0x12891c8, 0x5)
        /usr/local/go/src/runtime/panic.go:547 +0x90 fp=0xc82065e6f0 sp=0xc82065e6d8
runtime.sigpanic()
        /usr/local/go/src/runtime/sigpanic_unix.go:27 +0x2ab fp=0xc82065e740 sp=0xc82065e6f0
github.com/influxdata/telegraf/agent.gatherWithTimeout.func1(0x0, 0x7469725762757322, 0x203a22706f724465)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:163 +0x3e fp=0xc82065e7a8 sp=0xc82065e740
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc82065e7b0 sp=0xc82065e7a8
created by github.com/influxdata/telegraf/agent.gatherWithTimeout
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:164 +0xe0

goroutine 1 [semacquire, 25348 minutes]:
sync.runtime_Semacquire(0xc8201c563c)
        /usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*WaitGroup).Wait(0xc8201c5630)
        /usr/local/go/src/sync/waitgroup.go:127 +0xb4
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc8200240d0, 0xc820050f00, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:339 +0xa0b
main.main()
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:274 +0x2624

goroutine 17 [syscall, 25348 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1
...

The former was on the same box as the first report; the latter was on a different box. Assuming it is the same issue, this rules out hardware problems.

One host is running 0.13.1, the other 1.0.0-beta2. I haven't seen this type of crash on 0.13.0, so I suspect it was introduced in 0.13.1.

@sparrc
Contributor

sparrc commented Jul 18, 2016

@zarnovican do you see any other messages in your logs about errors? And are you able to provide your configuration and any example metrics generated by exec?

@zarnovican
Contributor Author

Hi, @sparrc

do you see any other messages in your logs about errors?

No, there is nothing except the exception and plenty of stack traces. I don't know enough Go to tell which one caused it, so I always report the first one. Here is the full output:
https://gist.github.com/zarnovican/69aff702be10c3e0f609e55487a64cff

and are you able to provide your configuration

The config is split into several files; here is the concatenated config:

[global_tags]
  # pass these EC2 tags to InfluxDB
  Name = "hoover"
  group = "hoover"

[agent]
  hostname = "xxxxxxxx"
  interval = "60s"
  flush_interval = "60s"
  flush_jitter = "10s"

[[outputs.influxdb]]
  urls = ["http://XXXXXXXX:8086"]
  database = "telegraf"
  precision = "s"

[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.net]]
  interfaces = ["eth0"]

[[inputs.netstat]]

[[inputs.processes]]

[[inputs.exec]]
  commands = ["/usr/lib/telegraf/scripts/sumproc.sh -f /var/run/telegraf/telegraf.pid"]
  data_format = "influx"
  [inputs.exec.tags]
    process = "telegraf"

[[inputs.swap]]

[[inputs.system]]

[[inputs.exec]]
  commands = ["/usr/lib/telegraf/scripts/sumproc.sh -f /var/run/icinga2/icinga2.pid"]
  data_format = "influx"
  [inputs.exec.tags]
    process = "icinga"

[[inputs.exec]]
  commands = ["/usr/lib/telegraf/scripts/sumproc.sh -p uwsgi"]
  data_format = "influx"
  [inputs.exec.tags]
    process = "uwsgi"

BTW, I had to run telegraf as root to be able to read /proc/<pid>/smaps.

any example metrics generated by exec?

$ sudo /usr/lib/telegraf/scripts/sumproc.sh -p uwsgi
sumproc cpu_user=133.10,cpu_system=73.83,nprocess=13i,mem_pss=344880128i,mem_rss=373559296i,io_read=6021120i,io_write=1495867392i,io_syscr=99905630i,io_syscw=5394789i,fd_count=189i

The sumproc.sh script is on https://gist.github.com/zarnovican/9840d171d9a9c6de1766

@sparrc
Contributor

sparrc commented Jul 19, 2016

It's possible that this is related to #1432, which has been fixed in version 1.0.0-beta3. Do you think you could try that build and see if it resolves the issue?

@zarnovican
Contributor Author

@sparrc Both hosts were updated to 1.0.0-beta3. I'll comment on this ticket if it crashes again.

@sparrc
Contributor

sparrc commented Jul 19, 2016

thank you @zarnovican!

@zarnovican
Contributor Author

It crashed again (on 1.0.0-beta3). This time there was only one stack trace in the log, and the message is different, but it may still be the same problem.

2016/07/27 13:22:07 Output [influxdb] buffer fullness: 387 / 10000 metrics. Total gathered metrics: 4279251. Total dropped metrics: 0.
2016/07/27 13:22:07 Output [influxdb] wrote batch of 387 metrics in 24.067969ms
2016/07/27 13:23:02 ERROR in input [redis]: Errors encountered: [Unable to connect to redis server 'tcp:': dial tcp: lookup tcp on 172.16.0.23:53: no such host]
2016/07/27 13:23:08 Output [influxdb] buffer fullness: 384 / 10000 metrics. Total gathered metrics: 4279635. Total dropped metrics: 0.
2016/07/27 13:23:08 Output [influxdb] wrote batch of 384 metrics in 20.716519ms
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x0]

goroutine 678530 [running]:
panic(0x12a82e0, 0xc820010110)
    /usr/local/go/src/runtime/panic.go:481 +0x3e6
created by github.com/influxdata/telegraf/agent.gatherWithTimeout
    /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:164 +0xe0
2016/07/27 14:26:36 Starting Telegraf (version 1.0.0-beta3)
2016/07/27 14:26:36 Loaded outputs: influxdb
2016/07/27 14:26:36 Loaded inputs: net processes exec diskio mem kernel netstat swap system cpu disk exec exec exec influxdb postgresql exec exec redis exec
2016/07/27 14:26:36 Tags enabled: Name=monitoring group=monitoring host=i-d774125f
2016/07/27 14:26:36 Agent Config: Interval:1m0s, Debug:false, Quiet:false, Hostname:"i-d774125f", Flush Interval:1m0s 
2016/07/27 14:27:01 ERROR in input [redis]: Errors encountered: [Unable to connect to redis server 'tcp:': dial tcp: lookup tcp on 172.16.0.23:53: no such host]

(the error from the redis input is unrelated, and it is already fixed AFAIK)

@sparrc Is there anything that could be done to help you pinpoint the root cause? Like a core dump, or some debug mode?
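
For reference, the Go runtime does have a knob for core dumps. A minimal sketch, assuming a Go 1.6+ runtime (the thread does not confirm this was actually used): GOTRACEBACK=crash makes fatal runtime faults and unrecovered panics abort the process so the kernel can write a core file, subject to `ulimit -c`.

package main

import (
	"fmt"
	"runtime/debug"
)

// Minimal sketch, assuming a Go 1.6+ runtime: the traceback level can be set
// programmatically instead of via the GOTRACEBACK environment variable.
func main() {
	debug.SetTraceback("crash") // equivalent to starting the process with GOTRACEBACK=crash
	fmt.Println("fatal runtime errors will now abort and dump core (ulimit -c permitting)")
}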

@sparrc
Contributor

sparrc commented Jul 27, 2016

That is a very peculiar panic... If I could provide you with a binary with the race detector built in, would you be able to use it?

It is important to note that this binary would perform a bit slower.
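
For illustration, this toy example (not Telegraf code) shows the kind of bug a -race build reports at runtime: two goroutines writing the same variable without synchronization produce a "WARNING: DATA RACE" report in the log.

package main

import "fmt"

// Toy example of what a binary built with `go build -race` catches at
// runtime: an unsynchronized write from a second goroutine races with a
// concurrent write from main.
func main() {
	counter := 0
	done := make(chan struct{})
	go func() {
		counter++ // write from a second goroutine
		close(done)
	}()
	counter++ // concurrent write from main
	<-done
	fmt.Println(counter)
}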

@zarnovican
Contributor Author

If I could provide you with a binary with the race-detector builtin, would you be able to use it?

@sparrc no problem. Alternatively, I can build it myself if you give me instructions.

@sparrc
Contributor

sparrc commented Jul 28, 2016

@zarnovican do you have a Go environment? If so, it's just these three steps:

go get github.com/sparrc/gdm
gdm restore
go install -race -ldflags "-X main.version=`git describe --tags`" ./...

@zarnovican
Contributor Author

@sparrc "race" build is deployed on two hosts. It seems it is working..

$ ls -l /usr/bin/telegraf.orig /usr/bin/telegraf
-rwxr-xr-x 1 root root 45101928 Jul 28 08:36 /usr/bin/telegraf
-rwxrwxr-x 1 root root 39480269 Jul 18 20:05 /usr/bin/telegraf.orig
$ tail -f /var/log/telegraf/telegraf.log
2016/07/28 08:48:03 Output [influxdb] buffer fullness: 17 / 10000 metrics. Total gathered metrics: 231432. Total dropped metrics: 0.
2016/07/28 08:48:03 Output [influxdb] wrote batch of 17 metrics in 70.791607ms
2016/07/28 08:48:25 Starting Telegraf (version 1.0.0-beta3-21-g841729c)
2016/07/28 08:48:25 Loaded outputs: influxdb
2016/07/28 08:48:25 Loaded inputs: net exec system cpu diskio kernel mem netstat processes swap disk exec exec
2016/07/28 08:48:25 Tags enabled: Name=hoover group=hoover host=i-56644bdc
2016/07/28 08:48:25 Agent Config: Interval:1m0s, Debug:false, Quiet:false, Hostname:"i-56644bdc", Flush Interval:1m0s 
2016/07/28 08:50:07 Output [influxdb] buffer fullness: 29 / 10000 metrics. Total gathered metrics: 29. Total dropped metrics: 0.
2016/07/28 08:50:07 Output [influxdb] wrote batch of 29 metrics in 21.203377ms

BTW, I built it with Go 1.5.x, if that makes any difference (it probably does).

$ rpm -q golang
golang-1.5.3-1.fc23.x86_64

@zarnovican
Contributor Author

The build with -race is consuming a lot of memory (and growing).

[screenshot: memory usage of the -race build growing steadily over time]

A temporary workaround is to restart it from time to time, but I will not be able to leave it in production like this for too long.

@steverweber

steverweber commented Aug 10, 2016

@butitsnotme's crashes are perhaps related. @zarnovican, are you using exec?
Edit: never mind, I clearly see you are using exec.

@zarnovican
Contributor Author

FYI

The issue appeared in Telegraf 1.0.0 as well.

2016/09/29 21:07:07 Output [influxdb] buffer fullness: 221 / 10000 metrics. Total gathered metrics: 2511678. Total dropped metrics: 0.
2016/09/29 21:07:08 Output [influxdb] wrote batch of 221 metrics in 95.903655ms
unexpected fault address 0x0
fatal error: fault
[signal 0x7 code=0x80 addr=0x0 pc=0x40dcb5]

goroutine 1668450 [running]:
runtime.throw(0x14cbe48, 0x5)
        /usr/local/go/src/runtime/panic.go:547 +0x90 fp=0xc82090b0c8 sp=0xc82090b0b0
runtime.sigpanic()
        /usr/local/go/src/runtime/sigpanic_unix.go:21 +0x10c fp=0xc82090b118 sp=0xc82090b0c8
runtime.mapassign1(0x1056440, 0xc8206bef30, 0xc82084b720, 0xc82084b700)
        /usr/local/go/src/runtime/hashmap.go:472 +0x235 fp=0xc82090b1c0 sp=0xc82090b118
reflect.mapassign(0x1056440, 0xc8206bef30, 0xc82084b720, 0xc82084b700)
        /usr/local/go/src/runtime/hashmap.go:1003 +0x3f fp=0xc82090b1e8 sp=0xc82090b1c0
reflect.Value.SetMapIndex(0x1056440, 0xc820862050, 0x195, 0x10581c0, 0xc82084b720, 0x98, 0x10581c0, 0xc82084b700, 0x198)
        /usr/local/go/src/reflect/value.go:1476 +0x259 fp=0xc82090b258 sp=0xc82090b1e8
encoding/json.(*decodeState).object(0xc820abba28, 0x1056440, 0xc820862050, 0x195)
        /usr/local/go/src/encoding/json/decode.go:691 +0xe14 fp=0xc82090b600 sp=0xc82090b258
...

Full log https://gist.github.com/zarnovican/cfa366d167cf0f2c537b55d61c1c0bfb#file-crash-2016-09-29

@sparrc
Contributor

sparrc commented Oct 4, 2016

@zarnovican it looks like the panic this time occurred in the influxdb input plugin. I took a look at the code where it panicked, and the only thing I can imagine happening here is memory corruption on your machine.

The reason is that the panic occurs during a JSON decode into an interface, but the interface is defined immediately before it (there is no risk of a race condition), and the JSON itself is checked as being valid as well.

Do you have any other thoughts on this? Have you seen any other odd memory behavior on these machines? Are you running any sort of peculiar workload on them?
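
A minimal sketch of the pattern described above, with hypothetical names rather than the actual plugin source: the target value is declared right before the decode, so no other goroutine can race on it, and malformed JSON is rejected by Unmarshal before any map assignment (the reflect.Value.SetMapIndex frame in the trace) takes place.

package main

import (
	"encoding/json"
	"fmt"
)

// parseStats is a hypothetical stand-in for the decode pattern described
// above: the target map is declared immediately before the decode, so
// nothing else can touch it concurrently, and invalid JSON makes Unmarshal
// return an error instead of assigning into the map.
func parseStats(body []byte) (map[string]interface{}, error) {
	var stats map[string]interface{}
	if err := json.Unmarshal(body, &stats); err != nil {
		return nil, err
	}
	return stats, nil
}

func main() {
	stats, err := parseStats([]byte(`{"heap_inuse": 123456}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(stats["heap_inuse"])
}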

@zarnovican
Contributor Author

@sparrc thanks for your time looking into this.

only thing I can imagine happening here is memory corruption on your machine

As I mentioned before, this has happened on two machines that were running 0.13.1 at the time. It is possible that AWS has a memory issue on both of them.

Here is a full log from the other machine (we call it hoover) back from July: https://gist.github.com/zarnovican/cfa366d167cf0f2c537b55d61c1c0bfb#file-hoover-crash-2016-07-17

So far, it has happened only once on 1.0.0.

Do you have any other thoughts on this? Have you seen any other odd memory behavior on these machines? are you running any sort of peculiar workload on these machines?

The host where it is happening more frequently runs Icinga2+Postgres, InfluxDB+Grafana, and Sentry+Postgres+Redis; all services are related to monitoring. It is an AWS m1.small with only one CPU core, so it is utilized quite heavily. We don't have any memory-related issues with any other process on the same host.

What those two machines have in common is that they are both paravirtual (both m1.small). I will update Telegraf on all my hosts to see if the crash appears anywhere else.

sparrc added the panic label on Oct 6, 2016
@timhallinflux
Contributor

@zarnovican -- are you still experiencing this issue with either the 1.1 or 1.2 editions of the stack? Please let us know.

@zarnovican
Contributor Author

@timhallinflux no, I haven't seen it for a couple of months, I believe.

On the other hand, I'm restarting telegraf daily now (for unrelated reasons), which would work around the problem anyway. Closing.
