[0.10.1-1] crash on startup without error possibly after disk full #5889

Closed
pbooth opened this issue Mar 2, 2016 · 9 comments · Fixed by #6082


pbooth commented Mar 2, 2016

Hello,

I have a 0.10.1-1 InfluxDB instance that is failing to start up. I am almost certain that the host had a disk-full situation sometime yesterday. This is what I see logged:

2016/03/02 10:02:21 InfluxDB starting, version 0.10.1, branch HEAD, commit df902a4b077bb270984303b8e4f8a320e3954b40, built 2016-02-18T20:44:27.807242
2016/03/02 10:02:21 Go version go1.4.3, GOMAXPROCS set to 16
2016/03/02 10:02:21 Using configuration at: /etc/influxdb/influxdb.conf
[meta] 2016/03/02 10:02:21 Starting meta service
[meta] 2016/03/02 10:02:21 Listening on HTTP: 127.0.0.1:8091
[metastore] 2016/03/02 10:02:21 Using data dir: /data/fusionio/fio01/apps/tpeng/influxdb/meta
[metastore] 2016/03/02 10:02:21 Node at localhost:8087 [Follower]
[metastore] 2016/03/02 10:02:22 Node at localhost:8087 [Leader]. peers=[localhost:8087]
[meta] 2016/03/02 10:02:27 127.0.0.1 - - [02/Mar/2016:10:02:27 -0500] GET /?index=0 HTTP/1.1 200 68129 - Go 1.1 package http c771bdeb-e087-11e5-8001-000000000000 40.044831ms
[store] 2016/03/02 10:02:27 Using data dir: /data/fusionio/fio01/apps/tpeng/influxdb/data
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL writing to /data/fusionio/fio01/apps/tpeng/influxdb/wal/_internal/monitor/4909
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4909/000000001-000000001.tsm (#0) opened in 82.923µs
[cacheloader] 2016/03/02 10:02:27 reading file /data/fusionio/fio01/apps/tpeng/influxdb/wal/_internal/monitor/4909/_00004.wal, size 0
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL writing to /data/fusionio/fio01/apps/tpeng/influxdb/wal/_internal/monitor/4921
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4921/000000001-000000001.tsm (#0) opened in 63.209µs
[cacheloader] 2016/03/02 10:02:27 reading file /data/fusionio/fio01/apps/tpeng/influxdb/wal/_internal/monitor/4921/_00005.wal, size 0
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL writing to /data/fusionio/fio01/apps/tpeng/influxdb/wal/_internal/monitor/4928
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000001-000000001.tsm (#0) opened in 34.36µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000002-000000001.tsm (#1) opened in 44.974µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000004-000000001.tsm (#3) opened in 11.674µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000003-000000001.tsm (#2) opened in 9.611µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000005-000000001.tsm (#4) opened in 12.349µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000007-000000001.tsm (#6) opened in 26.804µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000011-000000001.tsm (#10) opened in 10.482µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000016-000000001.tsm (#15) opened in 13.799µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000017-000000001.tsm (#16) opened in 9.799µs
...
...
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000013-000000001.tsm (#12) opened in 11.964µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000063-000000001.tsm (#62) opened in 10.446µs
run: open server: open tsdb store: failed to open shard 4928: open engine: error opening memory map for file /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000001-000000001.tsm: init: error reading magic number of file: EOF
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000091-000000001.tsm (#90) opened in 9.914µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000094-000000001.tsm (#93) opened in 11.045µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000040-000000001.tsm (#39) opened in 16.799µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000092-000000001.tsm (#91) opened in 18.218µs

and then it dies


jwilder commented Mar 2, 2016

@pbooth Dies as in the process is killed and nothing in the logs? Anything in your syslog? Or is there more in the logs not shown?

Only thing I see is:

run: open server: open tsdb store: failed to open shard 4928: open engine: error opening memory map for file /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000001-000000001.tsm: init: error reading magic number of file: EOF

Which looks like the file is truncated. It's odd because there is a log line earlier that read the same file just fine:

[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000001-000000001.tsm (#0) opened in 34.36µs

Once that file is written, it is never modified so I'm not sure how that would have happened. It looks like your filesystem (or something else) somehow corrupted that file entry or is not returning the data for the file when requested. Also, the file naming does not look like these TSM files were created by influxdb.

Were they converted from some other format?
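
For reference, here is a minimal Go sketch (not the actual tsm1 reader; the magic value is just a placeholder for this sketch) of how a fixed-size magic-number check at open turns a zero-length file into the "error reading magic number of file: EOF" seen above:

```go
// Illustrative only — not the InfluxDB tsm1 code.
package main

import (
    "encoding/binary"
    "fmt"
    "io"
    "os"
)

// magic is a placeholder 4-byte file signature for this sketch.
const magic uint32 = 0x16D116D1

func checkMagic(path string) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    var hdr [4]byte
    // io.ReadFull returns io.EOF for an empty file (and io.ErrUnexpectedEOF
    // for a short read), which is how the startup error above surfaces.
    if _, err := io.ReadFull(f, hdr[:]); err != nil {
        return fmt.Errorf("init: error reading magic number of file: %v", err)
    }
    if got := binary.BigEndian.Uint32(hdr[:]); got != magic {
        return fmt.Errorf("init: invalid magic number %#08x", got)
    }
    return nil
}

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: checkmagic <file.tsm>")
        os.Exit(2)
    }
    if err := checkMagic(os.Args[1]); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("magic number OK")
}
```

Running something like this against the file named in the error would show whether it is simply zero-length or truncated rather than corrupted in place.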


eoltean commented Mar 21, 2016

@jwilder I'm running into the same problem as pbooth on my local machine, where I know the disk never became full and I never modified or converted the files; they came from a fresh install of InfluxDB 0.10.1-1.

sstarcher commented Mar 21, 2016

Looks like I may be having a similar issue. We had some networking problems on our servers and InfluxDB is now angry. The disk is only 5% full.

[filestore] 2016/03/21 17:32:43 /data/db/_internal/monitor/216/000000004-000000001.tsm (#1) opened in 35.983µs
run: open server: open tsdb store: failed to open shard 216: open engine: error opening memory map for file /data/db/_internal/monitor/216/000000004-000000001.tsm: init: error reading magic number of file: EOF

It looks like /data/db/_internal/monitor has folders for 1-7 and certainly not 216.

jwilder added this to the 0.11.0 milestone Mar 21, 2016

sstarcher commented Mar 21, 2016

After a crash and restarting it, I get the same issue. I'm going to revert my data mount to a backup; I'll take a snapshot of the bad data in case you need it for debugging.

[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/212/000000008-000000001.tsm (#2) opened in 166.38µs
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/212/000000004-000000003.tsm (#0) opened in 350.362µs
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/212/000000007-000000002.tsm (#1) opened in 441.323µs
[cacheloader] 2016/03/21 17:38:16 reading file /data/db/wal/_internal/monitor/212/_00053.wal, size 0
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL writing to /data/db/wal/_internal/monitor/213
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/213/000000008-000000001.tsm (#1) opened in 132.239µs
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/213/000000007-000000003.tsm (#0) opened in 424.018µs
[cacheloader] 2016/03/21 17:38:16 reading file /data/db/wal/_internal/monitor/213/_00043.wal, size 0
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL writing to /data/db/wal/_internal/monitor/214
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/214/000000008-000000001.tsm (#1) opened in 121.161µs
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/214/000000007-000000003.tsm (#0) opened in 282.746µs
[cacheloader] 2016/03/21 17:38:16 reading file /data/db/wal/_internal/monitor/214/_00039.wal, size 0
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL writing to /data/db/wal/_internal/monitor/215
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/215/000000008-000000001.tsm (#1) opened in 125.848µs
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/215/000000007-000000003.tsm (#0) opened in 519.078µs
[cacheloader] 2016/03/21 17:38:16 reading file /data/db/wal/_internal/monitor/215/_00035.wal, size 0
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL writing to /data/db/wal/_internal/monitor/216
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/216/000000004-000000001.tsm (#1) opened in 60.945µs
run: open server: open tsdb store: failed to open shard 216: open engine: error opening memory map for file /data/db/_internal/monitor/216/000000004-000000001.tsm: init: error reading magic number of file: EOF


jwilder commented Mar 21, 2016

@sstarcher What filesystem are you using? Can you describe your hardware setup?

sstarcher commented Mar 21, 2016

  • AWS
  • OS: CoreOS
  • InfluxDB Version - v0.10.2
  • InfluxDB running in Kubernetes with the data backed by EBS


pbooth commented Mar 21, 2016

Mine is a physical dual Sandy Bridge server running CentOS 6.3/6.4 with Fusion-io drives.



ukaj commented Mar 21, 2016

Probably the same issue with InfluxDB 0.10.1; it happened after running out of disk space:

[http] 2016/03/21 15:00:30 10.x.x.x - - [21/Mar/2016:15:00:30 +0000] POST /write?consistency=&db=telegraf&precision=s&rp=default HTTP/1.1 204 0 - InfluxDBClient a75ae98e-ef75-11e5-a818-000000000000 8.584798ms
[write] 2016/03/21 15:00:30 write failed for shard 14 on node 1: engine: error writing WAL entry: write /data/wal/telegraf/default/14/_00120.wal: no space left on device
time="2016-03-21T16:00:30.824890486+01:00" level=error msg="Failed to log msg \"\\tat org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1450)\" for logger json-file: write /var/lib/docker/containers/910ab5
[http] 2016/03/21 15:29:40 10.x.x.x - - [21/Mar/2016:15:29:40 +0000] POST /write?consistency=&db=telegraf&precision=s&rp=default HTTP/1.1 500 87 - InfluxDBClient ba6a1713-ef79-11e5-8453-000000000000 60.788321ms
[write] 2016/03/21 15:29:40 write failed for shard 14 on node 1: engine: cache maximum memory size exceeded
[http] 2016/03/21 15:29:40 10.x.x.x - - [21/Mar/2016:15:29:40 +0000] POST /write?consistency=&db=telegraf&precision=s&rp=default HTTP/1.1 500 87 - InfluxDBClient ba6d2580-ef79-11e5-8457-000000000000 60.336608ms
[write] 2016/03/21 15:29:40 write failed for shard 14 on node 1: engine: cache maximum memory size exceeded
[http] 2016/03/21 15:29:40 10.x.x.x - - [21/Mar/2016:15:29:40 +0000] POST /write?consistency=&db=telegraf&precision=s&rp=default HTTP/1.1 500 87 - InfluxDBClient ba72961a-ef79-11e5-845d-000000000000 31.899581ms
[write] 2016/03/21 15:29:40 write failed for shard 14 on node 1: engine: cache maximum memory size exceeded
[write] 2016/03/21 15:29:40 write failed for shard 14 on node 1: engine: cache maximum memory size exceeded
[http] 2016/03/21 15:29:40 10.x.x.x - - [21/Mar/2016:15:29:40 +0000] POST /write?consistency=&db=telegraf&precision=s&rp=default HTTP/1.1 500 87 - InfluxDBClient ba79ae8a-ef79-11e5-8460-000000000000 19.600658ms

One interesting message seems to be truncated by the Docker syslog log driver. There is also a gap in the logs from 15:00:30 until 15:29:40.
Messages like the above continued to appear until InfluxDB was restarted.
Then, after restarting it, I got:

[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000032-000000001.tsm (#9) opened in 36.507µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000033-000000001.tsm (#10) opened in 71.743µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000027-000000001.tsm (#4) opened in 35.858µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000040-000000001.tsm (#17) opened in 38.706µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000029-000000001.tsm (#6) opened in 162.049µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000036-000000001.tsm (#13) opened in 14.853µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000031-000000001.tsm (#8) opened in 10.308µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000037-000000001.tsm (#14) opened in 19.637µs
run: open server: open tsdb store: failed to open shard 14: open engine: error opening memory map for file /data/db/telegraf/default/14/000000032-000000001.tsm: init: error reading magic number of file: EOF
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000030-000000001.tsm (#7) opened in 17.784µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000042-000000001.tsm (#19) opened in 19.188µs

My specification:
  • KVM + Docker (docker-compose)
  • OS: Debian Jessie
  • Data stored on an ext4 partition (LVM)


jwilder commented Mar 21, 2016

Should be fixed via #6082

jwilder added a commit that referenced this issue Mar 21, 2016
The TSM writer uses a bufio.Writer that needs to be flushed before
it's closed.  If the flush fails for some reason, the error is not
handled by the defer and the compactor continues on as if all is good.
This can create files with truncated indexes or zero-length TSM files.

Fixes #5889
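
To illustrate the pattern the commit message describes, here is a minimal Go sketch (not the actual compactor code; both helper names are invented for this sketch) contrasting a buffered write whose flush error is dropped by a defer with one that propagates it:

```go
// Illustrative only — not the InfluxDB tsm1 compactor.
package sketch

import (
    "bufio"
    "os"
)

// writeLossy mirrors the bug: the bufio.Writer is flushed inside a defer and
// the flush error (e.g. "no space left on device") is silently dropped, so
// the caller believes the write succeeded even if the file is truncated.
func writeLossy(path string, data []byte) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    w := bufio.NewWriter(f)
    defer func() {
        w.Flush() // error ignored
        f.Close() // error ignored
    }()
    _, err = w.Write(data)
    return err
}

// writeChecked mirrors the fix: flush and close explicitly and return their
// errors, so a failed flush is reported instead of leaving a bad file behind.
func writeChecked(path string, data []byte) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    w := bufio.NewWriter(f)
    if _, err := w.Write(data); err != nil {
        f.Close()
        return err
    }
    if err := w.Flush(); err != nil {
        f.Close()
        return err
    }
    return f.Close()
}
```

With the first shape, a compaction that hits ENOSPC can leave behind exactly the zero-length or truncated-index files that later fail the magic-number check at startup.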