[0.10.1-1] crash on startup without error possibly after disk full #5889

Closed
pbooth opened this issue Mar 2, 2016 · 9 comments · Fixed by #6082


pbooth commented Mar 2, 2016

Hello,

I have a 0.10.1-1 InfluxDB instance that is failing to start up. I am almost certain that the host had a disk-full situation sometime yesterday. This is what I see logged:

2016/03/02 10:02:21 InfluxDB starting, version 0.10.1, branch HEAD, commit df902a4b077bb270984303b8e4f8a320e3954b40, built 2016-02-18T20:44:27.807242
2016/03/02 10:02:21 Go version go1.4.3, GOMAXPROCS set to 16
2016/03/02 10:02:21 Using configuration at: /etc/influxdb/influxdb.conf
[meta] 2016/03/02 10:02:21 Starting meta service
[meta] 2016/03/02 10:02:21 Listening on HTTP: 127.0.0.1:8091
[metastore] 2016/03/02 10:02:21 Using data dir: /data/fusionio/fio01/apps/tpeng/influxdb/meta
[metastore] 2016/03/02 10:02:21 Node at localhost:8087 [Follower]
[metastore] 2016/03/02 10:02:22 Node at localhost:8087 [Leader]. peers=[localhost:8087]
[meta] 2016/03/02 10:02:27 127.0.0.1 - - [02/Mar/2016:10:02:27 -0500] GET /?index=0 HTTP/1.1 200 68129 - Go 1.1 package http c771bdeb-e087-11e5-8001-000000000000 40.044831ms
[store] 2016/03/02 10:02:27 Using data dir: /data/fusionio/fio01/apps/tpeng/influxdb/data
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL writing to /data/fusionio/fio01/apps/tpeng/influxdb/wal/_internal/monitor/4909
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4909/000000001-000000001.tsm (#0) opened in 82.923µs
[cacheloader] 2016/03/02 10:02:27 reading file /data/fusionio/fio01/apps/tpeng/influxdb/wal/_internal/monitor/4909/_00004.wal, size 0
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL writing to /data/fusionio/fio01/apps/tpeng/influxdb/wal/_internal/monitor/4921
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4921/000000001-000000001.tsm (#0) opened in 63.209µs
[cacheloader] 2016/03/02 10:02:27 reading file /data/fusionio/fio01/apps/tpeng/influxdb/wal/_internal/monitor/4921/_00005.wal, size 0
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/02 10:02:27 tsm1 WAL writing to /data/fusionio/fio01/apps/tpeng/influxdb/wal/_internal/monitor/4928
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000001-000000001.tsm (#0) opened in 34.36µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000002-000000001.tsm (#1) opened in 44.974µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000004-000000001.tsm (#3) opened in 11.674µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000003-000000001.tsm (#2) opened in 9.611µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000005-000000001.tsm (#4) opened in 12.349µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000007-000000001.tsm (#6) opened in 26.804µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000011-000000001.tsm (#10) opened in 10.482µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000016-000000001.tsm (#15) opened in 13.799µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000017-000000001.tsm (#16) opened in 9.799µs
...
...
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000013-000000001.tsm (#12) opened in 11.964µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000063-000000001.tsm (#62) opened in 10.446µs
run: open server: open tsdb store: failed to open shard 4928: open engine: error opening memory map for file /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000001-000000001.tsm: init: error reading magic number of file: EOF
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000091-000000001.tsm (#90) opened in 9.914µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000094-000000001.tsm (#93) opened in 11.045µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000040-000000001.tsm (#39) opened in 16.799µs
[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000092-000000001.tsm (#91) opened in 18.218µs

and then it dies


jwilder commented Mar 2, 2016

@pbooth Dies as in the process is killed and nothing in the logs? Anything in your syslog? Or is there more in the logs not shown?

Only thing I see is:

run: open server: open tsdb store: failed to open shard 4928: open engine: error opening memory map for file /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000001-000000001.tsm: init: error reading magic number of file: EOF

Which looks like the file is truncated. It's odd because there is a log line earlier that read the same file just fine:

[filestore]2016/03/02 10:02:27 /data/fusionio/fio01/apps/tpeng/influxdb/data/_internal/monitor/4928/000000001-000000001.tsm (#0) opened in 34.36µs

Once that file is written, it is never modified so I'm not sure how that would have happened. It looks like your filesystem (or something else) somehow corrupted that file entry or is not returning the data for the file when requested. Also, the file naming does not look like these TSM files were created by influxdb.

Were they converted from some other format?
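
For reference, here is a minimal Go sketch (not the actual tsm1 reader; the magic value is just a placeholder for this sketch) of how a fixed-size magic-number check at open turns a zero-length file into the "error reading magic number of file: EOF" seen above:

```go
// Illustrative only — not the InfluxDB tsm1 code.
package main

import (
    "encoding/binary"
    "fmt"
    "io"
    "os"
)

// magic is a placeholder 4-byte file signature for this sketch.
const magic uint32 = 0x16D116D1

func checkMagic(path string) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    var hdr [4]byte
    // io.ReadFull returns io.EOF for an empty file (and io.ErrUnexpectedEOF
    // for a short read), which is how the startup error above surfaces.
    if _, err := io.ReadFull(f, hdr[:]); err != nil {
        return fmt.Errorf("init: error reading magic number of file: %v", err)
    }
    if got := binary.BigEndian.Uint32(hdr[:]); got != magic {
        return fmt.Errorf("init: invalid magic number %#08x", got)
    }
    return nil
}

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: checkmagic <file.tsm>")
        os.Exit(2)
    }
    if err := checkMagic(os.Args[1]); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("magic number OK")
}
```

Running something like this against the file named in the error would show whether it is simply zero-length or truncated rather than corrupted in place.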


eoltean commented Mar 21, 2016

@jwilder I'm running into the same problem as pbooth on my local machine, where I know the disk never became full and I never modified or converted the files; they came from a fresh install of InfluxDB 0.10.1-1.

sstarcher commented Mar 21, 2016

Looks like I may be having a similar issue. We had some networking problems on our servers and InfluxDB is now angry. The disk is only 5% full.

[filestore] 2016/03/21 17:32:43 /data/db/_internal/monitor/216/000000004-000000001.tsm (#1) opened in 35.983µs
run: open server: open tsdb store: failed to open shard 216: open engine: error opening memory map for file /data/db/_internal/monitor/216/000000004-000000001.tsm: init: error reading magic number of file: EOF

It looks like /data/db/_internal/monitor has folders for 1-7 and certainly not 216.

jwilder added this to the 0.11.0 milestone Mar 21, 2016

sstarcher commented Mar 21, 2016

After a crash and restarting it, I get the same issue. I'm going to revert my data mount to a backup; I'll take a snapshot of the bad data in case you need it for debugging.

[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/212/000000008-000000001.tsm (#2) opened in 166.38µs
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/212/000000004-000000003.tsm (#0) opened in 350.362µs
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/212/000000007-000000002.tsm (#1) opened in 441.323µs
[cacheloader] 2016/03/21 17:38:16 reading file /data/db/wal/_internal/monitor/212/_00053.wal, size 0
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL writing to /data/db/wal/_internal/monitor/213
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/213/000000008-000000001.tsm (#1) opened in 132.239µs
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/213/000000007-000000003.tsm (#0) opened in 424.018µs
[cacheloader] 2016/03/21 17:38:16 reading file /data/db/wal/_internal/monitor/213/_00043.wal, size 0
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL writing to /data/db/wal/_internal/monitor/214
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/214/000000008-000000001.tsm (#1) opened in 121.161µs
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/214/000000007-000000003.tsm (#0) opened in 282.746µs
[cacheloader] 2016/03/21 17:38:16 reading file /data/db/wal/_internal/monitor/214/_00039.wal, size 0
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL writing to /data/db/wal/_internal/monitor/215
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/215/000000008-000000001.tsm (#1) opened in 125.848µs
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/215/000000007-000000003.tsm (#0) opened in 519.078µs
[cacheloader] 2016/03/21 17:38:16 reading file /data/db/wal/_internal/monitor/215/_00035.wal, size 0
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/03/21 17:38:16 tsm1 WAL writing to /data/db/wal/_internal/monitor/216
[filestore] 2016/03/21 17:38:16 /data/db/_internal/monitor/216/000000004-000000001.tsm (#1) opened in 60.945µs
run: open server: open tsdb store: failed to open shard 216: open engine: error opening memory map for file /data/db/_internal/monitor/216/000000004-000000001.tsm: init: error reading magic number of file: EOF


jwilder commented Mar 21, 2016

@sstarcher What filesystem are you using? Can you describe your hardware setup?

sstarcher commented Mar 21, 2016

  • AWS
  • OS: CoreOS
  • InfluxDB Version - v0.10.2
  • InfluxDB running in Kubernetes with the data backed by EBS


pbooth commented Mar 21, 2016

Mine is a physical dual Sandy Bridge server running CentOS 6.3/6.4 with Fusion-io drives.



ukaj commented Mar 21, 2016

Probably the same issue with InfluxDB 0.10.1; it happened after running out of disk space:

[http] 2016/03/21 15:00:30 10.x.x.x - - [21/Mar/2016:15:00:30 +0000] POST /write?consistency=&db=telegraf&precision=s&rp=default HTTP/1.1 204 0 - InfluxDBClient a75ae98e-ef75-11e5-a818-000000000000 8.584798ms
[write] 2016/03/21 15:00:30 write failed for shard 14 on node 1: engine: error writing WAL entry: write /data/wal/telegraf/default/14/_00120.wal: no space left on device
time="2016-03-21T16:00:30.824890486+01:00" level=error msg="Failed to log msg \"\\tat org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1450)\" for logger json-file: write /var/lib/docker/containers/910ab5
[http] 2016/03/21 15:29:40 10.x.x.x - - [21/Mar/2016:15:29:40 +0000] POST /write?consistency=&db=telegraf&precision=s&rp=default HTTP/1.1 500 87 - InfluxDBClient ba6a1713-ef79-11e5-8453-000000000000 60.788321ms
[write] 2016/03/21 15:29:40 write failed for shard 14 on node 1: engine: cache maximum memory size exceeded
[http] 2016/03/21 15:29:40 10.x.x.x - - [21/Mar/2016:15:29:40 +0000] POST /write?consistency=&db=telegraf&precision=s&rp=default HTTP/1.1 500 87 - InfluxDBClient ba6d2580-ef79-11e5-8457-000000000000 60.336608ms
[write] 2016/03/21 15:29:40 write failed for shard 14 on node 1: engine: cache maximum memory size exceeded
[http] 2016/03/21 15:29:40 10.x.x.x - - [21/Mar/2016:15:29:40 +0000] POST /write?consistency=&db=telegraf&precision=s&rp=default HTTP/1.1 500 87 - InfluxDBClient ba72961a-ef79-11e5-845d-000000000000 31.899581ms
[write] 2016/03/21 15:29:40 write failed for shard 14 on node 1: engine: cache maximum memory size exceeded
[write] 2016/03/21 15:29:40 write failed for shard 14 on node 1: engine: cache maximum memory size exceeded
[http] 2016/03/21 15:29:40 10.x.x.x - - [21/Mar/2016:15:29:40 +0000] POST /write?consistency=&db=telegraf&precision=s&rp=default HTTP/1.1 500 87 - InfluxDBClient ba79ae8a-ef79-11e5-8460-000000000000 19.600658ms

One interesting message seems to be truncated by the Docker syslog log driver. There is also a gap in the logs from 15:00:30 until 15:29:40.
Messages like the above continued to appear until InfluxDB was restarted.
Then, after restarting it, I got:

[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000032-000000001.tsm (#9) opened in 36.507µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000033-000000001.tsm (#10) opened in 71.743µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000027-000000001.tsm (#4) opened in 35.858µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000040-000000001.tsm (#17) opened in 38.706µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000029-000000001.tsm (#6) opened in 162.049µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000036-000000001.tsm (#13) opened in 14.853µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000031-000000001.tsm (#8) opened in 10.308µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000037-000000001.tsm (#14) opened in 19.637µs
run: open server: open tsdb store: failed to open shard 14: open engine: error opening memory map for file /data/db/telegraf/default/14/000000032-000000001.tsm: init: error reading magic number of file: EOF
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000030-000000001.tsm (#7) opened in 17.784µs
[filestore]2016/03/21 15:51:33 /data/db/telegraf/default/14/000000042-000000001.tsm (#19) opened in 19.188µs

My specification:
  • KVM + Docker (docker-compose)
  • OS: Debian Jessie
  • Data stored on an ext4 partition (LVM)


jwilder commented Mar 21, 2016

Should be fixed via #6082

jwilder added a commit that referenced this issue Mar 21, 2016
The TSM writer uses a bufio.Writer that needs to be flushed before
it's closed.  If the flush fails for some reason, the error is not
handled by the defer and the compactor continues on as if all is good.
This can create files with truncated indexes or zero-length TSM files.

Fixes #5889
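
To illustrate the pattern the commit message describes, here is a minimal Go sketch (not the actual compactor code; both helper names are invented for this sketch) contrasting a buffered write whose flush error is dropped by a defer with one that propagates it:

```go
// Illustrative only — not the InfluxDB tsm1 compactor.
package sketch

import (
    "bufio"
    "os"
)

// writeLossy mirrors the bug: the bufio.Writer is flushed inside a defer and
// the flush error (e.g. "no space left on device") is silently dropped, so
// the caller believes the write succeeded even if the file is truncated.
func writeLossy(path string, data []byte) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    w := bufio.NewWriter(f)
    defer func() {
        w.Flush() // error ignored
        f.Close() // error ignored
    }()
    _, err = w.Write(data)
    return err
}

// writeChecked mirrors the fix: flush and close explicitly and return their
// errors, so a failed flush is reported instead of leaving a bad file behind.
func writeChecked(path string, data []byte) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    w := bufio.NewWriter(f)
    if _, err := w.Write(data); err != nil {
        f.Close()
        return err
    }
    if err := w.Flush(); err != nil {
        f.Close()
        return err
    }
    return f.Close()
}
```

With the first shape, a compaction that hits ENOSPC can leave behind exactly the zero-length or truncated-index files that later fail the magic-number check at startup.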