blobstore job fails to start after vm crash or reboot #25

sykesm · 2016-07-20T14:23:08Z

Issue

The blobstore job fails to start after the VM is rebooted.

Context

The nginx.stderr.log from the failure shows the following line hundreds of times:

nginx: [emerg] open() "/var/vcap/sys/run/blobstore/nginx.pid" failed (2: No such file or directory)

The control script appears to rely on the pre-start script to set up that directory:

function setup_blobstore_directories {
  local run_dir=/var/vcap/sys/run/blobstore
  local log_dir=/var/vcap/sys/log/blobstore
  local data=/var/vcap/store/shared
  local tmp_dir=$data/tmp/uploads
  local nginx_webdav_dir=/var/vcap/packages/nginx_webdav

  mkdir -p $run_dir
  mkdir -p $log_dir
  mkdir -p $data
  mkdir -p $tmp_dir
  chown -R vcap:vcap $run_dir $log_dir $data $tmp_dir $nginx_webdav_dir "${nginx_webdav_dir}/.."
}

According to the timestamps on the log files, the pre-start script last ran three days before the reboot:

-rw-r--r-- 1 vcap vcap 41200 Jul 20 14:21 nginx.stderr.log
-rw-r--r-- 1 vcap vcap     0 Jul 17 17:23 nginx.stdout.log
-rw-r----- 1 vcap vcap   622 Jul 17 17:23 pre-start.stderr.log
-rw-r----- 1 vcap vcap     0 Jul 17 17:23 pre-start.stdout.log

Unfortunately, most of the directories created by that script live on temporary file systems that BOSH sets up. In particular, /var/vcap/sys/run:

# df /var/vcap/data/sys/run
Filesystem     1K-blocks  Used Available Use% Mounted on
tmpfs               1024    16      1008   2% /var/vcap/data/sys/run

Since it's a tmpfs, it's a memory-only file system and all of its data is lost on reboot. That means the directory used for the nginx pidfile is gone when the blobstore control script starts.
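The failure mode can be reproduced without a BOSH VM. The following is a minimal sketch, not the release's actual scripts: a throwaway directory stands in for the tmpfs, and the function names (`pre_start`, `simulate_reboot`, `write_pidfile`, `start_with_mkdir`) are hypothetical. It illustrates that a directory created once by pre-start is missing after the simulated reboot, and that recreating it at start time avoids the error.

```shell
#!/usr/bin/env bash
# Sketch only: a temp directory stands in for the tmpfs at /var/vcap/sys/run.
# All function names here are hypothetical, not taken from the release.
demo_root="$(mktemp -d)"
run_dir="$demo_root/sys/run/blobstore"

# What pre-start does once, at deploy time.
pre_start() { mkdir -p "$run_dir"; }

# A reboot empties a tmpfs; emulate that by recreating the mount point empty.
simulate_reboot() {
  rm -rf "$demo_root/sys/run"
  mkdir -p "$demo_root/sys/run"
}

# What a control script does when it assumes the directory already exists.
write_pidfile() { echo "$$" > "$run_dir/nginx.pid"; }

# The fix: make sure the directory exists immediately before using it.
start_with_mkdir() {
  mkdir -p "$run_dir"
  write_pidfile
}

pre_start
simulate_reboot

if write_pidfile 2>/dev/null; then
  before_fix="started"
else
  before_fix="failed"   # the redirect fails: No such file or directory
fi

if start_with_mkdir; then
  after_fix="started"
else
  after_fix="failed"
fi

echo "without mkdir at start: $before_fix"
echo "with mkdir at start:    $after_fix"
```

Because `mkdir -p` is idempotent, running it on every start costs nothing when the directory is already present.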

Steps to Reproduce

Deploy Cloud Foundry
Reboot the blobstore_z1 job

Expected result

The blobstore job recovers when the reboot is complete.

Current result

The blobstore job fails to recover. This causes the cloud controllers, cloud controller workers, and the runtimes to fail.

cf-gitbot · 2016-07-20T14:23:10Z

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/126686223

The labels on this github issue will be updated when the story is started.

sykesm · 2016-07-20T14:26:58Z

It appears the same problem exists with the cloud controller:

[2016-07-20 14:21:03+0000] ------------ STARTING cloud_controller_worker_ctl at Wed Jul 20 14:21:03 UTC 2016 --------------
[2016-07-20 14:21:03+0000] /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_worker_ctl: line 40: /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_2.pid: No such file or directory
[2016-07-20 14:22:43+0000] ------------ STARTING cloud_controller_worker_ctl at Wed Jul 20 14:22:43 UTC 2016 --------------
[2016-07-20 14:22:43+0000] /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_worker_ctl: line 40: /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_1.pid: No such file or directory
[2016-07-20 14:23:13+0000] ------------ STARTING cloud_controller_worker_ctl at Wed Jul 20 14:23:13 UTC 2016 --------------
[2016-07-20 14:23:13+0000] /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_worker_ctl: line 40: /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_2.pid: No such file or directory
[2016-07-20 14:24:53+0000] ------------ STARTING cloud_controller_worker_ctl at Wed Jul 20 14:24:53 UTC 2016 --------------
[2016-07-20 14:24:53+0000] /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_worker_ctl: line 40: /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_1.pid: No such file or directory
[2016-07-20 14:25:23+0000] ------------ STARTING cloud_controller_worker_ctl at Wed Jul 20 14:25:23 UTC 2016 --------------
[2016-07-20 14:25:23+0000] /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_worker_ctl: line 40: /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_2.pid: No such file or directory

sykesm · 2016-07-20T14:45:21Z

Similar issues in consul: cloudfoundry-attic/consul-release#31

sax · 2016-07-21T15:29:36Z

We've made fixes in these two commits:
cloudfoundry/cloud_controller_ng@232f748
2cc5a9b

This will ensure that the directories in /var/vcap/sys/run are recreated when services start after a reboot.

When we switched our start commands to run as non-root, we moved all directory creation into pre-start, because we ran into problems with /var/vcap/sys/log not giving write permission to non-root users. It turns out that /var/vcap/sys/run gives write permission to the vcap group, so we can make those directories in start scripts.
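A minimal sketch of that approach, assuming a start script that runs as the vcap user. The helper name `ensure_run_dir` and its error message are hypothetical and are not taken from the commits above; the only behaviour it relies on is that the parent directory is writable by the calling user, as described for /var/vcap/sys/run.

```shell
#!/usr/bin/env bash
# Hypothetical helper for a start script; not the code from the referenced commits.
# Creates a job's run directory if the parent is writable by the current user.
ensure_run_dir() {
  local dir="$1"
  local parent
  parent="$(dirname "$dir")"

  # Nothing to do if the directory survived (e.g. a restart without a reboot).
  [ -d "$dir" ] && return 0

  # Fail with a clear message instead of a confusing pidfile error later.
  if [ ! -w "$parent" ]; then
    echo "cannot create $dir: $parent is missing or not writable" >&2
    return 1
  fi

  mkdir "$dir"
}
```

In a control script this would be called before anything writes a pidfile, for example `ensure_run_dir /var/vcap/sys/run/blobstore || exit 1`.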

Let us know if this fixes the issue for you, and close the issue if it does!

@sax && @adowns01

sykesm · 2016-07-23T16:23:23Z

Thanks.

cf-gitbot added the unscheduled label Jul 20, 2016

cf-gitbot added scheduled and removed unscheduled labels Jul 20, 2016

sykesm mentioned this issue Jul 20, 2016

DNS modifications are lost on system reboot with stemcell 3262.2 and consul-release 97 cloudfoundry-attic/consul-release#31

Closed

cf-gitbot added in progress and removed scheduled labels Jul 20, 2016

cf-gitbot added delivered accepted and removed in progress delivered accepted labels Jul 21, 2016

sykesm closed this as completed Jul 23, 2016

cf-gitbot added delivered accepted and removed delivered labels Jul 23, 2016

jsievers mentioned this issue Jan 5, 2018

postgres fails to restart after VM reboot cloudfoundry/postgres-release#33

Closed

capi-bot added a commit that referenced this issue Jun 22, 2023

Bump src/code.cloudfoundry.org/tps

3424f67

Bump src/code.cloudfoundry.org/tps dependabot[bot]: Bump code.cloudfoundry.org/lager/v3 from 3.0.1 to 3.0.2 (#25) Bump github.com/cloudfoundry/dropsonde from 1.0.0 to 1.1.0 (#24)
