Intermittent "Stats server temporarily unavailable" after BBR unlock #115

Closed
ljfranklin opened this issue Dec 14, 2018 · 7 comments

@ljfranklin (Contributor)
Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

We intermittently see the BBR DRATs suite fail in our CI. The underlying cause is that we have components that wait for CAPI's BBR unlock script to finish and then attempt to make API requests to CAPI. Occasionally (several times a week for PAS RelEng), one of these components gets the following response from CAPI:

+ cf app autoscale
Showing health and status for app autoscale in org system / space autoscaling as admin...

Stats unavailable: Stats server temporarily unavailable.
FAILED

Could the CAPI BBR unlock scripts be updated to ensure that all necessary components are ready before they finish? Or is this a Diego issue? Honestly, I wouldn't be opposed to a sleep 60 at the end of your script to brute-force our way around these edge cases.

Context

Send additional questions to the PAS RelEng team.

Steps to Reproduce

Attempt to run cf app FOO immediately after the CAPI unlock script exits. This is an intermittent error, so it may not fail every time.
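One way to exercise it from a shell right after the unlock script returns (a rough sketch; the app name is illustrative, and with the race present only some iterations fail):

# Hammer the stats endpoint immediately after the CAPI unlock script exits.
for i in $(seq 1 20); do
  # Intermittently prints: Stats unavailable: Stats server temporarily unavailable.
  cf app autoscale || echo "iteration ${i}: stats lookup failed"
done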

Expected result

cf app returns app info

Current result

Sometimes cf app returns Stats server temporarily unavailable.

Possible Fix

  • Ensure every unlock script in CF waits the right amount of time, OR
  • Add sleep 60 to the CAPI unlock script :) (a sketch of what that could look like follows this list)
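A rough sketch of the brute-force option, modeled on a typical bbr/post-restore-unlock template (the script path and surrounding logic here are assumptions on my part, not capi-release's actual script):

#!/usr/bin/env bash
# Hypothetical tail of a jobs/<job>/templates/bbr/post-restore-unlock script:
# after the real unlock work, pause so dependent components (bbs, log-cache,
# trafficcontroller) have a chance to come up before callers hit the API.
set -eu

# ... existing unlock logic ...

sleep 60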
@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/162668570

The labels on this github issue will be updated when the story is started.

@cwlbraa (Contributor) commented Dec 14, 2018

Hi @ljfranklin,

If we could declare something like "diego's bbs needs to be unlocked first," that error wouldn't happen... but BBR doesn't let us define unlock order dependencies, right? We can add the sleep to make your life easier, but it feels pretty shoddy...

@tcdowney (Member)

@cwlbraa this error could also be due to various Loggregator components not being healthy yet (either trafficcontroller or log-cache). Neither bbs nor trafficcontroller/log-cache has a durable database, so I don't think they are even bbr-aware to begin with. 😞

UAA has gotten good mileage out of their sleep, for what it's worth. 🤷‍♂️ Agreed it doesn't feel the best, though...

https://github.com/cloudfoundry/uaa-release/blob/dd655638b44350a19f9a55bc2c29435dd7d12696/jobs/uaa/templates/bbr/post-restore-unlock.sh.erb#L8

@ljfranklin (Contributor, Author)

@cwlbraa you can specify order dependencies with backup_should_be_locked_before: https://docs.cloudfoundry.org/bbr/bbr-devguide.html#job-configuration. But as Tim mentioned, it might not help unless all the components involved have BBR scripts.
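Roughly, per that guide, a job declares the ordering via a bbr metadata script that prints YAML. A sketch (job/release names here are illustrative, and if I'm reading the docs right, jobs locked before others get unlocked after them, which is the behavior we'd want):

#!/usr/bin/env bash
# Sketch of a BBR metadata script (e.g. jobs/<job>/templates/bbr/metadata):
# prints YAML declaring that this job should be locked before the listed jobs.
echo "---
backup_should_be_locked_before:
- job_name: bbs
  release: diego
restore_should_be_locked_before:
- job_name: bbs
  release: diego
"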

@cwlbraa (Contributor) commented Dec 15, 2018

do what works, i guess? ¯\_(ツ)_/¯

@tcdowney (Member)

@ljfranklin:

> Or is this a Diego issue?

Do you happen to have logs from the api and diego-api VMs from when this situation occurs? Thinking about it more, we're a bit surprised that BBS is unavailable, given that it does not actually interact with bbr stuff. It's possible that contention on the internal MySQL Galera cluster is affecting its access to Locket or its non-durable database (maybe 😅)... but it's hard to tell without logs.
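For reference, one way to pull those (assuming a stock BOSH deployment named cf):

# Fetch job logs from the relevant instance groups via the BOSH CLI.
bosh -d cf logs api
bosh -d cf logs diego-api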

We want to make sure that we're adding a sleep for the right reason since the feedback cycle on these things can be pretty long.

@tcdowney (Member) commented Mar 27, 2019

We believe this PR addresses this issue, @ljfranklin:
#132

It doesn't actually address the Stats server temporarily unavailable error you may occasionally see during cf start/cf push, but that's unrelated to the BBR process, and recent work such as switching to log-cache and adding retry logic should help with the reliability of that.
