Intermittent "Stats server temporarily unavailable" after BBR unlock #115

Closed
ljfranklin opened this issue Dec 14, 2018 · 7 comments

@ljfranklin (Contributor)
Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

We intermittently see the BBR DRATs suite fail in our CI. The underlying cause is that we have components that wait for CAPI's BBR unlock script to finish and then attempt to make API requests to CAPI. Occasionally (several times a week for PAS RelEng), one of these components gets the following response from CAPI:

+ cf app autoscale
Showing health and status for app autoscale in org system / space autoscaling as admin...

Stats unavailable: Stats server temporarily unavailable.
FAILED

Could the CAPI BBR unlock scripts be updated to ensure that all necessary components are ready before they finish? Or is this a Diego issue? Honestly, I wouldn't be opposed to a sleep 60 at the end of your script to brute-force our way around these edge cases.

Context

Send additional questions to the PAS RelEng team.

Steps to Reproduce

Attempt to run cf app FOO immediately after the CAPI unlock script exits. This is an intermittent error, so it may not fail every time.
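One way to exercise it from a shell right after the unlock script returns (a rough sketch; the app name is illustrative, and with the race present only some iterations fail):

# Hammer the stats endpoint immediately after the CAPI unlock script exits.
for i in $(seq 1 20); do
  # Intermittently prints: Stats unavailable: Stats server temporarily unavailable.
  cf app autoscale || echo "iteration ${i}: stats lookup failed"
done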

Expected result

cf app returns app info

Current result

Sometimes cf app returns Stats server temporarily unavailable.

Possible Fix

  • Ensure every unlock script in CF waits the right amount of time, OR
  • Add sleep 60 to the CAPI unlock script :) (a sketch of what that could look like follows this list)
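A rough sketch of the brute-force option, modeled on a typical bbr/post-restore-unlock template (the script path and surrounding logic here are assumptions on my part, not capi-release's actual script):

#!/usr/bin/env bash
# Hypothetical tail of a jobs/<job>/templates/bbr/post-restore-unlock script:
# after the real unlock work, pause so dependent components (bbs, log-cache,
# trafficcontroller) have a chance to come up before callers hit the API.
set -eu

# ... existing unlock logic ...

sleep 60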
@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/162668570

The labels on this github issue will be updated when the story is started.

@cwlbraa (Contributor) commented Dec 14, 2018

Hi @ljfranklin,

If we could declare something like "diego's bbs needs to be unlocked first," that error wouldn't happen... but BBR doesn't let us define unlock order dependencies, right? We can add the sleep to make your life easier, but it feels pretty shoddy...

@tcdowney (Member)

@cwlbraa this error could also be due to various Loggregator components not being healthy yet (either trafficcontroller or log-cache). Neither bbs nor trafficcontroller/log-cache has a durable database, so I don't think they are even bbr-aware to begin with. 😞

UAA has gotten good mileage out of their sleep, for what it's worth. 🤷‍♂️ Agreed it doesn't feel the best, though...

https://github.com/cloudfoundry/uaa-release/blob/dd655638b44350a19f9a55bc2c29435dd7d12696/jobs/uaa/templates/bbr/post-restore-unlock.sh.erb#L8

@ljfranklin (Contributor, Author)

@cwlbraa you can specify order dependencies with backup_should_be_locked_before: https://docs.cloudfoundry.org/bbr/bbr-devguide.html#job-configuration. But as Tim mentioned, it might not help unless all the components involved have BBR scripts.
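Roughly, per that guide, a job declares the ordering via a bbr metadata script that prints YAML. A sketch (job/release names here are illustrative, and if I'm reading the docs right, jobs locked before others get unlocked after them, which is the behavior we'd want):

#!/usr/bin/env bash
# Sketch of a BBR metadata script (e.g. jobs/<job>/templates/bbr/metadata):
# prints YAML declaring that this job should be locked before the listed jobs.
echo "---
backup_should_be_locked_before:
- job_name: bbs
  release: diego
restore_should_be_locked_before:
- job_name: bbs
  release: diego
"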

@cwlbraa (Contributor) commented Dec 15, 2018

do what works, i guess? ¯\_(ツ)_/¯

@tcdowney (Member)

@ljfranklin:

> Or is this a Diego issue?

Do you happen to have logs from the api and diego-api VMs from when this situation occurs? Thinking about it more, we're a bit surprised that BBS is unavailable, given that it does not actually interact with bbr stuff. It's possible that contention on the internal MySQL Galera cluster is affecting its access to Locket or its non-durable database (maybe 😅)... but it's hard to tell without logs.
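For reference, one way to pull those (assuming a stock BOSH deployment named cf):

# Fetch job logs from the relevant instance groups via the BOSH CLI.
bosh -d cf logs api
bosh -d cf logs diego-api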

We want to make sure that we're adding a sleep for the right reason since the feedback cycle on these things can be pretty long.

@tcdowney (Member) commented Mar 27, 2019

We believe this PR addresses this issue, @ljfranklin:
#132

It doesn't actually address the Stats server temporarily unavailable error you may occasionally see during cf start/cf push, but that's unrelated to the BBR process, and recent work such as switching to log-cache and adding retry logic should help with the reliability of that.
