
Ensure that cc ng waits to populate routes - second try #132

Merged
merged 1 commit into from Mar 25, 2019

Conversation

rowanjacobs
Contributor

Improve PDRATs reliability by improving the way autoscaling, usage-service and notifications block on CAPI availability in their BBR scripts

Signed-off-by: Slawek Ligus <[email protected]>

Thanks for contributing to the capi_release. To speed up the process of reviewing your pull request please provide us with:

  • A short explanation of the proposed change:

With this change, the unlock scripts for the CC ng job will wait (60 seconds) until CC is probably back up (see the sketch below).

This is similar to our previous PR, #131, but in this PR we sleep instead of attempting to check if CC is back up. This skips all the weirdness around maintenance mode.

  • An explanation of the use cases your change solves

As a result of this change, subsequent BBR jobs will not hit a 404/502 error when trying to target the CC API endpoint.

  • Links to any other associated PRs

This is an updated version of #131.

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the develop branch

  • I have run CF Acceptance Tests on bosh lite - we have not done this because CATS does not cover this use case. We have run DRATS on a deployment of PAS 2.4 (HA) and SRT 2.4 (single CC VM) on GCP.

cc @oozie @pivotaljohn / @mcwumbly
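
A rough sketch of the approach described above, purely for orientation (not the actual post-backup-unlock ERB template; the monit path is standard BOSH, everything else is illustrative):

```bash
#!/usr/bin/env bash
# Illustrative excerpt of a cloud_controller_ng post-backup-unlock script.
set -e

# Bring Cloud Controller back up once the restore has completed.
/var/vcap/bosh/bin/monit start cloud_controller_ng

# Instead of polling CC (which ran into maintenance-mode weirdness in #131),
# wait long enough for CC to start and for its route to propagate to the
# gorouters before later BBR unlock scripts try to reach the API.
sleep 60
```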

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/164745558

The labels on this github issue will be updated when the story is started.

@cfdreddbot

✅ Hey rowanjacobs! The commit authors and yourself have already signed the CLA.

@selzoc
Member

selzoc commented Mar 20, 2019

A little worried that we don't check that cc has come up at all; we just wait 60 seconds and hope.

Maybe we could add a localhost check before the sleep, like a less-fancy version of the route registrar healthcheck script: https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng/templates/cloud_controller_api_health_check.erb
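
Something along these lines could serve as that "less-fancy" local check before the sleep (port and endpoint are assumptions here, not taken from the release):

```bash
# Poll the local Cloud Controller before sleeping, so we never wait on a
# process that has not actually come back. Port and path are illustrative.
for _ in $(seq 1 60); do
  if curl -sf "http://localhost:9022/v2/info" > /dev/null; then
    echo "Cloud Controller is responding on localhost"
    break
  fi
  sleep 1
done
```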

@tcdowney
Member

tcdowney commented Mar 22, 2019

Yeah, I think we should definitely keep the wait_for_server_to_become_healthy check (probably even on the internal address, to ensure we're hitting this particular Cloud Controller) and then add in the sleep afterward to give time for the route to register with the gorouters.

I actually think switching it to hitting the internal address will help a lot, since hitting the external might make followup Cloud Controllers unlock too early.

I'm also wondering if 60 seconds is too long. For an environment with just a few Cloud Controllers a few extra minutes is nothing, but for an environment with n Cloud Controllers we're saying that bbr will take n minutes longer. I know the BBR team is concerned with operators not running backups because of the time they take (leading to compromises like selective backups) so it's something we'll want to consider. I think since the route_registrar health checks us every six seconds, 6 should be the absolute minimum but I feel like we could get away with a max of 30.

Also noodling a bit on how #154448217 plays into all of this... 🤔 It may have inadvertently made it worse... or maybe not, since I believe dependent bbr unlocks should wait until everything claims it's unlocked regardless of routability.

[#164409886] **[Feature Improvement]** improve PDRATs reliability by improving the way autoscaling, usage-service and notifications block on CAPI availability in their BBR scripts

Signed-off-by: Slawek Ligus <[email protected]>
@tcdowney
Member

tcdowney commented Mar 25, 2019

We discussed this over a call with @oozie and @selzoc on Friday. Plan is to continue calling wait_for_server_to_become_healthy but against localhost to ensure that each API is health checking itself during unlock and to add a smaller sleep somewhere between 6-30 seconds (min and max estimated times for route propagation).
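
In script form the plan is roughly this (the helper name comes from the existing unlock templates; its exact signature, the localhost URL, and the sleep value here are assumptions):

```bash
# Health-check this API instance directly rather than through the gorouters,
# so each Cloud Controller verifies itself during its own unlock.
wait_for_server_to_become_healthy "http://localhost:9022/v2/info" 60

# Then give route_registrar (which health-checks roughly every 6 seconds)
# time to re-advertise this instance before downstream BBR scripts target
# the CC API.
sleep 30
```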

@oozie
Contributor

oozie commented Mar 25, 2019

Thanks for the meeting on Friday, @tcdowney and @selzoc.

Suggestions applied. Before the change, running DRATs resulted in a failure on the first try for both deployments.
With this change we observed 20+ consecutive successes on both normal and small footprint deployments.

@selzoc selzoc merged commit 9ae26a3 into cloudfoundry:develop Mar 25, 2019
MerricdeLauney added a commit to evanfarrar/capi-release that referenced this pull request Mar 15, 2023
Putting the sleep back in after reviewing this [conversation](cloudfoundry#132). It seems to have been added to allow time for route propagation
MerricdeLauney added a commit that referenced this pull request Mar 15, 2023
* Update post-backup-unlock.sh.erb

If the timeout fails then the workers never get started, but monit will eventually restart the web process if the CF install eventually recovers, leaving a VM that is half working (with an unhealthy bosh state) after the script runs.

We could also change the exit behavior of the timeout with `set +x` (or is it e? I forget), but it would seem that the only point of timing out is to alert the operator to a possible issue, since the CC API can still restart.

* Update post-backup-unlock.sh.erb

Putting the sleep back in after reviewing this [conversation](#132). It seems to have been added to allow time for route propagation

---------

Co-authored-by: MerricdeLauney <[email protected]>
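
On the aside in the first commit message above: `set +x` only turns off command tracing; the option that lets a script running under `set -e` survive a failed `timeout` is `set +e`. A minimal sketch with illustrative names and values:

```bash
#!/usr/bin/env bash
set -e

# Tolerate a failed health-check timeout so the worker processes still get
# started; monit will recover the web process later if CF comes back.
set +e
timeout 60 bash -c 'until curl -sf http://localhost:9022/v2/info > /dev/null; do sleep 1; done'
health_check_status=$?
set -e

if [ "${health_check_status}" -ne 0 ]; then
  echo "WARNING: Cloud Controller did not become healthy within 60s; continuing anyway" >&2
fi

# Start the worker processes regardless (job name illustrative).
/var/vcap/bosh/bin/monit start cloud_controller_worker_1
```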