
Ensure that cc ng waits to populate routes - second try #132

Merged
merged 1 commit into from Mar 25, 2019

Conversation

rowanjacobs
Contributor

Improve PDRATs reliability by improving the way autoscaling, usage-service and notifications block on CAPI availability in their BBR scripts

Signed-off-by: Slawek Ligus <[email protected]>

Thanks for contributing to the capi_release. To speed up the process of reviewing your pull request please provide us with:

  • A short explanation of the proposed change:

With this change, the unlock scripts for the CC ng job will wait (60 seconds) until CC is probably back up (see the sketch below).

This is similar to our previous PR, #131, but in this PR we sleep instead of attempting to check if CC is back up. This skips all the weirdness around maintenance mode.

  • An explanation of the use cases your change solves

As a result of this change, subsequent BBR jobs will not hit a 404/502 error when trying to target the CC API endpoint.

  • Links to any other associated PRs

This is an updated version of #131.

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the develop branch

  • I have run CF Acceptance Tests on bosh lite - we have not done this because CATS does not cover this use case. We have run DRATS on a deployment of PAS 2.4 (HA) and SRT 2.4 (single CC VM) on GCP.

cc @oozie @pivotaljohn / @mcwumbly
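
A rough sketch of the approach described above, purely for orientation (not the actual post-backup-unlock ERB template; the monit path is standard BOSH, everything else is illustrative):

```bash
#!/usr/bin/env bash
# Illustrative excerpt of a cloud_controller_ng post-backup-unlock script.
set -e

# Bring Cloud Controller back up once the restore has completed.
/var/vcap/bosh/bin/monit start cloud_controller_ng

# Instead of polling CC (which ran into maintenance-mode weirdness in #131),
# wait long enough for CC to start and for its route to propagate to the
# gorouters before later BBR unlock scripts try to reach the API.
sleep 60
```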

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/164745558

The labels on this github issue will be updated when the story is started.

@cfdreddbot

✅ Hey rowanjacobs! The commit authors and yourself have already signed the CLA.

@selzoc
Member

selzoc commented Mar 20, 2019

A little worried that we don't check that cc has come up at all; we just wait 60 seconds and hope.

Maybe we could add a localhost check before the sleep, like a less-fancy version of the route registrar healthcheck script: https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng/templates/cloud_controller_api_health_check.erb
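
Something along these lines could serve as that "less-fancy" local check before the sleep (port and endpoint are assumptions here, not taken from the release):

```bash
# Poll the local Cloud Controller before sleeping, so we never wait on a
# process that has not actually come back. Port and path are illustrative.
for _ in $(seq 1 60); do
  if curl -sf "http://localhost:9022/v2/info" > /dev/null; then
    echo "Cloud Controller is responding on localhost"
    break
  fi
  sleep 1
done
```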

@tcdowney
Member

tcdowney commented Mar 22, 2019

Yeah, I think we should definitely keep the wait_for_server_to_become_healthy check (probably even on the internal address, to ensure we're hitting this particular Cloud Controller) and then add in the sleep afterward to give time for the route to register with the gorouters.

I actually think switching it to hitting the internal address will help a lot, since hitting the external might make followup Cloud Controllers unlock too early.

I'm also wondering if 60 seconds is too long. For an environment with just a few Cloud Controllers a few extra minutes is nothing, but for an environment with n Cloud Controllers we're saying that bbr will take n minutes longer. I know the BBR team is concerned with operators not running backups because of the time they take (leading to compromises like selective backups) so it's something we'll want to consider. I think since the route_registrar health checks us every six seconds, 6 should be the absolute minimum but I feel like we could get away with a max of 30.

Also noodling a bit on how #154448217 plays into all of this... 🤔 It may have inadvertently made it worse... or maybe not, since I believe dependent bbr unlocks should wait until everything claims it's unlocked regardless of routability.

[#164409886] **[Feature Improvement]** improve PDRATs reliability by improving the way autoscaling, usage-service and notifications block on CAPI availability in their BBR scripts

Signed-off-by: Slawek Ligus <[email protected]>
@tcdowney
Member

tcdowney commented Mar 25, 2019

We discussed this over a call with @oozie and @selzoc on Friday. Plan is to continue calling wait_for_server_to_become_healthy but against localhost to ensure that each API is health checking itself during unlock and to add a smaller sleep somewhere between 6-30 seconds (min and max estimated times for route propagation).
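
In script form the plan is roughly this (the helper name comes from the existing unlock templates; its exact signature, the localhost URL, and the sleep value here are assumptions):

```bash
# Health-check this API instance directly rather than through the gorouters,
# so each Cloud Controller verifies itself during its own unlock.
wait_for_server_to_become_healthy "http://localhost:9022/v2/info" 60

# Then give route_registrar (which health-checks roughly every 6 seconds)
# time to re-advertise this instance before downstream BBR scripts target
# the CC API.
sleep 30
```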

@oozie
Contributor

oozie commented Mar 25, 2019

Thanks for the meeting on Friday, @tcdowney and @selzoc.

Suggestions applied. Before the change, running DRATs resulted in a failure on the first try for both deployments.
With this change we observed 20+ consecutive successes on both normal and small footprint deployments.

@selzoc selzoc merged commit 9ae26a3 into cloudfoundry:develop Mar 25, 2019
MerricdeLauney added a commit to evanfarrar/capi-release that referenced this pull request Mar 15, 2023
Putting the sleep back in after reviewing this [conversation](cloudfoundry#132). It seems to have been added to allow time for route propagation
MerricdeLauney added a commit that referenced this pull request Mar 15, 2023
* Update post-backup-unlock.sh.erb

If the timeout fails then the workers never get started, but monit will eventually restart the web process if the CF install eventually recovers, leaving a VM that is half working (with an unhealthy bosh state) after the script runs.

We could also change the exit behavior of the timeout with `set +x` (or is it e? I forget), but it would seem that the only point of timing out is to alert the operator to a possible issue, since the CC API can still restart.

* Update post-backup-unlock.sh.erb

Putting the sleep back in after reviewing this [conversation](#132). It seems to have been added to allow time for route propagation

---------

Co-authored-by: MerricdeLauney <[email protected]>
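
On the aside in the first commit message above: `set +x` only turns off command tracing; the option that lets a script running under `set -e` survive a failed `timeout` is `set +e`. A minimal sketch with illustrative names and values:

```bash
#!/usr/bin/env bash
set -e

# Tolerate a failed health-check timeout so the worker processes still get
# started; monit will recover the web process later if CF comes back.
set +e
timeout 60 bash -c 'until curl -sf http://localhost:9022/v2/info > /dev/null; do sleep 1; done'
health_check_status=$?
set -e

if [ "${health_check_status}" -ne 0 ]; then
  echo "WARNING: Cloud Controller did not become healthy within 60s; continuing anyway" >&2
fi

# Start the worker processes regardless (job name illustrative).
/var/vcap/bosh/bin/monit start cloud_controller_worker_1
```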