feat: enable draining csb via os signal handling #1096

Closed
wants to merge 9 commits into from

Conversation

nouseforaname
Contributor

Checklist:

  • Have you added or updated tests to validate the changed functionality?
  • Have you added Release Notes in the docs repositories?
  • Have you followed the Conventional Commits specification?

ifindlay-cci and others added 7 commits September 3, 2024 14:02
When running as an app in CF we can rely on the platform to handle TLS setup, but on a VM currently there is no way to have encrypted traffic.

TPCF-26820
it is angry about `.` imports. But we do not mind because this is not
production code.
When a CSB app running in CF is stopped outside of its own
lifecycle (e.g. the Diego cell is redeployed), we have no reliable
way of ensuring that all in-flight Terraform executions can
finish their work and write the resulting tfstate back to the CSB DB.

Diego assumes that an app will shut down gracefully within 10s of receiving
SIGTERM. If it does not, the app receives a SIGKILL and stops
abruptly.

That creates orphaned resources in the underlying IaaS that the CSB
cannot clean up, because it does not have the tfstate for the
Terraform resources that were in flight when it was shut down.

To alleviate this issue, this introduces a graceful shutdown sequence
and lockfiles on disk (to be consumed by a drain script). This makes it
possible to deploy the CSB as a workload on a BOSH instance.

Instead of marking specific service instances as failed, this ensures that the broker
will
- stop accepting new requests
- finish all in-flight TF executions before shutdown.

The drain script can be kept simple by inspecting a folder. If that folder is
empty, it is safe to proceed to stop the CSB.

We also tried a drain script based on inspecting the running processes (e.g. checking
whether a tofu or provider binary is still executing). That seems potentially
unreliable, since time-of-check/time-of-use issues could falsely suggest that
everything is finished (e.g. because we checked right between two invocations
of the provider / tofu binaries).

- fly-by:
  some structs got their fields reordered to improve their memory footprint.
* extended inflight operation test to check deprovision
* removed focus test
* removed failing check for SIGTERM; in this case SIGKILL is sent but not seen in the log
A previous commit fixed an issue with LockFilesExist returning an inverted value. The existing drain wait code depended on this incorrect behaviour.
@nouseforaname nouseforaname marked this pull request as draft September 10, 2024 09:02
@FelisiaM FelisiaM changed the title Feat: enable draining csb via os signal handling feat: enable draining csb via os signal handling Sep 10, 2024
@nouseforaname nouseforaname marked this pull request as ready for review September 10, 2024 09:39
3 participants