
Rolling deployment: cancel+restart can lead to broken app #3019

Closed
stephanme opened this issue Oct 12, 2022 · 7 comments · Fixed by #3072

@stephanme (Contributor)

Issue

If a failing rolling deployment is canceled and a new rolling deployment is triggered immediately afterwards (see example below), then the wrong application web processes are terminated, which leaves the app broken/unavailable even though it was healthy before.

A rolling deployment should not leave the application unavailable when the new app version can't be started.

Context

cf-deployment v21.11.0
capi-release

D047883@VR9NDXJ7JP testapp % cf version
cf version 8.5.0+73aa161.2022-09-12

This is probably also the root cause of CLI issue #2257.

Steps to Reproduce

# current dir contains an index.html file
# step 1
cf push testapp -m 128M -b binary_buildpack -c "python3 -m http.server 8080"
# app is running healthy

# break the app by setting an overly large env var
cf set-env testapp too-large-env $(printf '%*s' 200000 | tr ' ' x)

# rolling restart (won't succeed)
# step 2
cf restart testapp --strategy rolling

# cancel deployment while step 2 is running and immediately restart 
# step 3
cf cancel-deployment testapp && cf restart testapp --strategy rolling

# result: app is unavailable

Current result

App is unavailable because the healthy web process from step 1 was stopped.

By analyzing the CC logs (of the api and scheduler VMs) and reading the CC code:

  • step 1 (initial push) creates the app with process1 running 1 healthy instance (process1 uses the app guid as process guid for historical reasons)
  • step 2 (rolling restart after assigning the bad env var) creates deployment1, which in turn creates process2 (no instance yet)
  • scheduler:cc_deployment_updater > scale_deployment for deployment1 tries to create one instance of process2, which fails and is retried over and over
  • step 3 (cf cancel-deployment testapp && cf restart testapp --strategy rolling):
    • cancel-deployment marks deployment1 as CANCELING
    • restart creates deployment2 and process3 (no instance yet)
  • scheduler:cc_deployment_updater > scale_deployment for deployment2 fails when trying to acquire the app lock
    • it is not yet fully understood why, maybe due to a parallel scale or cancel for deployment1
    • in any case, it doesn't matter why scale_deployment for deployment2 fails -> process3 won't start and cc_deployment_updater retries until canceled
  • scheduler:cc_deployment_updater > cancel_deployment for deployment1
    • finds the wrong prior_web_process = process3; it should have found process1
    • related CC code (see the sketch after this list): process3 != d1.deploying_web_process (= process2) and process3 is newer than process1
    • cancel_deployment stops process1 (the healthy one) and process2 (which it should stop) and keeps process3 (which was never started and will never run) -> app is down
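
For illustration, a minimal sketch of how the prior web process appears to be selected during cancellation, reconstructed from the description above and the snippet discussed further down (variable names are assumptions, not the exact CC code):

    # Assumed current selection logic: every web process except the one the
    # canceled deployment was deploying is a candidate, and the newest
    # candidate wins. With deployment2's process3 already created, process3
    # shadows process1, so process1 (the healthy process) gets destroyed.
    prior_web_process = web_processes.
                        reject { |p| p.guid == deploying_web_process.guid }.
                        max_by(&:created_at)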

Expected result

App remains available even though all rolling deployment attempts fail.

  • cancel_deployment for deployment1 should not delete process1 but only process2 (the new process of deployment1, which never became healthy)
  • (also an acceptable result IMHO): the rolling restart in step 3 could have failed without creating deployment2

Possible Fix

These are just ideas and need more discussion:

  • ensure that there is only one active deployment at a time, i.e. the cf restart testapp --strategy rolling in step 3 should fail fast without creating deployment2, since deployment1 is still in progress (CANCELING)
  • a more elaborate calculation of prior_web_process that can handle multiple active deployments (e.g. the deployment model could store the prior web process when it gets created, see the sketch below); it is unclear whether this can be done in a watertight way if multiple active deployments are allowed
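
A minimal sketch of the second idea, assuming a hypothetical prior_web_process_guid column on the deployment model (column and helper names are assumptions, not the actual CC schema):

    # Hypothetical: remember the currently serving web process when the
    # deployment is created ...
    deployment = DeploymentModel.create(
      app:                    app,
      deploying_web_process:  new_web_process,
      prior_web_process_guid: current_web_process.guid   # hypothetical column
    )

    # ... and on cancellation roll back to exactly that process instead of
    # guessing the prior process by created_at.
    prior_web_process = ProcessModel.find(guid: deployment.prior_web_process_guid)
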
@philippthun (Member)

While the use case of a "deployment train" (i.e. having multiple running deployments in parallel) is totally valid, allowing a CANCELING and a newer DEPLOYING deployment in parallel leads to unexpected and potentially non-deterministic results (as can be seen in the issue description).

From my point of view we should simply disallow this, i.e. as long as a CANCELING deployment is being processed, no further deployment can be created until the previous deployment reaches the CANCELED state.

This is changed in PR #3026.
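
For illustration, a rough sketch of what such a guard could look like when a deployment is created (hypothetical lookup and error handling, not necessarily how PR #3026 implements it):

    # Hypothetical guard: refuse to create a new deployment while another
    # deployment for the same app is still being canceled.
    if DeploymentModel.where(app_guid: app.guid, state: DeploymentModel::CANCELING_STATE).any?
      raise CloudController::Errors::ApiError.new_from_details(
        'UnprocessableEntity',
        'Another deployment for this app is still being canceled'
      )
    end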

@philippthun (Member)

@Gerg - Would this (see the comment above) make sense from your point of view? Or do you think developers expect that a CANCELING and a DEPLOYING deployment can run in parallel?

@Gerg (Member) commented Oct 27, 2022

I'll try to spend some time thinking about what the desired behavior should be here.

Ideally, I'd like to be in a place where cf push isn't blocked by other async operations, whenever possible. We do currently block on builds, so we don't always achieve this objective.

@Gerg (Member) commented Oct 27, 2022

This feels like a bug to me:

  • scheduler:cc_deployment_updater > cancel_deployment for deployment1
    • finds the wrong prior_web_process = process3; it should have found process1
    • related CC code: process3 != d1.deploying_web_process (= process2) and process3 is newer than process1
    • cancel_deployment stops process1 (the healthy one) and process2 (which it should stop) and keeps process3 (which was never started and will never run) -> app is down

Would this be resolved by also checking that the prior web process is not newer than the deploying web process?

Something like:

          prior_web_process = web_processes.
                              reject { |p| p.guid == deploying_web_process.guid }.
                              reject { |p| p.created_at > deploying_web_process.created_at }.
                              max_by(&:created_at)

For context, here is the story that initially implemented this behavior: https://www.pivotaltracker.com/story/show/160208638

@philippthun (Member)

Fixing the logic to determine the prior web process is one thing; the other would be to keep the newer web processes - currently everything except the prior one is destroyed.

But what should happen when both 'active' (canceling + deploying) deployments are processed at the same time, i.e. in a single run of the DeploymentUpdater? This is what could happen here, right? And the order is not defined, is it?

What about the following idea: if there are two 'active' (canceling + deploying) deployments for the same app being processed at the same time (single run of the DeploymentUpdater), the CANCELING deployment is immediately set to CANCELED (reason: SUPERSEDED) and not processed anymore (removed from deployments_to_cancel), and the DEPLOYING deployment is executed and scales the processes accordingly.
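
For illustration, a rough sketch of that idea inside the DeploymentUpdater dispatch (variable names such as deployments_to_scale and the SUPERSEDED reason value are assumptions; see PR #3072 for the actual change):

    # Hypothetical: if an app has both a CANCELING and a newer DEPLOYING
    # deployment in the same updater run, mark the CANCELING one as
    # CANCELED (reason SUPERSEDED) and only process the DEPLOYING one.
    deployments_to_cancel.reject! do |canceling|
      superseded = deployments_to_scale.any? { |d| d.app_guid == canceling.app_guid }
      if superseded
        canceling.update(
          state:         DeploymentModel::CANCELED_STATE,
          status_reason: 'SUPERSEDED'   # hypothetical status_reason value
        )
      end
      superseded
    end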

@philippthun (Member)

Draft PR: #3072

@philippthun (Member)

I've finalized PR #3072. This fixes the problem described in this issue. Please refer to the details in the PR description/commit messages as well as the PR comments (here and here).
