
Rolling deployment: cancel+restart can lead to broken app #3019

Closed
stephanme opened this issue Oct 12, 2022 · 7 comments · Fixed by #3072

@stephanme (Contributor)

Issue

If a failing rolling deployment is canceled and a new rolling deployment is triggered immediately afterwards (see example below), then the wrong application web processes are terminated, which leaves the app broken/unavailable even though it was healthy before.

A rolling deployment should not leave the application unavailable when the new app version can't be started.

Context

cf-deployment v21.11.0
capi-release

D047883@VR9NDXJ7JP testapp % cf version
cf version 8.5.0+73aa161.2022-09-12

This is probably also the root cause of CLI issue #2257.

Steps to Reproduce

# current dir contains an index.html file
# step 1
cf push testapp -m 128M -b binary_buildpack -c "python3 -m http.server 8080"
# app is running healthy

# break the app by setting an overly large env var
cf set-env testapp too-large-env $(printf '%*s' 200000 | tr ' ' x)

# rolling restart (won't succeed)
# step 2
cf restart testapp --strategy rolling

# cancel deployment while step 2 is running and immediately restart 
# step 3
cf cancel-deployment testapp && cf restart testapp --strategy rolling

# result: app is unavailable

Current result

App is unavailable because the healthy web process from step 1 was stopped.

By analyzing the CC logs (of the api and scheduler VMs) and reading the CC code:

  • step 1 (initial push) creates the app with process1 running 1 healthy instance (process1 uses the app guid as process guid for historical reasons)
  • step 2 (rolling restart after assigning the bad env var) creates deployment1, which in turn creates process2 (no instance yet)
  • scheduler:cc_deployment_updater > scale_deployment for deployment1 tries to create one instance of process2, which fails and is retried over and over
  • step 3 (cf cancel-deployment testapp && cf restart testapp --strategy rolling):
    • cancel-deployment marks deployment1 as CANCELING
    • restart creates deployment2 and process3 (no instance yet)
  • scheduler:cc_deployment_updater > scale_deployment for deployment2 fails when trying to acquire the app lock
    • it is not yet fully understood why, maybe due to a parallel scale or cancel for deployment1
    • in any case, it doesn't matter why scale_deployment for deployment2 fails -> process3 won't start and cc_deployment_updater retries until canceled
  • scheduler:cc_deployment_updater > cancel_deployment for deployment1
    • finds the wrong prior_web_process = process3; it should have found process1
    • related CC code (see the sketch after this list): process3 != d1.deploying_web_process (= process2) and process3 is newer than process1
    • cancel_deployment stops process1 (the healthy one) and process2 (which it should stop) and keeps process3 (which was never started and will never run) -> app is down
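
For illustration, a minimal sketch of how the prior web process appears to be selected during cancellation, reconstructed from the description above and the snippet discussed further down (variable names are assumptions, not the exact CC code):

    # Assumed current selection logic: every web process except the one the
    # canceled deployment was deploying is a candidate, and the newest
    # candidate wins. With deployment2's process3 already created, process3
    # shadows process1, so process1 (the healthy process) gets destroyed.
    prior_web_process = web_processes.
                        reject { |p| p.guid == deploying_web_process.guid }.
                        max_by(&:created_at)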

Expected result

App remains available even though all rolling deployment attempts fail.

  • cancel_deployment for deployment1 should not delete process1 but only process2 (the new process of deployment1, which never became healthy)
  • (also an acceptable result IMHO): the rolling restart in step 3 could have failed without creating deployment2

Possible Fix

These are just ideas and need more discussion:

  • ensure that there is only one active deployment at a time, i.e. the cf restart testapp --strategy rolling in step 3 should fail fast without creating deployment2, since deployment1 is still in progress (CANCELING)
  • a more elaborate calculation of prior_web_process that can handle multiple active deployments (e.g. the deployment model could store the prior web process when it gets created, see the sketch below); it is unclear whether this can be done in a watertight way if multiple active deployments are allowed
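
A minimal sketch of the second idea, assuming a hypothetical prior_web_process_guid column on the deployment model (column and helper names are assumptions, not the actual CC schema):

    # Hypothetical: remember the currently serving web process when the
    # deployment is created ...
    deployment = DeploymentModel.create(
      app:                    app,
      deploying_web_process:  new_web_process,
      prior_web_process_guid: current_web_process.guid   # hypothetical column
    )

    # ... and on cancellation roll back to exactly that process instead of
    # guessing the prior process by created_at.
    prior_web_process = ProcessModel.find(guid: deployment.prior_web_process_guid)
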
@philippthun (Member)

While the use case of a "deployment train" (i.e. having multiple running deployments in parallel) is totally valid, allowing a CANCELING and a newer DEPLOYING deployment in parallel leads to unexpected and potentially non-deterministic results (as can be seen in the issue description).

From my point of view we should simply disallow this, i.e. as long as a CANCELING deployment is being processed, no further deployment can be created until the previous deployment reaches the CANCELED state.

This is changed in PR #3026.
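
For illustration, a rough sketch of what such a guard could look like when a deployment is created (hypothetical lookup and error handling, not necessarily how PR #3026 implements it):

    # Hypothetical guard: refuse to create a new deployment while another
    # deployment for the same app is still being canceled.
    if DeploymentModel.where(app_guid: app.guid, state: DeploymentModel::CANCELING_STATE).any?
      raise CloudController::Errors::ApiError.new_from_details(
        'UnprocessableEntity',
        'Another deployment for this app is still being canceled'
      )
    end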

@philippthun (Member)

@Gerg - Would this (see the comment above) make sense from your point of view? Or do you think developers expect that a CANCELING and a DEPLOYING deployment can run in parallel?

@Gerg (Member) commented Oct 27, 2022

I'll try to spend some time thinking about what the desired behavior should be here.

Ideally, I'd like to be in a place where cf push isn't blocked by other async operations, whenever possible. We do currently block on builds, so we don't always achieve this objective.

@Gerg (Member) commented Oct 27, 2022

This feels like a bug to me:

  • scheduler:cc_deployment_updater > cancel_deployment for deployment1
    • finds the wrong prior_web_process = process3; it should have found process1
    • related CC code: process3 != d1.deploying_web_process (= process2) and process3 is newer than process1
    • cancel_deployment stops process1 (the healthy one) and process2 (which it should stop) and keeps process3 (which was never started and will never run) -> app is down

Would this be resolved by also checking that the prior web process is not newer than the deploying web process?

Something like:

          prior_web_process = web_processes.
                              reject { |p| p.guid == deploying_web_process.guid }.
                              reject { |p| p.created_at > deploying_web_process.created_at }.
                              max_by(&:created_at)

For context, here is the story that initially implemented this behavior: https://www.pivotaltracker.com/story/show/160208638

@philippthun (Member)

Fixing the logic to determine the prior web process is one thing; the other would be to keep the newer web processes - currently everything except the prior one is destroyed.

But what should happen when both 'active' (canceling + deploying) deployments are processed at the same time, i.e. in a single run of the DeploymentUpdater? This is what could happen here, right? And the order is not defined, is it?

What about the following idea: if there are two 'active' (canceling + deploying) deployments for the same app being processed at the same time (single run of the DeploymentUpdater), the CANCELING deployment is immediately set to CANCELED (reason: SUPERSEDED) and not processed anymore (removed from deployments_to_cancel), and the DEPLOYING deployment is executed and scales the processes accordingly.
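
For illustration, a rough sketch of that idea inside the DeploymentUpdater dispatch (variable names such as deployments_to_scale and the SUPERSEDED reason value are assumptions; see PR #3072 for the actual change):

    # Hypothetical: if an app has both a CANCELING and a newer DEPLOYING
    # deployment in the same updater run, mark the CANCELING one as
    # CANCELED (reason SUPERSEDED) and only process the DEPLOYING one.
    deployments_to_cancel.reject! do |canceling|
      superseded = deployments_to_scale.any? { |d| d.app_guid == canceling.app_guid }
      if superseded
        canceling.update(
          state:         DeploymentModel::CANCELED_STATE,
          status_reason: 'SUPERSEDED'   # hypothetical status_reason value
        )
      end
      superseded
    end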

@philippthun (Member)

Draft PR: #3072

@philippthun (Member)

I've finalized PR #3072. This fixes the problem described in this issue. Please refer to the details in the PR description/commit messages as well as the PR comments (here and here).
