Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calling StopWorkspace on a workspace that is not fully up yet #11453

Closed
sagor999 opened this issue Jul 18, 2022 · 18 comments · Fixed by #12284 or #12386
Closed

Calling StopWorkspace on a workspace that is not fully up yet #11453

sagor999 opened this issue Jul 18, 2022 · 18 comments · Fixed by #12284 or #12386
Assignees
Labels
type: bug Something isn't working

Comments

@sagor999
Copy link
Contributor

sagor999 commented Jul 18, 2022

Bug description

Currently if calling StopWorkspace on a workspace that has not fully started up yet will cause that workspace state to become failed.
We need to better handle this case. This is currently used by webapp to stop already running prebuild if newer prebuild has started (commit is obsolete, etc).

It is negatively impacting our workspace success rate (~5-10% depending on prebuild volume).

Steps to reproduce

https://gitpod.slack.com/archives/C02EN94AEPL/p1657828853695599

Workspace affected

No response

Expected behavior

conditions object is a child of status column in d_b_workspace_instance table.

On the workspace side, for ws-manager:
Add a wasCancelled attribute to the conditions object, and inspect the field when setting the phase on stop workspace.

On the webapp side, for server:
If the wasCancelled is true, set the phase to stopped (instead of failed). For workspace start, we then have to make sure we do not try and start a new workspace instance from a prior workspace instance where its conditions have wasCancelled set to true.

Example repository

No response

Anything else?

No response

@sagor999 sagor999 added the type: bug Something isn't working label Jul 18, 2022
@sagor999 sagor999 moved this to Scheduled in 🌌 Workspace Team Jul 18, 2022
@kylos101
Copy link
Contributor

@sagor999 is this issue for prebuilds, or are you considering this to also include regular workspaces and image builds?

I'm going to assume prebuilds (at least for now).

It looks like we'd want the phase to be stopped, and failed to be considered false [1][2]. But that might be naive. For example, if we indicate the prebuild stopped and did not fail, that could signal that there is a valid backup to create a new workspace from (given the prebuild indicates it was successful) - and there may not be depending on how far the prebuild got.

We should get together later this week (after the deploy of gen55) to sort out a couple options/ideas.

@sagor999
Copy link
Contributor Author

@kylos101 yes, for prebuilds.

Good point about phase. 🤔
@geropl what state should be set on those prebuilds that were aborted? 🤔 I don't think we have "aborted" state and not sure if we want to add another state just for this case (seems like overkill?).
Want to ensure as Kyle mentioned that we don't attempt to use such prebuild later on.

@kylos101 kylos101 moved this from Scheduled to Backlog in 🌌 Workspace Team Aug 1, 2022
@kylos101
Copy link
Contributor

kylos101 commented Aug 5, 2022

@geropl Would it make sense to add a Cancelled phase? For example, to handle prebuilds which are interrupted because a newer commit triggered a webhook and subsequent prebuild.

@geropl
Copy link
Member

geropl commented Aug 8, 2022

@kylos101 I think closet to your current model - where we have stopped & condition.failed - would be a aborted/cancelled condition: we would have the same assumptions as for the former, but would clearly signal it did not fail, but got aborted.

About the "has valid backup" signal: Not sure it's out-of-scope here, or irrelevant with the PVC effort, but it would be awesome to have a clear, separate signal for that. A condition hasBackup, for instance.

@kylos101 kylos101 moved this from Breakdown to Scheduled in 🌌 Workspace Team Aug 11, 2022
@kylos101
Copy link
Contributor

@geropl we've updated the expected behavior and scheduled this, thank you for your input! 👍

@sagor999
Copy link
Contributor Author

@geropl would like to pick your brain in terms of implementation details of this from webapp perspective.
Currently we have this:

// StopWorkspaceRequest requests that the workspace manager stops a workspace
message StopWorkspaceRequest {
// ID is the unique identifier of the workspace to stop
string id = 1;
// Policy determines how quickly a workspace will be stopped
StopWorkspacePolicy policy = 2;
}

I would suggest to add additional StopWorkspacePolicy:
ABORT = 2

This policy would be similar to IMMEDIATELY, plus it would tell ws-manager that we don't want to attempt to run any finalizer steps on workspace, we don't care about backing up any data from that workspace, we just want to shutdown workspace as fast as possible.

You would then need to specify this policy when you are cancelling already running prebuild if there is a newer prebuild running already.

I will also add a condition to the workspace that would show that workspace was aborted\cancelled, so that UI wise we can signal this to the user. So workspace would be marked as failed, and will have wasAborted condition set to true as well. Why failed? To ensure that any existing code will filter that workspace out and will not attempt to use it to restore anything from it.

message WorkspaceConditions {

Will add this into that struct:

    // wasAborted is true if workspace was requested to stop and not save any data from it
    WorkspaceConditionBool wasAborted = 13;

WDYT?

@sagor999 sagor999 self-assigned this Aug 17, 2022
@geropl
Copy link
Member

geropl commented Aug 19, 2022

@sagor999 That sounds perfect, thank you! 💯 From that we can derive all states clearly, plus are backwards compatible. 🧘

☁️ One small nit/some 🚲 shedding: from the previous values I'd expect the condition to be called aborted. If you reference the StopWorkspacePolicy in the comment the meaning is crystal-clear I think.

@sagor999
Copy link
Contributor Author

@geropl I did my part in PR #12284
how do you want to proceed with webapp portion? do you want to add as a commit into my PR? so that we can test it as part of one PR? Or do you want to do it as a separate PR?

@kylos101 kylos101 moved this from Scheduled to In Progress in 🌌 Workspace Team Aug 23, 2022
@kylos101
Copy link
Contributor

@sagor999 moving from scheduled to in-progress as this has a draft PR 🙏

@geropl
Copy link
Member

geropl commented Aug 23, 2022

@sagor999 As far as I understood your PR is independent and backwards compatible. Do you agree?
If yes let's merge that PR, and we can start developing based on the types in the API package. 👍

@sagor999
Copy link
Contributor Author

@geropl yes, it is. sounds good, I will mark that PR as ready for review then.

Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Aug 23, 2022
@sagor999
Copy link
Contributor Author

re-open as this requires a fix from webapp team now.
adding to webapp inbox as well.

@kylos101
Copy link
Contributor

kylos101 commented Aug 23, 2022

@sagor999 what happens if webapp ships their change before #12284 lands? I ask because #12284 may not make it into gen63, as @jenting already started testing. Should we give webapp a heads up, to avoid scheduling and shipping until #12284 is live in workspace clusters (probably gen64)?

@sagor999
Copy link
Contributor Author

@kylos101 very good point. We do indeed need to wait for #12284 to be deployed first.

@kylos101
Copy link
Contributor

@kylos101 very good point. We do indeed need to wait for #12284 to be deployed first.

@geropl for awareness

@jenting
Copy link
Contributor

jenting commented Aug 24, 2022

I'm not sure if this issue is the right place for sharing.

But we can see that even running the loadgen on an ephemeral cluster, and we saw the backup failures per second.

image

@svenefftinge svenefftinge moved this to Scheduled in 🍎 WebApp Team Aug 25, 2022
@svenefftinge svenefftinge self-assigned this Aug 25, 2022
@svenefftinge svenefftinge moved this from Scheduled to In Progress in 🍎 WebApp Team Aug 25, 2022
@kylos101
Copy link
Contributor

@sagor999 I assume the thought is to leave this in Awaiting Deployment, until the webapp change ships, and then move this to in-validation?

@sagor999
Copy link
Contributor Author

@kylos101 yes.

@svenefftinge svenefftinge removed their assignment Aug 30, 2022
Repository owner moved this from In Progress to Done in 🍎 WebApp Team Sep 6, 2022
@jenting jenting moved this from Awaiting Deployment to In Validation in 🌌 Workspace Team Sep 8, 2022
@sagor999 sagor999 moved this from In Validation to Done in 🌌 Workspace Team Sep 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working
Projects
Status: Done
Status: Done
5 participants