
server: please log any call to StopWorkspace with the reason why it is called #12283

Closed
sagor999 opened this issue Aug 22, 2022 · 7 comments · Fixed by #12963
Assignees: jankeromnes
Labels: component: server, size/XS, type: improvement (Improves an existing feature or existing code)

Comments

@sagor999
Contributor

Is your feature request related to a problem? Please describe

On the workspace side, we only see that StopWorkspace is being called. But looking at the webapp logs, there is nothing that explains why StopWorkspace was called.

Describe the behaviour you'd like

Can you please log a reason whenever server calls StopWorkspace, to ease debugging?
For example, whether the user requested the stop, or whether some internal server logic decided to call it (like cancelling a prebuild because there is a newer commit).

Describe alternatives you've considered

Additional context
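
For illustration only, something roughly like the following at each call site would already be enough. This is a hypothetical sketch (the reason string is made up), reusing the log.info / logCtx pattern server already uses:

// hypothetical sketch, not existing server code
const logCtx = { userId: user.id, workspaceId };
log.info({ ...logCtx, reason: "prebuild cancelled because a newer commit was pushed" }, "stopping workspace");
await this.internalStopWorkspace(ctx, workspace);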

@kylos101
Contributor

@sagor999 wdyt of adding traces too?

We might even be able to ensure that this particular trace is sampled at 100%. @mads-hartmann is it possible for us to configure certain traces such that they are always sampled, instead of at the default rate (I think the default for traces is 5 or 10%)?

@mads-hartmann
Contributor

Ideally you would add both a log line and attach the information to the relevant span, as the two kinds of telemetry have different strengths depending on what questions you're trying to answer. For "system auditing" like this, where we want to know why the system took a specific action against a specific workspace, logs are the best approach. If, on the other hand, you want to look at it more holistically, e.g. get a sense of the distribution of "reasons" for stopping workspaces, compute error rates, filter by cluster, etc., then the tooling we have for traces is much better; in that kind of analysis you don't care about individual workspaces or requests, but about the system as a whole.

I tried to capture some of this in "When to use what telemetry", but I haven't revised it since it was originally written, so please comment if things are not clear 🧡

Ideally we wouldn't have to worry about logs vs. traces, as a span is really just a structured log event with a fixed schema that captures the hierarchy of operations. I will create an issue for a PoC that would allow us to set an attribute on a span (e.g. audit_log) which ensures that such spans are always sampled. With that in place we would be able to use spans for audit-like logs, but I'm not sure it's the right approach, as I still think log lines are required for our self-hosted customers to understand why the system is behaving the way it is ☺️
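
To make the PoC idea concrete, here is a minimal sketch of such a sampler, assuming an OpenTelemetry-style SDK (the attribute name audit_log and the 10% fallback rate are placeholders, not our actual tracing configuration):

import { Attributes, Context, Link, SpanKind } from "@opentelemetry/api";
import {
    Sampler,
    SamplingDecision,
    SamplingResult,
    TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

// Always sample spans that carry an audit_log attribute at creation time;
// otherwise fall back to ratio-based sampling.
export class AuditAwareSampler implements Sampler {
    private readonly fallback = new TraceIdRatioBasedSampler(0.1); // placeholder default rate

    shouldSample(
        context: Context,
        traceId: string,
        spanName: string,
        spanKind: SpanKind,
        attributes: Attributes,
        links: Link[],
    ): SamplingResult {
        if (attributes["audit_log"] === true) {
            return { decision: SamplingDecision.RECORD_AND_SAMPLED };
        }
        return this.fallback.shouldSample(context, traceId, spanName, spanKind, attributes, links);
    }

    toString(): string {
        return "AuditAwareSampler";
    }
}

One caveat for using spans as audit logs: the attribute has to be set when the span is started, otherwise the sampler never sees it.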

@geropl geropl moved this to Scheduled in 🍎 WebApp Team Sep 9, 2022
@geropl geropl removed the status in 🍎 WebApp Team Sep 12, 2022
@geropl geropl moved this to Scheduled in 🍎 WebApp Team Sep 12, 2022
@jankeromnes jankeromnes self-assigned this Sep 12, 2022
@easyCZ
Member

easyCZ commented Sep 12, 2022

So, taking a higher-level view, there's no reason we wouldn't attach the stop reason as a first-class property on the request, but also on the underlying WorkspaceInstance representation (on both the WebApp side and the ws-manager side). That way, the reason can be consumed by both sides, and presumably both have a strong interest in this for SLOs: a customer-triggered stop is fine, but if the system triggers a stop for other reasons, that could be an indicator, or perhaps you ran out of credits and we're forcibly stopping your workspaces. In either case, both parties should know to take the appropriate action.

Something like

{
    // ...
    stopRequested: boolean;
    stopReason: string;        // system-readable reason
    stopDescription: string;   // human-readable reason, e.g. "You've run out of credits, please go and purchase more."
}

The benefit of this is that you make it part of your domain model, which significantly eases debugging and makes it possible to surface the reasons to the user.
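
For example, a call site could then look roughly like this (hypothetical sketch: internalStopWorkspace doesn't take such an argument today, and the field names simply mirror the shape above):

// hypothetical call-site sketch, mirroring the proposed shape
const stopReason = "SYSTEM_CANCELLED_BRANCH_PREBUILD";
const stopDescription = "A newer commit was pushed to the same branch.";

log.info({ userId: user.id, workspaceId: workspace.id, stopReason }, stopDescription);
await this.internalStopWorkspace(ctx, workspace, {
    stopRequested: true,
    stopReason,
    stopDescription,
});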

@sagor999
Contributor Author

@easyCZ great idea!

@jankeromnes
Contributor

jankeromnes commented Sep 13, 2022

For reference, here are all the locations I could find where we issue a StopWorkspaceRequest:

When stopping a workspace via the API

public async stopWorkspace(ctx: TraceContext, workspaceId: string): Promise<void> {
    traceAPIParams(ctx, { workspaceId });
    traceWI(ctx, { workspaceId });
    const user = this.checkUser("stopWorkspace", undefined, { workspaceId });
    const logCtx = { userId: user.id, workspaceId };
    const workspace = await this.internalGetWorkspace(workspaceId, this.workspaceDb.trace(ctx));
    if (workspace.type === "prebuild") {
        // If this is a team prebuild, any team member can stop it.
        const teamMembers = await this.getTeamMembersByProject(workspace.projectId);
        await this.guardAccess({ kind: "workspace", subject: workspace, teamMembers }, "get");
    } else {
        // If this is not a prebuild, or it's a personal prebuild, only the workspace owner can stop it.
        await this.guardAccess({ kind: "workspace", subject: workspace }, "get");
    }
    this.internalStopWorkspace(ctx, workspace).catch((err) => {
        log.error(logCtx, "stopWorkspace error: ", err);
    });
}

When deleting a workspace via the API

public async deleteWorkspace(ctx: TraceContext, workspaceId: string): Promise<void> {
    traceAPIParams(ctx, { workspaceId });
    traceWI(ctx, { workspaceId });
    this.checkAndBlockUser("deleteWorkspace");
    const ws = await this.internalGetWorkspace(workspaceId, this.workspaceDb.trace(ctx));
    await this.guardAccess({ kind: "workspace", subject: ws }, "delete");
    // for good measure, try and stop running instances
    await this.internalStopWorkspace(ctx, ws);
    // actually delete the workspace
    await this.workspaceDeletionService.softDeleteWorkspace(ctx, ws, "user");
}

When cancelling a prebuild via the API

async cancelPrebuild(ctx: TraceContext, projectId: string, prebuildId: string): Promise<void> {
    traceAPIParams(ctx, { projectId, prebuildId });
    const user = this.checkAndBlockUser("cancelPrebuild");
    const project = await this.projectsService.getProject(projectId);
    if (!project) {
        throw new ResponseError(ErrorCodes.NOT_FOUND, "Project not found");
    }
    await this.guardProjectOperation(user, projectId, "update");
    const prebuild = await this.workspaceDb.trace(ctx).findPrebuildByID(prebuildId);
    if (!prebuild) {
        throw new ResponseError(ErrorCodes.NOT_FOUND, "Prebuild not found");
    }
    // Explicitly stopping the prebuild workspace now automatically cancels the prebuild
    await this.stopWorkspace(ctx, prebuild.buildWorkspaceId);
}

When an admin force-stops a workspace from the admin panel

async adminForceStopWorkspace(ctx: TraceContext, workspaceId: string): Promise<void> {
    traceAPIParams(ctx, { workspaceId });
    await this.requireEELicense(Feature.FeatureAdminDashboard);
    await this.guardAdminAccess("adminForceStopWorkspace", { id: workspaceId }, Permission.ADMIN_WORKSPACES);
    const workspace = await this.workspaceDb.trace(ctx).findById(workspaceId);
    if (workspace) {
        await this.internalStopWorkspace(ctx, workspace, StopWorkspacePolicy.IMMEDIATELY, true);
    }
}

When we start a prebuild on a branch, we cancel all currently running prebuilds for that branch

async abortPrebuildsForBranch(ctx: TraceContext, project: Project, user: User, branch: string): Promise<void> {
    const span = TraceContext.startSpan("abortPrebuildsForBranch", ctx);
    try {
        const prebuilds = await this.workspaceDB
            .trace({ span })
            .findActivePrebuiltWorkspacesByBranch(project.id, branch);
        const results: Promise<any>[] = [];
        for (const prebuild of prebuilds) {
            try {
                for (const instance of prebuild.instances) {
                    log.info(
                        { userId: user.id, instanceId: instance.id, workspaceId: instance.workspaceId },
                        "Aborting Prebuild workspace because a newer commit was pushed to the same branch.",
                    );
                    results.push(
                        this.workspaceStarter.stopWorkspaceInstance(
                            { span },
                            instance.id,
                            instance.region,
                            StopWorkspacePolicy.ABORT,
                        ),
                    );
                }
                prebuild.prebuild.state = "aborted";
                prebuild.prebuild.error = "A newer commit was pushed to the same branch.";
                results.push(this.workspaceDB.trace({ span }).storePrebuiltWorkspace(prebuild.prebuild));
            } catch (err) {
                error("Cannot cancel prebuild", { prebuildID: prebuild.prebuild.id }, err);
            }
        }
        await Promise.all(results);
    } finally {
        span.finish();
    }
}

When an admin blocks a user, we stop all their workspaces

async adminBlockUser(ctx: TraceContext, req: AdminBlockUserRequest): Promise<User> {
    traceAPIParams(ctx, { req });
    await this.requireEELicense(Feature.FeatureAdminDashboard);
    await this.guardAdminAccess("adminBlockUser", { req }, Permission.ADMIN_USERS);
    let targetUser;
    try {
        targetUser = await this.userService.blockUser(req.id, req.blocked);
    } catch (error) {
        throw new ResponseError(ErrorCodes.NOT_FOUND, "not found");
    }
    const workspaceDb = this.workspaceDb.trace(ctx);
    const workspaces = await workspaceDb.findWorkspacesByUser(req.id);
    const isDefined = <T>(x: T | undefined): x is T => x !== undefined;
    (await Promise.all(workspaces.map((workspace) => workspaceDb.findRunningInstance(workspace.id))))
        .filter(isDefined)
        .forEach((instance) =>
            this.workspaceStarter.stopWorkspaceInstance(
                ctx,
                instance.id,
                instance.region,
                StopWorkspacePolicy.IMMEDIATELY,
            ),
        );
    // For some reason, returning the result of `this.userDB.storeUser(target)` does not work. The response never arrives the caller.
    // Returning `target` instead (which should be equivalent).
    return this.censorUser(targetUser);
}

When a user deletes their account, we stop all their workspaces (and then delete them afterwards)

async deleteUser(id: string): Promise<void> {
    const user = await this.db.findUserById(id);
    if (!user) {
        throw new Error(`No user with id ${id} found!`);
    }
    if (user.markedDeleted === true) {
        log.debug({ userId: id }, "Is deleted but markDeleted already set. Continuing.");
    }
    // Stop all workspaces
    await this.stopWorkspaces(user);
    // Auth Providers
    const authProviders = await this.authProviderService.getAuthProvidersOfUser(user);
    for (const provider of authProviders) {
        try {
            await this.authProviderService.deleteAuthProvider(provider);
        } catch (error) {
            log.error({ userId: id }, "Failed to delete user's auth provider.", error);
        }
    }
    // User
    await this.db.transaction(async (db) => {
        this.anonymizeUser(user);
        this.deleteIdentities(user);
        await this.deleteTokens(db, user);
        user.lastVerificationTime = undefined;
        user.markedDeleted = true;
        await db.storeUser(user);
    });
    await Promise.all([
        // Workspace
        this.anonymizeAllWorkspaces(id),
        // UserStorageResourcesDB
        this.userStorageResourcesDb.deleteAllForUser(user.id),
        // Bucket
        this.deleteUserBucket(id),
        // Teams owned only by this user
        this.deleteSoleOwnedTeams(id),
        // Team memberships
        this.deleteTeamMemberships(id),
        // User projects
        this.deleteUserProjects(id),
    ]);
    // Track the deletion Event for Analytics Purposes
    this.analytics.track({
        userId: user.id,
        event: "deletion",
        properties: {
            deleted_at: new Date().toISOString(),
        },
    });
    this.analytics.identify({
        userId: user.id,
        traits: {
            github_slug: "deleted-user",
            gitlab_slug: "deleted-user",
            bitbucket_slug: "deleted-user",
            email: "deleted-user",
            full_name: "deleted-user",
            name: "deleted-user",
        },
    });
}

When a user runs out of credits (creditAlert) -- although I'm not sure if this is still used? 🤔

protected listenToCreditAlerts(): void {
    if (!this.user || !this.client) {
        return;
    }
    this.disposables.push(
        this.localMessageBroker.listenToCreditAlerts(
            this.user.id,
            (ctx: TraceContext, creditAlert: CreditAlert) => {
                TraceContext.withSpan(
                    "forwardCreditAlertToClient",
                    async (ctx) => {
                        traceClientMetadata(ctx, this.clientMetadata);
                        TraceContext.setJsonRPCMetadata(ctx, "onCreditAlert");
                        this.client?.onCreditAlert(creditAlert);
                        if (creditAlert.remainingUsageHours < 1e-6) {
                            const runningInstances = await this.workspaceDb
                                .trace(ctx)
                                .findRegularRunningInstances(creditAlert.userId);
                            runningInstances.forEach(
                                async (instance) => await this.stopWorkspace(ctx, instance.workspaceId),
                            );
                        }
                    },
                    ctx,
                );
            },
        ),
    );
}

@jankeromnes
Contributor

Thanks for the discussion & great ideas here!

In #12906 I've added a reason field to every StopWorkspaceRequest we issue, of type StopWorkspaceReason (a new enum).

  • Does this look useful?
  • Are the specified reasons helpful?
  • Should this reason also be logged and/or added to tracing?

@jankeromnes
Contributor

Thanks again for the StopWorkspaceReason idea @easyCZ!

I gave this a first shot in #12906, and started with the following reasons:

enum StopWorkspaceReason {
    // The user stopped the workspace
    USER_STOPPED = 0;

    // The user deleted the workspace
    USER_DELETED = 1;

    // The user cancelled the running prebuild
    USER_CANCELLED_PREBUILD = 2;

    // An admin force-stopped the workspace
    ADMIN_STOPPED = 3;

    // An admin blocked the user
    ADMIN_BLOCKED = 4;

    // A new commit was pushed to a repository branch, so we cancelled all previously running prebuilds for that branch
    SYSTEM_CANCELLED_BRANCH_PREBUILD = 5;

    // The user exceeded their allowed usage limit (e.g. number of hours in month), so we stopped the workspace
    SYSTEM_USER_LIMIT_REACHED = 6;
}
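
To connect this to the stopDescription idea above, a reason-to-description mapping could look roughly like this (hypothetical sketch; it assumes a generated TypeScript StopWorkspaceReason enum, and the wording of each description is just a placeholder):

// hypothetical mapping from system-readable reason to human-readable description
const stopDescriptions: Record<StopWorkspaceReason, string> = {
    [StopWorkspaceReason.USER_STOPPED]: "You stopped this workspace.",
    [StopWorkspaceReason.USER_DELETED]: "You deleted this workspace.",
    [StopWorkspaceReason.USER_CANCELLED_PREBUILD]: "You cancelled this prebuild.",
    [StopWorkspaceReason.ADMIN_STOPPED]: "An administrator stopped this workspace.",
    [StopWorkspaceReason.ADMIN_BLOCKED]: "Your account was blocked by an administrator.",
    [StopWorkspaceReason.SYSTEM_CANCELLED_BRANCH_PREBUILD]: "A newer commit was pushed to the same branch.",
    [StopWorkspaceReason.SYSTEM_USER_LIMIT_REACHED]: "You've reached your usage limit.",
};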

However, after challenging this a bit, it still feels underspecified, so more discussion & alignment is likely needed before we can actually make such a change to the high-stakes StopWorkspace protocol.

Meanwhile (and possibly in parallel to the discussion above), maybe let's re-focus on the actual problem at hand, which was a lack of logging/tracing to help debug why workspaces are being stopped.

For that I can easily add logs & traces in a separate PR, given that we already know all the call sites.
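
As a rough sketch of what that could look like at one of the call sites above, e.g. the admin force-stop (the reason string and the span tag name are illustrative only, not final):

// illustrative only; reason string and tag name are placeholders
const reason = "admin force-stopped the workspace";
log.info({ workspaceId, reason }, "stopping workspace");
ctx.span?.setTag("stopWorkspace.reason", reason); // assumes an opentracing-style span on the TraceContext
await this.internalStopWorkspace(ctx, workspace, StopWorkspacePolicy.IMMEDIATELY, true);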
