
server: please log any call to StopWorkspace with the reason why it is called #12283

Closed
sagor999 opened this issue Aug 22, 2022 · 7 comments · Fixed by #12963
Assignees: jankeromnes
Labels: component: server, size/XS, type: improvement (Improves an existing feature or existing code)

Comments

@sagor999
Contributor

Is your feature request related to a problem? Please describe

On the workspace side, we only see that StopWorkspace is being called. But looking at the webapp logs, there is nothing that explains why StopWorkspace was called.

Describe the behaviour you'd like

Can you please log a reason whenever server calls StopWorkspace, to ease debugging?
For example, whether the user requested the stop, or whether some internal server logic decided to call it (like cancelling a prebuild because there is a newer commit).

Describe alternatives you've considered

Additional context
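
For illustration only, something roughly like the following at each call site would already be enough. This is a hypothetical sketch (the reason string is made up), reusing the log.info / logCtx pattern server already uses:

// hypothetical sketch, not existing server code
const logCtx = { userId: user.id, workspaceId };
log.info({ ...logCtx, reason: "prebuild cancelled because a newer commit was pushed" }, "stopping workspace");
await this.internalStopWorkspace(ctx, workspace);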

@kylos101
Contributor

@sagor999 wdyt of adding traces too?

We might even be able to ensure that this particular trace is sampled at 100%. @mads-hartmann is it possible for us to configure certain traces such that they are always sampled, instead of at the default rate (I think the default for traces is 5 or 10%)?

@mads-hartmann
Contributor

Ideally you would add both a log line and attach the information to the relevant span, as the two kinds of telemetry have different strengths depending on what questions you're trying to answer. For "system auditing" like this, where we want to know why the system took a specific action against a specific workspace, logs are the best approach. If, on the other hand, you want to look at it more holistically, e.g. get a sense of the distribution of "reasons" for stopping workspaces, compute error rates, filter by cluster, etc., then the tooling we have for traces is much better; in that kind of analysis you don't care about individual workspaces or requests, but about the system as a whole.

I tried to capture some of this in "When to use what telemetry", but I haven't revised it since it was originally written, so please comment if things are not clear 🧡

Ideally we wouldn't have to worry about logs vs. traces, as a span is really just a structured log event with a fixed schema that captures the hierarchy of operations. I will create an issue for a PoC that would allow us to set an attribute on a span (e.g. audit_log) which ensures that such spans are always sampled. With that in place we would be able to use spans for audit-like logs, but I'm not sure it's the right approach, as I still think log lines are required for our self-hosted customers to understand why the system is behaving the way it is ☺️
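
To make the PoC idea concrete, here is a minimal sketch of such a sampler, assuming an OpenTelemetry-style SDK (the attribute name audit_log and the 10% fallback rate are placeholders, not our actual tracing configuration):

import { Attributes, Context, Link, SpanKind } from "@opentelemetry/api";
import {
    Sampler,
    SamplingDecision,
    SamplingResult,
    TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

// Always sample spans that carry an audit_log attribute at creation time;
// otherwise fall back to ratio-based sampling.
export class AuditAwareSampler implements Sampler {
    private readonly fallback = new TraceIdRatioBasedSampler(0.1); // placeholder default rate

    shouldSample(
        context: Context,
        traceId: string,
        spanName: string,
        spanKind: SpanKind,
        attributes: Attributes,
        links: Link[],
    ): SamplingResult {
        if (attributes["audit_log"] === true) {
            return { decision: SamplingDecision.RECORD_AND_SAMPLED };
        }
        return this.fallback.shouldSample(context, traceId, spanName, spanKind, attributes, links);
    }

    toString(): string {
        return "AuditAwareSampler";
    }
}

One caveat for using spans as audit logs: the attribute has to be set when the span is started, otherwise the sampler never sees it.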

@geropl geropl moved this to Scheduled in 🍎 WebApp Team Sep 9, 2022
@geropl geropl removed the status in 🍎 WebApp Team Sep 12, 2022
@geropl geropl moved this to Scheduled in 🍎 WebApp Team Sep 12, 2022
@jankeromnes jankeromnes self-assigned this Sep 12, 2022
@easyCZ
Member

easyCZ commented Sep 12, 2022

So, taking a higher-level view, there's no reason we wouldn't attach the stop reason as a first-class property on the request, but also on the underlying WorkspaceInstance representation (on both the WebApp side and the ws-manager side). That way, the reason can be consumed by both sides, and presumably both have a strong interest in this for SLOs: a customer-triggered stop is fine, but if the system triggers a stop for other reasons, that could be an indicator, or perhaps you ran out of credits and we're forcibly stopping your workspaces. In either case, both parties should know to take the appropriate action.

Something like

{
    // ...
    stopRequested: boolean;
    stopReason: string;        // system-readable reason
    stopDescription: string;   // human-readable reason, e.g. "You've run out of credits, please go and purchase more."
}

The benefit of this is that you make it part of your domain model, which significantly eases debugging and makes it possible to surface the reasons to the user.
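
For example, a call site could then look roughly like this (hypothetical sketch: internalStopWorkspace doesn't take such an argument today, and the field names simply mirror the shape above):

// hypothetical call-site sketch, mirroring the proposed shape
const stopReason = "SYSTEM_CANCELLED_BRANCH_PREBUILD";
const stopDescription = "A newer commit was pushed to the same branch.";

log.info({ userId: user.id, workspaceId: workspace.id, stopReason }, stopDescription);
await this.internalStopWorkspace(ctx, workspace, {
    stopRequested: true,
    stopReason,
    stopDescription,
});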

@sagor999
Contributor Author

@easyCZ great idea!

@jankeromnes
Contributor

jankeromnes commented Sep 13, 2022

For reference, here are all the locations I could find where we issue a StopWorkspaceRequest:

When stopping a workspace via the API

public async stopWorkspace(ctx: TraceContext, workspaceId: string): Promise<void> {
    traceAPIParams(ctx, { workspaceId });
    traceWI(ctx, { workspaceId });
    const user = this.checkUser("stopWorkspace", undefined, { workspaceId });
    const logCtx = { userId: user.id, workspaceId };
    const workspace = await this.internalGetWorkspace(workspaceId, this.workspaceDb.trace(ctx));
    if (workspace.type === "prebuild") {
        // If this is a team prebuild, any team member can stop it.
        const teamMembers = await this.getTeamMembersByProject(workspace.projectId);
        await this.guardAccess({ kind: "workspace", subject: workspace, teamMembers }, "get");
    } else {
        // If this is not a prebuild, or it's a personal prebuild, only the workspace owner can stop it.
        await this.guardAccess({ kind: "workspace", subject: workspace }, "get");
    }
    this.internalStopWorkspace(ctx, workspace).catch((err) => {
        log.error(logCtx, "stopWorkspace error: ", err);
    });
}

When deleting a workspace via the API

public async deleteWorkspace(ctx: TraceContext, workspaceId: string): Promise<void> {
    traceAPIParams(ctx, { workspaceId });
    traceWI(ctx, { workspaceId });
    this.checkAndBlockUser("deleteWorkspace");
    const ws = await this.internalGetWorkspace(workspaceId, this.workspaceDb.trace(ctx));
    await this.guardAccess({ kind: "workspace", subject: ws }, "delete");
    // for good measure, try and stop running instances
    await this.internalStopWorkspace(ctx, ws);
    // actually delete the workspace
    await this.workspaceDeletionService.softDeleteWorkspace(ctx, ws, "user");
}

When cancelling a prebuild via the API

async cancelPrebuild(ctx: TraceContext, projectId: string, prebuildId: string): Promise<void> {
    traceAPIParams(ctx, { projectId, prebuildId });
    const user = this.checkAndBlockUser("cancelPrebuild");
    const project = await this.projectsService.getProject(projectId);
    if (!project) {
        throw new ResponseError(ErrorCodes.NOT_FOUND, "Project not found");
    }
    await this.guardProjectOperation(user, projectId, "update");
    const prebuild = await this.workspaceDb.trace(ctx).findPrebuildByID(prebuildId);
    if (!prebuild) {
        throw new ResponseError(ErrorCodes.NOT_FOUND, "Prebuild not found");
    }
    // Explicitly stopping the prebuild workspace now automatically cancels the prebuild
    await this.stopWorkspace(ctx, prebuild.buildWorkspaceId);
}

When an admin force-stops a workspace from the admin panel

async adminForceStopWorkspace(ctx: TraceContext, workspaceId: string): Promise<void> {
    traceAPIParams(ctx, { workspaceId });
    await this.requireEELicense(Feature.FeatureAdminDashboard);
    await this.guardAdminAccess("adminForceStopWorkspace", { id: workspaceId }, Permission.ADMIN_WORKSPACES);
    const workspace = await this.workspaceDb.trace(ctx).findById(workspaceId);
    if (workspace) {
        await this.internalStopWorkspace(ctx, workspace, StopWorkspacePolicy.IMMEDIATELY, true);
    }
}

When we start a prebuild on a branch, we cancel all currently running prebuilds for that branch

async abortPrebuildsForBranch(ctx: TraceContext, project: Project, user: User, branch: string): Promise<void> {
    const span = TraceContext.startSpan("abortPrebuildsForBranch", ctx);
    try {
        const prebuilds = await this.workspaceDB
            .trace({ span })
            .findActivePrebuiltWorkspacesByBranch(project.id, branch);
        const results: Promise<any>[] = [];
        for (const prebuild of prebuilds) {
            try {
                for (const instance of prebuild.instances) {
                    log.info(
                        { userId: user.id, instanceId: instance.id, workspaceId: instance.workspaceId },
                        "Aborting Prebuild workspace because a newer commit was pushed to the same branch.",
                    );
                    results.push(
                        this.workspaceStarter.stopWorkspaceInstance(
                            { span },
                            instance.id,
                            instance.region,
                            StopWorkspacePolicy.ABORT,
                        ),
                    );
                }
                prebuild.prebuild.state = "aborted";
                prebuild.prebuild.error = "A newer commit was pushed to the same branch.";
                results.push(this.workspaceDB.trace({ span }).storePrebuiltWorkspace(prebuild.prebuild));
            } catch (err) {
                error("Cannot cancel prebuild", { prebuildID: prebuild.prebuild.id }, err);
            }
        }
        await Promise.all(results);
    } finally {
        span.finish();
    }
}

When an admin blocks a user, we stop all their workspaces

async adminBlockUser(ctx: TraceContext, req: AdminBlockUserRequest): Promise<User> {
    traceAPIParams(ctx, { req });
    await this.requireEELicense(Feature.FeatureAdminDashboard);
    await this.guardAdminAccess("adminBlockUser", { req }, Permission.ADMIN_USERS);
    let targetUser;
    try {
        targetUser = await this.userService.blockUser(req.id, req.blocked);
    } catch (error) {
        throw new ResponseError(ErrorCodes.NOT_FOUND, "not found");
    }
    const workspaceDb = this.workspaceDb.trace(ctx);
    const workspaces = await workspaceDb.findWorkspacesByUser(req.id);
    const isDefined = <T>(x: T | undefined): x is T => x !== undefined;
    (await Promise.all(workspaces.map((workspace) => workspaceDb.findRunningInstance(workspace.id))))
        .filter(isDefined)
        .forEach((instance) =>
            this.workspaceStarter.stopWorkspaceInstance(
                ctx,
                instance.id,
                instance.region,
                StopWorkspacePolicy.IMMEDIATELY,
            ),
        );
    // For some reason, returning the result of `this.userDB.storeUser(target)` does not work. The response never arrives the caller.
    // Returning `target` instead (which should be equivalent).
    return this.censorUser(targetUser);
}

When a user deletes their account, we stop all their workspaces (and then delete them afterwards)

async deleteUser(id: string): Promise<void> {
    const user = await this.db.findUserById(id);
    if (!user) {
        throw new Error(`No user with id ${id} found!`);
    }
    if (user.markedDeleted === true) {
        log.debug({ userId: id }, "Is deleted but markDeleted already set. Continuing.");
    }
    // Stop all workspaces
    await this.stopWorkspaces(user);
    // Auth Providers
    const authProviders = await this.authProviderService.getAuthProvidersOfUser(user);
    for (const provider of authProviders) {
        try {
            await this.authProviderService.deleteAuthProvider(provider);
        } catch (error) {
            log.error({ userId: id }, "Failed to delete user's auth provider.", error);
        }
    }
    // User
    await this.db.transaction(async (db) => {
        this.anonymizeUser(user);
        this.deleteIdentities(user);
        await this.deleteTokens(db, user);
        user.lastVerificationTime = undefined;
        user.markedDeleted = true;
        await db.storeUser(user);
    });
    await Promise.all([
        // Workspace
        this.anonymizeAllWorkspaces(id),
        // UserStorageResourcesDB
        this.userStorageResourcesDb.deleteAllForUser(user.id),
        // Bucket
        this.deleteUserBucket(id),
        // Teams owned only by this user
        this.deleteSoleOwnedTeams(id),
        // Team memberships
        this.deleteTeamMemberships(id),
        // User projects
        this.deleteUserProjects(id),
    ]);
    // Track the deletion Event for Analytics Purposes
    this.analytics.track({
        userId: user.id,
        event: "deletion",
        properties: {
            deleted_at: new Date().toISOString(),
        },
    });
    this.analytics.identify({
        userId: user.id,
        traits: {
            github_slug: "deleted-user",
            gitlab_slug: "deleted-user",
            bitbucket_slug: "deleted-user",
            email: "deleted-user",
            full_name: "deleted-user",
            name: "deleted-user",
        },
    });
}

When a user runs out of credits (creditAlert) -- although I'm not sure if this is still used? 🤔

protected listenToCreditAlerts(): void {
    if (!this.user || !this.client) {
        return;
    }
    this.disposables.push(
        this.localMessageBroker.listenToCreditAlerts(
            this.user.id,
            (ctx: TraceContext, creditAlert: CreditAlert) => {
                TraceContext.withSpan(
                    "forwardCreditAlertToClient",
                    async (ctx) => {
                        traceClientMetadata(ctx, this.clientMetadata);
                        TraceContext.setJsonRPCMetadata(ctx, "onCreditAlert");
                        this.client?.onCreditAlert(creditAlert);
                        if (creditAlert.remainingUsageHours < 1e-6) {
                            const runningInstances = await this.workspaceDb
                                .trace(ctx)
                                .findRegularRunningInstances(creditAlert.userId);
                            runningInstances.forEach(
                                async (instance) => await this.stopWorkspace(ctx, instance.workspaceId),
                            );
                        }
                    },
                    ctx,
                );
            },
        ),
    );
}

@jankeromnes
Contributor

Thanks for the discussion & great ideas here!

In #12906 I've added a reason field to every StopWorkspaceRequest we issue, of type StopWorkspaceReason (a new enum).

  • Does this look useful?
  • Are the specified reasons helpful?
  • Should this reason also be logged and/or added to tracing?

@jankeromnes
Contributor

Thanks again for the StopWorkspaceReason idea @easyCZ!

I gave this a first shot in #12906, and started with the following reasons:

enum StopWorkspaceReason {
    // The user stopped the workspace
    USER_STOPPED = 0;

    // The user deleted the workspace
    USER_DELETED = 1;

    // The user cancelled the running prebuild
    USER_CANCELLED_PREBUILD = 2;

    // An admin force-stopped the workspace
    ADMIN_STOPPED = 3;

    // An admin blocked the user
    ADMIN_BLOCKED = 4;

    // A new commit was pushed to a repository branch, so we cancelled all previously running prebuilds for that branch
    SYSTEM_CANCELLED_BRANCH_PREBUILD = 5;

    // The user exceeded their allowed usage limit (e.g. number of hours in month), so we stopped the workspace
    SYSTEM_USER_LIMIT_REACHED = 6;
}
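
To connect this to the stopDescription idea above, a reason-to-description mapping could look roughly like this (hypothetical sketch; it assumes a generated TypeScript StopWorkspaceReason enum, and the wording of each description is just a placeholder):

// hypothetical mapping from system-readable reason to human-readable description
const stopDescriptions: Record<StopWorkspaceReason, string> = {
    [StopWorkspaceReason.USER_STOPPED]: "You stopped this workspace.",
    [StopWorkspaceReason.USER_DELETED]: "You deleted this workspace.",
    [StopWorkspaceReason.USER_CANCELLED_PREBUILD]: "You cancelled this prebuild.",
    [StopWorkspaceReason.ADMIN_STOPPED]: "An administrator stopped this workspace.",
    [StopWorkspaceReason.ADMIN_BLOCKED]: "Your account was blocked by an administrator.",
    [StopWorkspaceReason.SYSTEM_CANCELLED_BRANCH_PREBUILD]: "A newer commit was pushed to the same branch.",
    [StopWorkspaceReason.SYSTEM_USER_LIMIT_REACHED]: "You've reached your usage limit.",
};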

However, after challenging this a bit, it still feels underspecified, so more discussion & alignment is likely needed before we can actually make such a change to the high-stakes StopWorkspace protocol.

Meanwhile (and possibly in parallel to the discussion above), maybe let's re-focus on the actual problem at hand, which was a lack of logging/tracing to help debug why workspaces are being stopped.

For that I can easily add logs & traces in a separate PR, given that we already know all the call sites.
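
As a rough sketch of what that could look like at one of the call sites above, e.g. the admin force-stop (the reason string and the span tag name are illustrative only, not final):

// illustrative only; reason string and tag name are placeholders
const reason = "admin force-stopped the workspace";
log.info({ workspaceId, reason }, "stopping workspace");
ctx.span?.setTag("stopWorkspace.reason", reason); // assumes an opentracing-style span on the TraceContext
await this.internalStopWorkspace(ctx, workspace, StopWorkspacePolicy.IMMEDIATELY, true);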
