server: please log any call to StopWorkspace with the reason why it is called #12283
Comments
@sagor999 wdyt of adding traces too? We might even be able to ensure that this particular trace is sampled at 100%. @mads-hartmann is it possible for us to configure certain traces so that they are always sampled, instead of at the default rate (I think 5 or 10% is the default for traces)?
Ideally you would add both a log line and attach the information to the relevant span, as the two kinds of telemetry have different strengths depending on what kinds of questions you're trying to answer.
For "system auditing" like this, where we want to know why the system took a specific action against a specific workspace, logs are the best approach. If on the other hand you want to look at it more holistically, e.g. get a sense of what the distribution of "reasons" for stopping workspaces is, compute error rates, filter by cluster, etc., then the tooling we have for traces is much better; in that kind of analysis you really don't care about individual workspaces or requests, but about the system as a whole.
I tried to capture some of this in When to use what telemetry, but I haven't revised it since it was originally written, so please comment if things are not clear 🧡
Ideally we wouldn't have to worry about logs vs traces, as a span is really just a structured log event with a fixed schema that captures the hierarchy of operations. I will create an issue to create a PoC that would allow us to set an attribute on a span (e.g. audit_log) that ensures that such spans are always sampled. With that in place we would be able to use spans for audit-like logs, but I'm not sure it's the right approach, as I still think log lines are required in order for our self-hosted customers to understand why the system is behaving the way it is.
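To make the "log line plus span attribute" idea concrete, here is a minimal sketch. It assumes hypothetical stand-ins for the logger and the span (`Span`, `recordStopReason`, `shouldSample` and `logSink` are illustrative names, not Gitpod's or any tracing library's API), and the sampling rule is deliberately simplified compared to how a real tracing backend decides what to keep.

```typescript
// Hypothetical, minimal stand-ins for a structured logger and a tracing span.
type Attributes = Record<string, string | number | boolean>;

interface Span {
    name: string;
    attributes: Attributes;
}

// Emit one structured log line and mirror the same fields onto the span,
// so the reason is available for per-workspace auditing (logs) and for
// aggregate analysis (traces).
function recordStopReason(
    span: Span,
    workspaceId: string,
    reason: string,
    logSink: (line: string) => void,
): void {
    const fields: Attributes = { workspaceId, reason };
    logSink(JSON.stringify({ message: "stopping workspace", ...fields }));
    // `audit_log` is the attribute name floated in the comment above.
    Object.assign(span.attributes, fields, { audit_log: true });
}

// Simplified sampling decision: spans flagged with `audit_log` are always
// kept; everything else is kept with probability `defaultRate`.
function shouldSample(
    span: Span,
    defaultRate: number,
    random: () => number = Math.random,
): boolean {
    if (span.attributes["audit_log"] === true) {
        return true;
    }
    return random() < defaultRate;
}
```

Passing the log sink and the random source as parameters keeps the sketch independent of any particular logger and makes the sampling rule easy to check in isolation.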
So taking a higher level view, there's no reason we wouldn't attach the Stop reason as a first class property in the request, but also into the underlying WorkspaceInstance representation (in both WebApp side, but also Something like
The benefit of this is that you make it part of your domain model, which significantly eases your ability to debug - and to surface the reasons to the user.
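As a rough illustration of what "part of your domain model" could look like, the sketch below carries the reason on the request and persists it on the instance. Every type and field here (`StopReason`, `StopWorkspaceRequest`, `stoppedReason`, `applyStop`, `describeStop`) is a hypothetical simplification for this example and is not taken from the Gitpod codebase or from the snippet originally attached to this comment.

```typescript
// All names below are hypothetical and simplified for illustration.
type StopReason = "user_stopped" | "admin_stopped" | "prebuild_cancelled";

// The reason travels with the request as a first-class property...
interface StopWorkspaceRequest {
    workspaceId: string;
    reason: StopReason;
}

// ...and is stored on the instance, so it outlives the request.
interface WorkspaceInstance {
    workspaceId: string;
    phase: "running" | "stopping" | "stopped";
    stoppedReason?: StopReason;
}

// Apply a stop request without mutating the original instance.
function applyStop(
    instance: WorkspaceInstance,
    req: StopWorkspaceRequest,
): WorkspaceInstance {
    return { ...instance, phase: "stopping", stoppedReason: req.reason };
}

// Because the reason is stored, it can be surfaced to the user later.
function describeStop(instance: WorkspaceInstance): string {
    switch (instance.stoppedReason) {
        case "user_stopped":
            return "You stopped this workspace.";
        case "admin_stopped":
            return "An administrator stopped this workspace.";
        case "prebuild_cancelled":
            return "The prebuild was cancelled.";
        default:
            return "No stop reason recorded.";
    }
}
```

Storing the reason next to the phase means debugging tools and the UI can read it from the same record, rather than reconstructing it from logs.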
@easyCZ great idea!
For reference, here are all the locations I could find where we issue a stop-workspace request:
When stopping a workspace via the API: gitpod/components/server/src/workspace/gitpod-server-impl.ts, lines 734 to 754 in b634fb3
When deleting a workspace via the API: gitpod/components/server/src/workspace/gitpod-server-impl.ts, lines 827 to 841 in b634fb3
When cancelling a prebuild via the API: gitpod/components/server/ee/src/workspace/gitpod-server-impl.ts, lines 2610 to 2627 in b634fb3
When an admin force-stops a workspace from the admin panel: gitpod/components/server/ee/src/workspace/gitpod-server-impl.ts, lines 858 to 869 in b634fb3
When we start a prebuild on a branch, we cancel all currently running prebuilds for that branch: gitpod/components/server/ee/src/prebuilds/prebuild-manager.ts, lines 63 to 97 in 51c0c6b
When an admin blocks a user, we stop all their workspaces: gitpod/components/server/ee/src/workspace/gitpod-server-impl.ts, lines 610 to 641 in b634fb3
When a user deletes their account, we stop all their workspaces (and then delete them afterwards): gitpod/components/server/src/user/user-deletion-service.ts, lines 40 to 107 in 79b75ab
When a user runs out of credits (creditAlert) -- although I'm not sure if this is still used? 🤔: gitpod/components/server/ee/src/workspace/gitpod-server-impl.ts, lines 222 to 251 in b634fb3
Thanks for the discussion & great ideas here! In #12906 I've added a
Thanks again for the I gave this a first shot in #12906, and started with the following reasons:
enum StopWorkspaceReason {
    // The user stopped the workspace
    USER_STOPPED = 0;
    // The user deleted the workspace
    USER_DELETED = 1;
    // The user cancelled the running prebuild
    USER_CANCELLED_PREBUILD = 2;
    // An admin force-stopped the workspace
    ADMIN_STOPPED = 3;
    // An admin blocked the user
    ADMIN_BLOCKED = 4;
    // A new commit was pushed to a repository branch, so we cancelled all previously running prebuilds for that branch
    SYSTEM_CANCELLED_BRANCH_PREBUILD = 5;
    // The user exceeded their allowed usage limit (e.g. number of hours in month), so we stopped the workspace
    SYSTEM_USER_LIMIT_REACHED = 6;
}
However, after challenging this a bit, it still feels underspecified, so more discussion & alignment is likely needed before we can actually make such a change to the high-stakes
Meanwhile (and possibly in parallel to the discussion above), maybe let's re-focus on the actual problem at hand, which was a lack of logging/tracing to help debug why workspaces are being stopped. For that I can easily add logs & traces in a separate PR, given that we already know all the call sites.
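For the "add logs at the known call sites" route, a thin wrapper is one option. The sketch below mirrors the proposed reasons as a TypeScript enum (values copied from the enum in this comment) and logs the reason before delegating. `withStopReasonLogging`, `StopFn` and the `log` callback are hypothetical names for illustration, not the server's actual API.

```typescript
// TypeScript mirror of the proposed reasons; values copied from the enum above.
enum StopWorkspaceReason {
    USER_STOPPED = 0,
    USER_DELETED = 1,
    USER_CANCELLED_PREBUILD = 2,
    ADMIN_STOPPED = 3,
    ADMIN_BLOCKED = 4,
    SYSTEM_CANCELLED_BRANCH_PREBUILD = 5,
    SYSTEM_USER_LIMIT_REACHED = 6,
}

// Hypothetical shape of the underlying stop call.
type StopFn = (workspaceId: string) => Promise<void>;

type LogEntry = Record<string, unknown>;

// Wrap a stop function so that every call must state a reason,
// and the reason is logged before the stop is issued.
function withStopReasonLogging(
    stop: StopFn,
    log: (entry: LogEntry) => void,
): (workspaceId: string, reason: StopWorkspaceReason) => Promise<void> {
    return async (workspaceId, reason) => {
        log({
            message: "StopWorkspace called",
            workspaceId,
            // Numeric enums have a reverse mapping, so this yields the name.
            reason: StopWorkspaceReason[reason],
        });
        await stop(workspaceId);
    };
}
```

Each call site would then pass its own reason, for example `StopWorkspaceReason.ADMIN_BLOCKED` when an admin blocks a user, which keeps the log line consistent across all the locations listed earlier in the thread.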
Is your feature request related to a problem? Please describe
On the workspace side we only see that StopWorkspace is being called. But looking at the webapp logs, there is nothing that would explain why StopWorkspace was called.
Describe the behaviour you'd like
Can you please log a reason any time the server calls StopWorkspace, to ease debugging? For example, whether the user requested the stop, or whether some internal server logic decided to call it (like cancelling a prebuild because there is a newer commit).
Describe alternatives you've considered
Additional context