[ws-manager-bridge] the number of workspace instances remaining seems to be wrong #11399

Closed
jenting opened this issue Jul 15, 2022 · 5 comments
Labels
team: webapp, type: bug

Comments

jenting (Contributor) commented Jul 15, 2022

Bug description

We wanted to delete the us53 cluster. There were no workspace pods left in it, so we triggered the werft job to delete the us53 cluster. But somehow, ws-manager-bridge reported that 2 running instances remained:

Updated property [core/project].
downloading sha256:c479b2288f1e17ae558d565dcd1d686765cf4b9c310a5c85d2bbcdc644074b91  100% || (21/21 MB, 51.481 MB/s)
Switched to context "us53".
🚀 Retrieving governing meta cluster prod-meta-us02 kubeconfig
Fetching cluster endpoint and auth data.
kubeconfig entry generated for prod-meta-us02.
🚀 Changing kubectx to prod-meta-us02
Switched to context "gke_gitpod-191109_us-west1_prod-meta-us02".
Context "gke_gitpod-191109_us-west1_prod-meta-us02" modified.
Active namespace is "default".
🚀 Deregistering cluster us53
time="2022-07-14T20:35:29Z" level=fatal msg="rpc error: code = Unknown desc = cluster is not empty (2 instances remaining)"

After that, we checked the DB according to the code logic:

const instances = await this.workspaceDB.findRegularRunningInstances();

public async findRegularRunningInstances(userId?: string): Promise<WorkspaceInstance[]> {
    const infos = await this.findRunningInstancesWithWorkspaces(undefined, userId);
    return infos.filter((info) => info.workspace.type === "regular").map((wsinfo) => wsinfo.latestInstance);
}

public async findRunningInstancesWithWorkspaces(
    installation?: string,
    userId?: string,
    includeStopping: boolean = false,
): Promise<RunningWorkspaceInfo[]> {
    const params: any = {};
    const conditions = ["wsi.phasePersisted != 'stopped'", "wsi.deleted != TRUE"];
    if (!includeStopping) {
        // This excludes instances in a 'stopping' phase
        conditions.push("wsi.phasePersisted != 'stopping'");
    }
    if (installation) {
        params.region = installation;
        conditions.push("wsi.region = :region");
    }
    const joinParams: any = {};
    const joinConditions = [];
    if (userId) {
        joinParams.userId = userId;
        joinConditions.push("ws.ownerId = :userId");
    }
    return this.doJoinInstanceWithWorkspace<RunningWorkspaceInfo>(
        conditions,
        params,
        joinConditions,
        joinParams,
        (wsi, ws) => {
            return { workspace: ws, latestInstance: wsi };
        },
    );
}

However, the number of workspaces matching these criteria according to our SQL query is 10, rather than 2.

We are not sure whether this is a bug in ws-manager-bridge or whether we used the wrong query parameters.
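
For reference, the criteria above translate roughly into the following SQL. This is only a sketch; the table and column names (d_b_workspace, d_b_workspace_instance, workspaceId, type) are assumed from the gitpod DB schema rather than taken from the code above:

SELECT wsi.id, wsi.phasePersisted, ws.id
FROM gitpod.d_b_workspace_instance wsi
INNER JOIN gitpod.d_b_workspace ws
  ON ws.id = wsi.workspaceId
# conditions assembled in findRunningInstancesWithWorkspaces
WHERE wsi.phasePersisted NOT IN ('stopped', 'stopping')
  AND wsi.deleted != TRUE
# the .filter(...) step in findRegularRunningInstances
  AND ws.type = 'regular';

Note that findRegularRunningInstances() is called without an installation argument, so this query has no wsi.region condition; if a manually run query adds a region filter or drops the type = 'regular' condition, its count can differ from what ws-manager-bridge reports.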

Steps to reproduce

Internal slack thread.

Workspace affected

No response

Expected behavior

No response

Example repository

No response

Anything else?

No response

jenting added the type: bug and team: webapp labels Jul 15, 2022
geropl (Member) commented Jul 20, 2022

@jenting Could you elaborate on what the actual problem is here? The fact that we're querying the DB when deregistering the cluster? Or that you feel the numbers aren't correct? If it's the latter: What was the number you expected to see? E.g. in the cluster/by asking ws-manager?

geropl moved this to Clarification in 🍎 WebApp Team Jul 20, 2022
jenting (Contributor, Author) commented Jul 21, 2022

> @jenting Could you elaborate on what the actual problem is here? The fact that we're querying the DB when deregistering the cluster? Or that you feel the numbers aren't correct? If it's the latter: What was the number you expected to see? E.g. in the cluster/by asking ws-manager?

Well, the problem is that we ran the werft job workspace-cluster-delete to tear down the cluster, and it reported that 2 instances were still running. However, after running the SQL query directly against the production DB, it showed that 10 workspaces are in a stopping or stopped state.

We are not sure whether the problem is a code bug or whether we ran the wrong SQL query against the production DB. You could check the https://gitpod.slack.com/archives/C02F19UUW6S/p1657831120699349 thread to see the SQL query we ran. Thank you.

jenting changed the title from "[ws-manager-bridge] the number of workspace instances remaining seems to be wrgone" to "[ws-manager-bridge] the number of workspace instances remaining seems to be wrong" Jul 29, 2022
kylos101 (Contributor) commented Aug 4, 2022

@geropl both: it seems wrong to query the database (though I may be missing the historical context for why we do that), and the numbers from the workspace cluster and the database do not match.

Here is an example for us58:

# see how there are zero workspaces in the cluster?
gitpod /workspace/gitpod (main) $ kubectl get pods
NAME                                 READY   STATUS    RESTARTS       AGE
agent-smith-49rd2                    2/2     Running   0              2d3h
agent-smith-ccplz                    2/2     Running   0              3d14h
agent-smith-gsbpp                    2/2     Running   0              3d18h
agent-smith-n7m9s                    2/2     Running   0              2d3h
agent-smith-nr9zg                    2/2     Running   0              5d21h
image-builder-mk3-65f487c8c5-p6fw8   2/2     Running   0              6d
registry-facade-2nlbt                3/3     Running   0              3d14h
registry-facade-9fqnh                3/3     Running   0              3d18h
registry-facade-f9xct                3/3     Running   0              2d3h
registry-facade-r5578                3/3     Running   1 (5d8h ago)   5d22h
registry-facade-wwg7n                3/3     Running   0              2d3h
ws-daemon-5rcpt                      3/3     Running   0              3d18h
ws-daemon-9gjsx                      3/3     Running   0              2d3h
ws-daemon-g2lsz                      3/3     Running   0              3d14h
ws-daemon-kv22v                      3/3     Running   0              5d22h
ws-daemon-pvrx8                      3/3     Running   0              2d3h
ws-manager-84bb5cffd6-6pq5h          2/2     Running   0              2d7h
ws-proxy-c4cb5d5cf-77m27             2/2     Running   0              6d
ws-proxy-c4cb5d5cf-89lnl             2/2     Running   0              6d
ws-proxy-c4cb5d5cf-rp469             2/2     Running   0              6d

In this job, we get: time="2022-08-04T21:21:49Z" level=fatal msg="rpc error: code = Unknown desc = cluster is not empty (14 instances remaining)".

It looks like the workspaces that are being counted in this case are the pending ones, which I think means they didn't necessarily land on a workspace cluster, such as this one.

I think the query that would have produced this count looks similar to:

SELECT dbwi.id, dbwi.phasePersisted, dbw.id, dbwi.deleted
FROM gitpod.d_b_workspace dbw
INNER JOIN gitpod.d_b_workspace_instance dbwi
  ON dbw.id = dbwi.workspaceId
WHERE dbwi.phasePersisted NOT IN ('stopped', 'stopping')
  AND dbwi.deleted = 0
  # change to match a region you're interested in
  AND dbwi.region = 'us58'
ORDER BY 2 ASC;

Which yields 14 pending workspaces.

05bb0a45-a6ff-46e2-9001-ee6686f00b24	pending
237c28fc-93cf-4040-bd5e-b1474f921bfa	pending
3ba75899-2d09-4bc9-b242-2582e24e5fa3	pending
457957db-01fc-4b3f-b3fb-cb76066b011d	pending
524b3d97-9b1b-4cc0-af80-82a8ef21a13a	pending
5ec47396-e0db-4d6f-8cbb-8d76d80dc634	pending
683ad836-3763-455c-9924-78aa40dfbc73	pending
6f925daf-3234-46b7-bf1c-d73d0b337d42	pending
7ea87b1a-78bf-4972-aa98-7388053dae03	pending
824103f9-2a95-412a-afc0-f755c0eba6f6	pending
c2ea0add-f401-4ceb-813f-45d46887c20c	pending
eb43d308-49cd-4311-ac1a-c3369ba8aca9	pending
ecfe5d14-3ce4-4477-a2a8-2fd09acd23dc	pending
fdec6ee9-73a1-4104-9490-a9de771db03f	pending
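
A small variant of the query above (same table and column names) groups by phase and makes the breakdown easier to see; this is just a sketch:

SELECT dbwi.phasePersisted, COUNT(*) AS instances
FROM gitpod.d_b_workspace dbw
INNER JOIN gitpod.d_b_workspace_instance dbwi
  ON dbw.id = dbwi.workspaceId
WHERE dbwi.phasePersisted NOT IN ('stopped', 'stopping')
  AND dbwi.deleted = 0
  # change to match a region you're interested in
  AND dbwi.region = 'us58'
GROUP BY dbwi.phasePersisted;

Given the list above, this should report all 14 instances under pending.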

We've been having phase management problems lately; I imagine this may be a symptom. I would have expected ws-manager-bridge to return zero (given what I saw in the workspace cluster), but it returned 14 workspaces.
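
To check how stale those pending instances are, something like the following should work — a sketch only, assuming the d_b_workspace_instance table has a creationTime column:

SELECT dbwi.id, dbwi.phasePersisted, dbwi.creationTime
FROM gitpod.d_b_workspace_instance dbwi
WHERE dbwi.phasePersisted = 'pending'
  AND dbwi.deleted = 0
  # change to match a region you're interested in
  AND dbwi.region = 'us58'
ORDER BY dbwi.creationTime ASC;

If these instances have been pending for days, that would fit the theory that they never actually landed on a workspace cluster.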

kylos101 (Contributor) commented Aug 5, 2022

@geropl I think this may be related to #11397

geropl (Member) commented Sep 9, 2022

> I think this may be related to #11397

Exactly. I will close this as a dupe, and have scheduled #11397.

geropl closed this as not planned Sep 9, 2022
Repository owner moved this from Clarification to Done in 🍎 WebApp Team Sep 9, 2022