[ws-manager-bridge] the number of workspace instances remaining seems to be wrong #11399

Closed
jenting opened this issue Jul 15, 2022 · 5 comments
Labels
team: webapp, type: bug

Comments

jenting (Contributor) commented Jul 15, 2022

Bug description

We wanted to delete the us53 cluster. There were no workspace pods left in it, so we triggered the werft job to delete the us53 cluster. But somehow, ws-manager-bridge reported that 2 running instances remained:

Updated property [core/project].
downloading sha256:c479b2288f1e17ae558d565dcd1d686765cf4b9c310a5c85d2bbcdc644074b91  100% || (21/21 MB, 51.481 MB/s)
Switched to context "us53".
🚀 Retrieving governing meta cluster prod-meta-us02 kubeconfig
Fetching cluster endpoint and auth data.
kubeconfig entry generated for prod-meta-us02.
🚀 Changing kubectx to prod-meta-us02
Switched to context "gke_gitpod-191109_us-west1_prod-meta-us02".
Context "gke_gitpod-191109_us-west1_prod-meta-us02" modified.
Active namespace is "default".
🚀 Deregistering cluster us53
time="2022-07-14T20:35:29Z" level=fatal msg="rpc error: code = Unknown desc = cluster is not empty (2 instances remaining)"

After that, we checked the DB according to the code logic:

const instances = await this.workspaceDB.findRegularRunningInstances();

public async findRegularRunningInstances(userId?: string): Promise<WorkspaceInstance[]> {
    const infos = await this.findRunningInstancesWithWorkspaces(undefined, userId);
    return infos.filter((info) => info.workspace.type === "regular").map((wsinfo) => wsinfo.latestInstance);
}

public async findRunningInstancesWithWorkspaces(
    installation?: string,
    userId?: string,
    includeStopping: boolean = false,
): Promise<RunningWorkspaceInfo[]> {
    const params: any = {};
    const conditions = ["wsi.phasePersisted != 'stopped'", "wsi.deleted != TRUE"];
    if (!includeStopping) {
        // This excludes instances in a 'stopping' phase
        conditions.push("wsi.phasePersisted != 'stopping'");
    }
    if (installation) {
        params.region = installation;
        conditions.push("wsi.region = :region");
    }
    const joinParams: any = {};
    const joinConditions = [];
    if (userId) {
        joinParams.userId = userId;
        joinConditions.push("ws.ownerId = :userId");
    }
    return this.doJoinInstanceWithWorkspace<RunningWorkspaceInfo>(
        conditions,
        params,
        joinConditions,
        joinParams,
        (wsi, ws) => {
            return { workspace: ws, latestInstance: wsi };
        },
    );
}

However, the number of workspaces matching these criteria according to our SQL query is 10, rather than 2.

We are not sure whether this is a bug in ws-manager-bridge or whether we used the wrong query parameters.
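
For reference, the criteria above translate roughly into the following SQL. This is only a sketch; the table and column names (d_b_workspace, d_b_workspace_instance, workspaceId, type) are assumed from the gitpod DB schema rather than taken from the code above:

SELECT wsi.id, wsi.phasePersisted, ws.id
FROM gitpod.d_b_workspace_instance wsi
INNER JOIN gitpod.d_b_workspace ws
  ON ws.id = wsi.workspaceId
# conditions assembled in findRunningInstancesWithWorkspaces
WHERE wsi.phasePersisted NOT IN ('stopped', 'stopping')
  AND wsi.deleted != TRUE
# the .filter(...) step in findRegularRunningInstances
  AND ws.type = 'regular';

Note that findRegularRunningInstances() is called without an installation argument, so this query has no wsi.region condition; if a manually run query adds a region filter or drops the type = 'regular' condition, its count can differ from what ws-manager-bridge reports.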

Steps to reproduce

Internal slack thread.

Workspace affected

No response

Expected behavior

No response

Example repository

No response

Anything else?

No response

jenting added the type: bug and team: webapp labels Jul 15, 2022
geropl (Member) commented Jul 20, 2022

@jenting Could you elaborate on what the actual problem is here? The fact that we're querying the DB when deregistering the cluster? Or that you feel the numbers aren't correct? If it's the latter: What was the number you expected to see? E.g. in the cluster/by asking ws-manager?

geropl moved this to Clarification in 🍎 WebApp Team Jul 20, 2022
jenting (Contributor, Author) commented Jul 21, 2022

> @jenting Could you elaborate on what the actual problem is here? The fact that we're querying the DB when deregistering the cluster? Or that you feel the numbers aren't correct? If it's the latter: What was the number you expected to see? E.g. in the cluster/by asking ws-manager?

Well, the problem is that we ran the werft job workspace-cluster-delete to tear down the cluster, and it reported that 2 instances were still running. However, after running the SQL query directly against the production DB, it showed that 10 workspaces are in a stopping or stopped state.

We are not sure whether the problem is a code bug or whether we ran the wrong SQL query against the production DB. You could check the https://gitpod.slack.com/archives/C02F19UUW6S/p1657831120699349 thread to see the SQL query we ran. Thank you.

jenting changed the title from "[ws-manager-bridge] the number of workspace instances remaining seems to be wrgone" to "[ws-manager-bridge] the number of workspace instances remaining seems to be wrong" Jul 29, 2022
kylos101 (Contributor) commented Aug 4, 2022

@geropl both: it seems wrong to query the database (though I may be missing the historical context for why we do that), and the numbers from the workspace cluster and the database do not match.

Here is an example for us58:

# see how there are zero workspaces in the cluster?
gitpod /workspace/gitpod (main) $ kubectl get pods
NAME                                 READY   STATUS    RESTARTS       AGE
agent-smith-49rd2                    2/2     Running   0              2d3h
agent-smith-ccplz                    2/2     Running   0              3d14h
agent-smith-gsbpp                    2/2     Running   0              3d18h
agent-smith-n7m9s                    2/2     Running   0              2d3h
agent-smith-nr9zg                    2/2     Running   0              5d21h
image-builder-mk3-65f487c8c5-p6fw8   2/2     Running   0              6d
registry-facade-2nlbt                3/3     Running   0              3d14h
registry-facade-9fqnh                3/3     Running   0              3d18h
registry-facade-f9xct                3/3     Running   0              2d3h
registry-facade-r5578                3/3     Running   1 (5d8h ago)   5d22h
registry-facade-wwg7n                3/3     Running   0              2d3h
ws-daemon-5rcpt                      3/3     Running   0              3d18h
ws-daemon-9gjsx                      3/3     Running   0              2d3h
ws-daemon-g2lsz                      3/3     Running   0              3d14h
ws-daemon-kv22v                      3/3     Running   0              5d22h
ws-daemon-pvrx8                      3/3     Running   0              2d3h
ws-manager-84bb5cffd6-6pq5h          2/2     Running   0              2d7h
ws-proxy-c4cb5d5cf-77m27             2/2     Running   0              6d
ws-proxy-c4cb5d5cf-89lnl             2/2     Running   0              6d
ws-proxy-c4cb5d5cf-rp469             2/2     Running   0              6d

In this job, we get: time="2022-08-04T21:21:49Z" level=fatal msg="rpc error: code = Unknown desc = cluster is not empty (14 instances remaining)".

It looks like the workspaces that are being counted in this case are the pending ones, which I think means they didn't necessarily land on a workspace cluster, such as this one.

I think the query that would have produced this count looks similar to:

SELECT dbwi.id, dbwi.phasePersisted, dbw.id, dbwi.deleted
FROM gitpod.d_b_workspace dbw
INNER JOIN gitpod.d_b_workspace_instance dbwi
  ON dbw.id = dbwi.workspaceId
WHERE dbwi.phasePersisted NOT IN ('stopped', 'stopping')
  AND dbwi.deleted = 0
  # change to match a region you're interested in
  AND dbwi.region = 'us58'
ORDER BY 2 ASC;

Which yields 14 pending workspaces.

05bb0a45-a6ff-46e2-9001-ee6686f00b24	pending
237c28fc-93cf-4040-bd5e-b1474f921bfa	pending
3ba75899-2d09-4bc9-b242-2582e24e5fa3	pending
457957db-01fc-4b3f-b3fb-cb76066b011d	pending
524b3d97-9b1b-4cc0-af80-82a8ef21a13a	pending
5ec47396-e0db-4d6f-8cbb-8d76d80dc634	pending
683ad836-3763-455c-9924-78aa40dfbc73	pending
6f925daf-3234-46b7-bf1c-d73d0b337d42	pending
7ea87b1a-78bf-4972-aa98-7388053dae03	pending
824103f9-2a95-412a-afc0-f755c0eba6f6	pending
c2ea0add-f401-4ceb-813f-45d46887c20c	pending
eb43d308-49cd-4311-ac1a-c3369ba8aca9	pending
ecfe5d14-3ce4-4477-a2a8-2fd09acd23dc	pending
fdec6ee9-73a1-4104-9490-a9de771db03f	pending
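
A small variant of the query above (same table and column names) groups by phase and makes the breakdown easier to see; this is just a sketch:

SELECT dbwi.phasePersisted, COUNT(*) AS instances
FROM gitpod.d_b_workspace dbw
INNER JOIN gitpod.d_b_workspace_instance dbwi
  ON dbw.id = dbwi.workspaceId
WHERE dbwi.phasePersisted NOT IN ('stopped', 'stopping')
  AND dbwi.deleted = 0
  # change to match a region you're interested in
  AND dbwi.region = 'us58'
GROUP BY dbwi.phasePersisted;

Given the list above, this should report all 14 instances under pending.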

We've been having phase management problems lately; I imagine this may be a symptom. I would have expected ws-manager-bridge to return zero (given what I saw in the workspace cluster), but it returned 14 workspaces.
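
To check how stale those pending instances are, something like the following should work — a sketch only, assuming the d_b_workspace_instance table has a creationTime column:

SELECT dbwi.id, dbwi.phasePersisted, dbwi.creationTime
FROM gitpod.d_b_workspace_instance dbwi
WHERE dbwi.phasePersisted = 'pending'
  AND dbwi.deleted = 0
  # change to match a region you're interested in
  AND dbwi.region = 'us58'
ORDER BY dbwi.creationTime ASC;

If these instances have been pending for days, that would fit the theory that they never actually landed on a workspace cluster.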

kylos101 (Contributor) commented Aug 5, 2022

@geropl I think this may be related to #11397

geropl (Member) commented Sep 9, 2022

> I think this may be related to #11397

Exactly. I will close this as a dupe, and have scheduled #11397.

geropl closed this as not planned Sep 9, 2022
Repository owner moved this from Clarification to Done in 🍎 WebApp Team Sep 9, 2022