
ws-manager: fix event workers hang forever #12995

Merged
merged 2 commits into main from jenting/fix-ws-manager-workers-hang
Sep 15, 2022

Conversation

jenting
Contributor

@jenting jenting commented Sep 15, 2022

Fix event workers hanging forever when more than 100 VolumeSnapshots are ready and the ws-manager restarts.
The m.notifyPod channel has no receiver, which causes the 100 event workers to hang forever.
As a result, ws-manager can no longer handle any workspace pod or volume snapshot event changes.
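For illustration, here is a minimal, self-contained Go sketch of the failure mode. This is not the actual ws-manager code; the notifyPod channel and worker loop below are hypothetical stand-ins showing why a send on a channel with no receiver blocks every worker, and why draining the channel (i.e. having the receiver in place) unblocks them.

    // Sketch only: illustrates the hang, not the real ws-manager implementation.
    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        // Unbuffered channel: every send blocks until some goroutine receives.
        notifyPod := make(chan string)

        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                // Each worker tries to report one "VolumeSnapshot ready" event.
                // Without a receiver on notifyPod, this send blocks forever,
                // so all 100 workers hang and no further events get handled.
                notifyPod <- fmt.Sprintf("snapshot-%d is ready", id)
            }(i)
        }

        // The fix boils down to making sure notifyPod is drained while the
        // workers run, for example by having a receiver loop in place:
        done := make(chan struct{})
        go func() {
            for msg := range notifyPod {
                fmt.Println("handled:", msg)
            }
            close(done)
        }()

        wg.Wait()        // all sends complete because the receiver drains them
        close(notifyPod) // safe: no senders remain
        <-done           // wait for the receiver to finish printing
    }

With the receiver goroutine removed, the 100 senders block forever and wg.Wait() never returns, which mirrors the hang described above.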

Description

Fix event workers hanging when 100+ VolumeSnapshots are ready and ws-manager restarts.

Related Issue(s)

Fixes #13007

How to test

  • Prepare Pod yaml manifest and save it as pod.yaml.

    apiVersion: v1
    kind: Pod
    metadata:
      name: test
    spec:
      containers:
      - name: test
        image: alpine:latest
        volumeMounts:
        - name: volv
          mountPath: /data
      volumes:
      - name: volv
        persistentVolumeClaim:
          claimName: test
  • Prepare PVC yaml manifest and save it as pvc.yaml.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: rook-ceph-block
      resources:
        requests:
          storage: 1Mi
  • Prepare VolumeSnapshot yaml manifest and save it as vs-0.yaml.

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: test-0
      annotations:
        gitpod/id: test-0
    spec:
      volumeSnapshotClassName: csi-rbdplugin-snapclass
      source:
        persistentVolumeClaimName: test
  • Prepare 100 VolumeSnapshot yaml manifests.

    for i in {1..100}; do cp vs-0.yaml vs-${i}.yaml; sed -i "s/-0/-${i}/g" vs-${i}.yaml; done
  • Prepare Pod and PVC.

    kubectl create -f pod.yaml -f pvc.yaml
  • Wait until the Pod is running and the PVC is bound.

  • Prepare 100 VolumeSnapshots.

    for i in {1..100}; do kubectl apply -f vs-${i}.yaml; done
  • Wait for all 100 VolumeSnapshots to become ready.

    kubectl get vs | grep true | wc -l
  • Restart ws-manager

    kubectl rollout restart deploy ws-manager
  • Make sure the workspace can start.

Release Notes

None

Documentation

None

Werft options:

  • /werft with-preview

@jenting
Contributor Author

jenting commented Sep 15, 2022

/werft run with-preview

👍 started the job as gitpod-build-jenting-fix-ws-manager-workers-hang.2
(with .werft/ from main)

@roboquat roboquat added size/S and removed size/XS labels Sep 15, 2022
@jenting jenting changed the title ws-manager: fix event workers hang ws-manager: fix event workers hang forever Sep 15, 2022
@jenting jenting added the team: workspace (Issue belongs to the Workspace team) label Sep 15, 2022
@jenting jenting marked this pull request as ready for review September 15, 2022 14:11
@jenting jenting requested a review from a team September 15, 2022 14:11
Fix workers hanging when over 100 VolumeSnapshots are ready and the ws-manager is restarted.
The m.notifyPod channel has no receiver, which causes the 100 event workers to hang.

Signed-off-by: JenTing Hsiao <[email protected]>
@jenting jenting force-pushed the jenting/fix-ws-manager-workers-hang branch from cf68290 to df50665 on September 15, 2022 14:12
@jenting
Contributor Author

jenting commented Sep 15, 2022

/werft run with-preview

👍 started the job as gitpod-build-jenting-fix-ws-manager-workers-hang.5
(with .werft/ from main)

@roboquat roboquat merged commit df91671 into main Sep 15, 2022
@roboquat roboquat deleted the jenting/fix-ws-manager-workers-hang branch September 15, 2022 22:58
@roboquat roboquat added the deployed: workspace (Workspace team change is running in production) and deployed (Change is completely running in production) labels Sep 20, 2022
Labels
  • deployed: workspace (Workspace team change is running in production)
  • deployed (Change is completely running in production)
  • release-note-none
  • size/S
  • team: workspace (Issue belongs to the Workspace team)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[PVC] ws-manager event workers hang forever once over 100 VolumeSnapshots and ws-manager restart
3 participants