
ws-manager: fix event workers hang forever #12995

Merged
merged 2 commits into main from jenting/fix-ws-manager-workers-hang
Sep 15, 2022

Conversation

jenting
Contributor

@jenting jenting commented Sep 15, 2022

Fix event workers hanging forever when more than 100 VolumeSnapshots are ready and the ws-manager restarts.
The m.notifyPod channel has no receiver, which causes the 100 event workers to hang forever.
As a result, ws-manager can no longer handle any workspace pod or volume snapshot event changes.
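For illustration, here is a minimal, self-contained Go sketch of the failure mode. This is not the actual ws-manager code; the notifyPod channel and worker loop below are hypothetical stand-ins showing why a send on a channel with no receiver blocks every worker, and why draining the channel (i.e. having the receiver in place) unblocks them.

    // Sketch only: illustrates the hang, not the real ws-manager implementation.
    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        // Unbuffered channel: every send blocks until some goroutine receives.
        notifyPod := make(chan string)

        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                // Each worker tries to report one "VolumeSnapshot ready" event.
                // Without a receiver on notifyPod, this send blocks forever,
                // so all 100 workers hang and no further events get handled.
                notifyPod <- fmt.Sprintf("snapshot-%d is ready", id)
            }(i)
        }

        // The fix boils down to making sure notifyPod is drained while the
        // workers run, for example by having a receiver loop in place:
        done := make(chan struct{})
        go func() {
            for msg := range notifyPod {
                fmt.Println("handled:", msg)
            }
            close(done)
        }()

        wg.Wait()        // all sends complete because the receiver drains them
        close(notifyPod) // safe: no senders remain
        <-done           // wait for the receiver to finish printing
    }

With the receiver goroutine removed, the 100 senders block forever and wg.Wait() never returns, which mirrors the hang described above.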

Description

Fix event workers hanging when 100+ VolumeSnapshots are ready and ws-manager restarts.

Related Issue(s)

Fixes #13007

How to test

  • Prepare Pod yaml manifest and save it as pod.yaml.

    apiVersion: v1
    kind: Pod
    metadata:
      name: test
    spec:
      containers:
      - name: test
        image: alpine:latest
        volumeMounts:
        - name: volv
          mountPath: /data
      volumes:
      - name: volv
        persistentVolumeClaim:
          claimName: test
  • Prepare PVC yaml manifest and save it as pvc.yaml.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: rook-ceph-block
      resources:
        requests:
          storage: 1Mi
  • Prepare VolumeSnapshot yaml manifest and save it as vs-0.yaml.

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: test-0
      annotations:
        gitpod/id: test-0
    spec:
      volumeSnapshotClassName: csi-rbdplugin-snapclass
      source:
        persistentVolumeClaimName: test
  • Prepare 100 VolumeSnapshot yaml manifests.

    for i in {1..100}; do cp vs-0.yaml vs-${i}.yaml; sed -i "s/-0/-${i}/g" vs-${i}.yaml; done
  • Prepare Pod and PVC.

    kubectl create -f pod.yaml -f pvc.yaml
  • Wait until the Pod is running and the PVC is bound.

  • Prepare 100 VolumeSnapshots.

    for i in {1..100}; do kubectl apply -f vs-${i}.yaml; done
  • Wait for all 100 VolumeSnapshots to become ready.

    kubectl get vs | grep true | wc -l
  • Restart ws-manager

    kubectl rollout restart deploy ws-manager
  • Make sure the workspace can start.

Release Notes

None

Documentation

None

Werft options:

  • /werft with-preview

@jenting
Contributor Author

jenting commented Sep 15, 2022

/werft run with-preview

👍 started the job as gitpod-build-jenting-fix-ws-manager-workers-hang.2
(with .werft/ from main)

@roboquat roboquat added size/S and removed size/XS labels Sep 15, 2022
@jenting jenting changed the title ws-manager: fix event workers hang ws-manager: fix event workers hang forever Sep 15, 2022
@jenting jenting added the team: workspace (Issue belongs to the Workspace team) label Sep 15, 2022
@jenting jenting marked this pull request as ready for review September 15, 2022 14:11
@jenting jenting requested a review from a team September 15, 2022 14:11
Fix workers hanging when over 100 VolumeSnapshots are ready and the ws-manager is restarted.
The m.notifyPod channel has no receiver, which causes the 100 event workers to hang.

Signed-off-by: JenTing Hsiao <[email protected]>
@jenting jenting force-pushed the jenting/fix-ws-manager-workers-hang branch from cf68290 to df50665 on September 15, 2022 14:12
@jenting
Contributor Author

jenting commented Sep 15, 2022

/werft run with-preview

👍 started the job as gitpod-build-jenting-fix-ws-manager-workers-hang.5
(with .werft/ from main)

@roboquat roboquat merged commit df91671 into main Sep 15, 2022
@roboquat roboquat deleted the jenting/fix-ws-manager-workers-hang branch September 15, 2022 22:58
@roboquat roboquat added the deployed: workspace (Workspace team change is running in production) and deployed (Change is completely running in production) labels Sep 20, 2022
Labels
  • deployed: workspace (Workspace team change is running in production)
  • deployed (Change is completely running in production)
  • release-note-none
  • size/S
  • team: workspace (Issue belongs to the Workspace team)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[PVC] ws-manager event workers hang forever once over 100 VolumeSnapshots and ws-manager restart
3 participants