Workspace pod FailedScheduling event not being properly caught #977

Closed
AObuchow opened this issue Nov 14, 2022 · 5 comments · Fixed by #978

AObuchow commented Nov 14, 2022

Description

In some cases, a workspace deployment's pod will trigger the FailedScheduling event and DWO will not be able to properly catch it.
So far, this seems to occur when a workspace requests more memory or CPU than is available on any node on an OpenShift cluster.

What I have observed is that, although DWO sees the pod's FailedScheduling event, the event.count is set to 0. DWO then checks whether event.count has reached the threshold required to report an error. Since the threshold for FailedScheduling is 1 (i.e. event.count must be at least 1 for an error to be reported) and the count never leaves 0, the error is never reported and the workspace eventually times out.
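
For illustration, here is a minimal sketch of that kind of threshold check and why a count stuck at 0 never trips it. This is not DWO's actual code; the function and map names are hypothetical, and the threshold values (1 for FailedScheduling, 3 for FailedMount) are taken from this issue's description.

```go
package main

import "fmt"

// failedEventThresholds maps event reasons to the number of occurrences
// required before the workspace is failed. The values come from this issue's
// description; the names here are hypothetical, not DWO's actual code.
var failedEventThresholds = map[string]int32{
	"FailedScheduling": 1,
	"FailedMount":      3,
}

// shouldFailWorkspace reports whether the event has occurred at least as many
// times as the threshold configured for its reason.
func shouldFailWorkspace(reason string, count int32) bool {
	threshold, tracked := failedEventThresholds[reason]
	if !tracked {
		return false
	}
	return count >= threshold
}

func main() {
	// On the affected OpenShift versions the FailedScheduling event arrives
	// with count == 0, so the check never trips and the workspace times out.
	fmt.Println(shouldFailWorkspace("FailedScheduling", 0)) // false
	fmt.Println(shouldFailWorkspace("FailedScheduling", 1)) // true
}
```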

I have been trying to narrow down the exact cause of this bug, but have been unsuccessful thus far.
However, here are my current findings:

  • This issue seems to be consistently reproducible on OpenShift 4.11 and 4.12. On OpenShift 4.10, the FailedScheduling event is sometimes properly caught, and in other cases, the workspace times out.
    • When this bug occurs (on OpenShift 4.10, 4.11 & 4.12) the event.count is set to 0 and event.series is nil (so event.series.count is unusable)
  • This issue does not occur on Minikube v1.27.1 with Kubernetes v1.25.2
    • When testing, the event.count was set to 1 upon first encountering the FailedScheduling event, and the count incremented with each occurrence of the event; event.series was nil.
  • The event.count field has been deprecated in Kubernetes v1.25, and the event.series.count field is supposed to be used instead. However, in my testing on Minikube, the event.series field was still nil (a defensive sketch for reading the count from either field is shown after this list).
    • The new event API is k8s.io/api/events/v1 (and the deprecated one which is currently in use in DWO comes from k8s.io/api/core/v1)
      • I have tried using k8s.io/api/events/v1 but encountered the same behaviour as k8s.io/api/core/v1: on affected OpenShift versions, event.count is set to 0 and event.series is nil. There is also auto-generated code to convert back and forth between core v1 events and events v1 (here and here)
  • The default Kubernetes scheduler (which is also used by OpenShift, unless configured otherwise) seems to use the following event recorder implementation, which does not initialize the event count field (thus, we'd expect event.count to be initialized to 0, but oddly, this is not the case on Kubernetes v1.25).
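
One defensive option (a sketch only, not DWO's current behaviour; getEventCount is a hypothetical helper) would be to derive the occurrence count from whichever of event.series.count or the deprecated event.count is populated, and to treat an event with neither field set as a single occurrence:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// getEventCount derives an occurrence count from a core/v1 Event, preferring
// the newer event.series.count over the deprecated event.count field. If
// neither is populated (as observed on the affected OpenShift versions), it
// assumes the event occurred at least once. This is a sketch, not DWO's
// current behaviour.
func getEventCount(ev *corev1.Event) int32 {
	if ev.Series != nil && ev.Series.Count > 0 {
		return ev.Series.Count
	}
	if ev.Count > 0 {
		return ev.Count
	}
	// The event was delivered but carries no usable count; treat it as one occurrence.
	return 1
}

func main() {
	// Simulate the problematic case: FailedScheduling with Count == 0 and no Series.
	ev := &corev1.Event{Reason: "FailedScheduling", Count: 0}
	fmt.Println(getEventCount(ev)) // 1
}
```

With something like this, the existing threshold comparison would at least see a non-zero count for the affected FailedScheduling events.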

Other notes:

  • @ibuziuk reported that the FailedScheduling event was being caught correctly (on CRC 4.10.14, I believe) when it was caused by a disk-pressure issue. I am not sure whether there is something particular about disk-pressure warnings that causes them to be caught, or if this is a case of the FailedScheduling event sometimes being caught on OpenShift 4.10.
  • I haven't been able to find any Kubernetes or OpenShift bugs directly related to this issue. However, I did find elastic/beats#31126 (Handle K8s events with invalid count or invalid Timestamps), which describes a similar issue with the event count being set to 0.
    • I considered using a similar approach to this PR; however, it could still break DWO's detection of the FailedMount event, as we allow 3 FailedMount events to occur before reporting an error.
    • I'm not sure if it is easy to reproduce the FailedMount event for testing purposes, but if its event count is being correctly set and incremented, then the above approach could suffice.
  • We could add extra checks (after checking pod events) on pod.status.conditions to see if the PodScheduled condition is set to False with the Unschedulable reason, and fail the workspace if this condition is found (a rough sketch is shown after this list). However, this seems less fine-grained than checking pod events. Perhaps there is another way to determine that a pod cannot be scheduled, other than checking pod events and pod status conditions?
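
As a rough illustration of that last idea, a check against pod.status.conditions could look something like the sketch below. isPodUnschedulable is a hypothetical helper, not existing DWO code; it relies only on the PodScheduled, ConditionFalse, and Unschedulable constants from k8s.io/api/core/v1.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isPodUnschedulable reports whether the pod's PodScheduled condition is False
// with the Unschedulable reason, which is what the scheduler sets when no node
// can satisfy the pod's resource requests. It also returns the condition message.
func isPodUnschedulable(pod *corev1.Pod) (bool, string) {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == corev1.PodReasonUnschedulable {
			return true, cond.Message
		}
	}
	return false, ""
}

func main() {
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			Conditions: []corev1.PodCondition{{
				Type:    corev1.PodScheduled,
				Status:  corev1.ConditionFalse,
				Reason:  corev1.PodReasonUnschedulable,
				Message: "0/6 nodes are available: 6 Insufficient cpu.",
			}},
		},
	}
	if failed, msg := isPodUnschedulable(pod); failed {
		fmt.Println("workspace should be failed:", msg)
	}
}
```

The workspace reconcile loop could then fail the workspace with the returned message whenever this check reports true.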

How To Reproduce

Apply either of the following DevWorkspaces to an OpenShift 4.11 or 4.12 cluster (on 4.10, this issue may or may not occur), and observe that they time out rather than failing immediately:

```yaml
kind: DevWorkspace
apiVersion: workspace.devfile.io/v1alpha2
metadata:
  name: theia-next-high-cpu
spec:
  started: true
  template:
    projects:
      - name: web-nodejs-sample
        git:
          remotes:
            origin: "https://github.com/che-samples/web-nodejs-sample.git"
    components:
      - name: theia
        plugin:
          uri: https://che-plugin-registry-main.surge.sh/v3/plugins/eclipse/che-theia/next/devfile.yaml
          components:
            - name: theia-ide
              container:
                env:
                  - name: THEIA_HOST
                    value: 0.0.0.0
                memoryRequest: 2Gi
                memoryLimit: 16Gi
                cpuRequest: 4000m
                cpuLimit: 8000m
    commands:
      - id: say-hello
        exec:
          component: theia-ide
          commandLine: echo "Hello from $(pwd)"
          workingDir: ${PROJECTS_ROOT}/project/app
```

```yaml
kind: DevWorkspace
apiVersion: workspace.devfile.io/v1alpha2
metadata:
  name: theia-next-high-mem
spec:
  started: true
  template:
    projects:
      - name: web-nodejs-sample
        git:
          remotes:
            origin: "https://github.com/che-samples/web-nodejs-sample.git"
    components:
      - name: theia
        plugin:
          uri: https://che-plugin-registry-main.surge.sh/v3/plugins/eclipse/che-theia/next/devfile.yaml
          components:
            - name: theia-ide
              container:
                env:
                  - name: THEIA_HOST
                    value: 0.0.0.0
                memoryRequest: 32Gi
                memoryLimit: 64Gi
                cpuRequest: 500m
                cpuLimit: 2000m
    commands:
      - id: say-hello
        exec:
          component: theia-ide
          commandLine: echo "Hello from $(pwd)"
          workingDir: ${PROJECTS_ROOT}/project/app
```

When checking the workspace pod with kubectl/oc or the OpenShift UI, you can see that the pod is pending and that an event similar to the following has occurred:
```
0/6 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 6 Insufficient cpu. preemption: 0/6 nodes are available: 3 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
```


Expected behaviour

The FailedScheduling event should be caught and reported, and the workspace should be failed immediately.

Additional context

Downstream issue

AObuchow added the bug (Something isn't working) label on Nov 14, 2022

ibuziuk commented Nov 15, 2022

@AObuchow thank you for your investigation. I believe we need to report the issue to the OpenShift Bugzilla if the event count is not being initialized / incremented correctly.

On the DWO end, in order to mitigate the issue, I would propose adding something like UnrecoverablePodConditions and tracking Unschedulable in a similar fashion to how we track UnrecoverableEvents (a hypothetical sketch follows the status snippet below). @l0rd @amisevsk wdyt?

```yaml
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2022-11-15T13:56:48Z'
      reason: Unschedulable
      message: >-
        0/15 nodes are available: 3 node(s) had taint
        {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3
        node(s) had taint {node-role.kubernetes.io/master: }, that the pod
        didn't tolerate, 9 Insufficient memory.
  qosClass: Burstable
```
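
For illustration, a hypothetical sketch of what such a configurable check could look like. UnrecoverablePodConditions is not an existing DWO configuration field; all names and values here are illustrative only.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// unrecoverablePodConditions is a hypothetical, configurable list of pod
// condition reasons that should fail the workspace immediately, mirroring how
// unrecoverable events are tracked today.
var unrecoverablePodConditions = []string{"Unschedulable"}

// checkPodConditions returns a non-empty message when the pod reports a False
// condition whose reason is in the unrecoverable list.
func checkPodConditions(pod *corev1.Pod) string {
	for _, cond := range pod.Status.Conditions {
		if cond.Status != corev1.ConditionFalse {
			continue
		}
		for _, reason := range unrecoverablePodConditions {
			if cond.Reason == reason {
				return fmt.Sprintf("unrecoverable pod condition %s: %s", cond.Reason, cond.Message)
			}
		}
	}
	return ""
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{Conditions: []corev1.PodCondition{{
		Type:    corev1.PodScheduled,
		Status:  corev1.ConditionFalse,
		Reason:  "Unschedulable",
		Message: "0/15 nodes are available: 9 Insufficient memory.",
	}}}}
	if msg := checkPodConditions(pod); msg != "" {
		fmt.Println(msg)
	}
}
```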

amisevsk commented

I don't recall why we didn't initially have an implementation that looks at pod conditions as well as deployment conditions, but this approach makes sense to me.

AObuchow added commits to AObuchow/devworkspace-operator that referenced this issue on Nov 15, 2022

l0rd commented Nov 15, 2022

Checking the pod condition looks like a great idea.

I also think that checking whether event.count is higher than a threshold is not required. A FailedScheduling event should be reported to the developer by the dashboard as early as possible (and the developer can then decide to wait, retry, or cancel).

amisevsk commented

> I also think that checking whether event.count is higher than a threshold is not required. A FailedScheduling event should be reported to the developer by the dashboard as early as possible (and the developer can then decide to wait, retry, or cancel).

The reason we have the threshold mechanism is that we were seeing random workspace failures due to occasional FailedMount events -- see #778. However, if the count is never being incremented anymore, that functionality likely does not work anyway.

Another solution here might be to use thresholds of 0 instead of 1; that would amount to not checking the threshold at all.


AObuchow commented Feb 16, 2023

The fix for this issue (0cad9a0) is effectively made obsolete by the fix for #987. However, 0cad9a0 has the downside that the user will be unable to ignore the FailedScheduling event if they choose to do so (because the PodUnschedulable condition will still be caught).

Thus, I think it's worth reverting the work done in 0cad9a0, as it has no added benefit and has the disadvantage of preventing users from ignoring the FailedScheduling event.

See #1046
