Workspace pod FailedScheduling event not being properly caught #977
Comments
@AObuchow thank you for your investigation. I believe we need to report the issue to OpenShift BZ if the event count is not initialized / incremented correctly. On the DWO end, in order to mitigate the issue, I would propose to have something like:
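A minimal sketch of one possible mitigation along these lines, with hypothetical names and not necessarily what was originally proposed: fall back to a sane occurrence count when the cluster reports `event.count` as 0.

```go
package events

import corev1 "k8s.io/api/core/v1"

// effectiveCount is a hypothetical helper: it returns a usable occurrence
// count for an event, preferring event.series.count when it is present and
// larger, and falling back to 1 when the cluster reports count as 0 (as
// observed on the affected OpenShift versions).
func effectiveCount(ev *corev1.Event) int32 {
	if ev.Series != nil && ev.Series.Count > ev.Count {
		return ev.Series.Count
	}
	if ev.Count == 0 {
		// The event exists, so it must have occurred at least once.
		return 1
	}
	return ev.Count
}
```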
I don't recall why we didn't initially have an implementation that looks at pod conditions as well as deployment conditions, but this approach makes sense to me.
Fix devfile#977 Signed-off-by: Andrew Obuchowicz <[email protected]>
Looking at the pod condition seems like a great idea. I also think that checking that […]
The reason we have the threshold mechanism is that we were seeing random workspace failures due to occasional […] events. Another solution here might be to use thresholds of […].
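As an illustration of the per-reason threshold idea only (names are hypothetical; the values mirror the thresholds mentioned elsewhere in this issue, not an agreed design):

```go
package events

// unrecoverableEventThresholds is a hypothetical per-reason threshold map:
// an event reason only fails the workspace once it has been observed at
// least this many times. The values mirror the behaviour described in this
// issue (FailedScheduling fails immediately, FailedMount is allowed to
// occur 3 times) and are illustrative only.
var unrecoverableEventThresholds = map[string]int32{
	"FailedScheduling": 1,
	"FailedMount":      3,
}

// shouldFailWorkspace reports whether an event with the given reason and
// observed count meets its failure threshold.
func shouldFailWorkspace(reason string, count int32) bool {
	threshold, tracked := unrecoverableEventThresholds[reason]
	return tracked && count >= threshold
}
```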
The fix for this issue (0cad9a0) is effectively made obsolete by the fix for #987. However, 0cad9a0 has the downside that the user will be unable to ignore the `FailedScheduling` event. Thus, I think it's worth reverting the work done in 0cad9a0, as it has no added benefits and has the disadvantage of preventing users from ignoring the `FailedScheduling` event. See #1046
Description
In some cases, a workspace deployment's pod will trigger the `FailedScheduling` event and DWO will not be able to properly catch it. So far, this seems to occur when a workspace requests more memory or CPU than is available on any node on an OpenShift cluster.
What I have observed is that, although DWO sees the pod's `FailedScheduling` event, the `event.count` is set to 0. DWO then checks to see if the `event.count` has reached the threshold required to report an error. Since the threshold is 1 (i.e. `event.count` must be at least 1 for an error to be reported), the error is never reported, and eventually, the workspace times out.

I have been trying to narrow down the exact cause of this bug, but have been unsuccessful thus far.
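To make the failure mode concrete, here is a minimal sketch of the kind of check described above (hypothetical names, not the actual DWO implementation); with `event.count` stuck at 0 and a threshold of 1, the failure branch is never taken:

```go
package events

import corev1 "k8s.io/api/core/v1"

// unrecoverableEventThreshold: the event must have occurred at least once
// before the workspace is failed.
const unrecoverableEventThreshold = 1

// isUnrecoverable reports whether a FailedScheduling event should fail the
// workspace. On the affected OpenShift versions ev.Count is always 0, so
// this never returns true and the workspace times out instead.
func isUnrecoverable(ev *corev1.Event) bool {
	return ev.Reason == "FailedScheduling" && ev.Count >= unrecoverableEventThreshold
}
```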
However, here are my current findings:
- The `FailedScheduling` event is sometimes properly caught, and in other cases, the workspace times out.
- On the affected OpenShift versions, `event.count` is set to 0 and `event.series` is `nil` (so `event.series.count` is unusable).
- On Minikube, `event.count` was set to 1 upon first encountering the `FailedScheduling` event, and the count increments with each occurrence of the event. The `event.series` is `nil`.
- The `event.count` field has been deprecated in Kubernetes v1.25, and instead the `event.series.count` field is supposed to be used. However, in my testing on Minikube, the `event.series` field was still `nil`.
- The newer events API comes from `k8s.io/api/events/v1` (and the deprecated one which is currently in use in DWO comes from `k8s.io/api/core/v1`).
- I tried using `k8s.io/api/events/v1` but encountered the same behaviour as `k8s.io/api/core/v1`: on affected OpenShift versions, `event.count` is set to 0 and `event.series` is nil. There is also auto-generated code to convert back and forth between core v1 events and events v1 (here and here). A sketch of inspecting both count fields with the events/v1 client follows this list.
- Something in the kube-scheduler appears to cause `event.count` to be initialized to 0 (but oddly, this is not the case on Kubernetes v1.25). Tracing through the kube-scheduler code:
  - `runCommand` -> `Setup` is called, which executes `options.Config()` -> the kube-scheduler's event broadcaster is initialized as an event broadcaster adapter. The event broadcaster adapter checks if the cluster supports `k8s.io/api/events/v1` resources (i.e. the new events API), and if so, an events/v1 EventBroadcaster will be used (otherwise a core/v1 events EventBroadcaster is used). In my testing on OpenShift 4.11 and Minikube, `k8s.io/api/events/v1` resources were supported, so the new events API should be in use by the EventBroadcaster.
  - The scheduler obtains a `recorderFactory` from the `k8s.io/api/events/v1` EventBroadcaster. This `recorderFactory` is used by the Kubernetes scheduler struct: a map of scheduling profiles to frameworks allows the association of pods to the `k8s.io/api/events/v1` event recorder when a pod is being scheduled. Finally, when the scheduler's failure handler is run, the `FailedScheduling` event is created using the `k8s.io/api/events/v1` event recorder.
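For reference, a minimal sketch (not DWO code; the namespace and pod name are placeholders) of listing `events.k8s.io/v1` events for a workspace pod and printing both the deprecated count and the series count, which is how the observations above can be checked:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (falls back to in-cluster
	// config if KUBECONFIG is unset).
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	namespace, podName := "my-workspace-namespace", "my-workspace-pod" // placeholders

	// List events.k8s.io/v1 events in the namespace and filter client-side
	// for those regarding the workspace pod.
	events, err := client.EventsV1().Events(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, ev := range events.Items {
		if ev.Regarding.Kind != "Pod" || ev.Regarding.Name != podName {
			continue
		}
		seriesCount := int32(0)
		if ev.Series != nil {
			seriesCount = ev.Series.Count
		}
		fmt.Printf("%s: deprecatedCount=%d seriesCount=%d note=%q\n",
			ev.Reason, ev.DeprecatedCount, seriesCount, ev.Note)
	}
}
```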
Other notes:
- The `FailedScheduling` event was being caught correctly (on CRC 4.10.14, I believe) when it was caused by a disk pressure issue. I am not sure whether there is something particular about disk-pressure warnings that causes them to be caught, or if this is a case of the `FailedScheduling` event being sometimes caught on OpenShift 4.10.
- This bug could also affect handling of the `FailedMount` event, as we allow 3 `FailedMount` events to occur before reporting an error.
- I have not been able to reproduce the `FailedMount` event for testing purposes, but if its event count is being correctly set and incremented, then the above approach could suffice.
- A potential workaround would be to check `pod.status.conditions` to see if the `PodScheduled` condition is set to `False` with the `Unschedulable` reason, and fail the workspace if this condition is found (see the sketch after this list). However, this seems less fine-grained than checking pod events. Perhaps there's another way to determine if a pod will not be able to be scheduled, other than checking pod events and pod status conditions?
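A minimal sketch of the pod-condition workaround described in the last bullet (the function name is hypothetical; an actual implementation may differ):

```go
package events

import corev1 "k8s.io/api/core/v1"

// isUnschedulable is a sketch of the workaround described above: instead of
// relying on event counts, inspect the pod's status conditions and treat
// PodScheduled=False with reason Unschedulable as an unrecoverable state.
// It returns the condition's message so it can be surfaced in the
// workspace's failure status.
func isUnschedulable(pod *corev1.Pod) (bool, string) {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == corev1.PodReasonUnschedulable {
			return true, cond.Message
		}
	}
	return false, ""
}
```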
How To Reproduce

Apply either of the following devworkspaces to an OpenShift 4.11 or 4.12 cluster (on 4.10, this issue may or may not occur), and see that they will time out rather than fail immediately:
When checking the workspace pod with kubectl/oc or the OpenShift UI, you can see that the pod is pending and that an event similar to the following has occurred:
```
0/6 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 6 Insufficient cpu. preemption: 0/6 nodes are available: 3 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
```
Expected behaviour

The `FailedScheduling` event should be caught and reported, and the workspace should be failed immediately.

Additional context
Downstream issue