Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove grpcServer.Stop() when stream.Send() fails on ListAndWatch #412

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

syaganti
Copy link

The error handling of failures to stream.Send(resp) during ListAndWatch is not completely reliable.

Take for instance, if stream.Send() fails and kubelet never restarts, then stopping the grpcServer will mean device plugin will never reconnect with kubelet because the socket will still exist in /var/lib/kubelet/device-plugins/ and therefore device plugin will not trigger itself to re-register with kubelet and the grpc server is never started again.

the grpcServer is only created during registeration so in this case we will be stuck in a state where the grpcServer is stopped with no way of starting it again. This will put Device Plugins in an unrecoverable state.

Instead we should not stop the grpcServer on failure of send and continue with ListAndWatch logic and retry send on the next device health update.

The error handling of failures to stream.Send(resp) during ListAndWatch is not completely reliable.

Take for instance, if stream.Send() fails and kubelet never restarts, then stopping the grpcServer will mean device plugin will never reconnect with kubelet because the socket will still exist in `/var/lib/kubelet/device-plugins/` and therefore device plugin will not trigger itself to re-register with kubelet and the grpc server is never started again.

the grpcServer is only created during registeration so in this case we will be stuck in a state where the grpcServer is stopped with no way of starting it again. This will put Device Plugins in an unrecoverable state. 

Instead we should not stop the grpcServer on failure of send and continue with ListAndWatch logic and retry send on the next device health update.
@SergeyKanzhelev
Copy link

/cc @ffromani

Interestingly the NVIDIA version is doing something similar: https://github.com/NVIDIA/k8s-device-plugin/blob/620aaed3474d6dca586451ec63ca8e6b94100410/internal/plugin/server.go#L278C5-L278C15 but not the same.

I wonder what is the best practice here. I feel like retry like suggested in this PR is a good way to go. @syaganti maybe it is a good idea to file an issue in NVIDIA version of a device plugin as well.

@ffromani
Copy link

ffromani commented Oct 23, 2024

/cc @ffromani

Interestingly the NVIDIA version is doing something similar: https://github.com/NVIDIA/k8s-device-plugin/blob/620aaed3474d6dca586451ec63ca8e6b94100410/internal/plugin/server.go#L278C5-L278C15 but not the same.

I wonder what is the best practice here. I feel like retry like suggested in this PR is a good way to go. @syaganti maybe it is a good idea to file an issue in NVIDIA version of a device plugin as well.

@SergeyKanzhelev in general I'd support an organized effort from us in the kubernetes 1.33 cycle to improve things in this area, there are multiple related issues/efforts ongoing. Maybe we need a KEP or something more structured.

I'll dig into git history and into past conversation and comment about recommendations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants