Seldon core operator is restarting due to failed renewal of lease #4147

sujaykulkarn · 2022-06-13T07:10:18Z

Describe the bug

The Seldon Core operator pod keeps restarting because it fails to retrieve the leader election resource lock. The error logs are below.

{"version": "1.0", "level": "INFO", "host": "ccs-seldon-75dfcb5bf9-bwkfv.ccs", "system": "ml-inf-seldon", "type": "log", "log": {"message": "E0610 16:36:08.554283 7 leaderelection.go:325] error retrieving resource lock ccs/a33bd623.machinelearning.seldon.io: Get \"https://10.254.0.1:443/apis/coordination.k8s.io/v1/namespaces/ccs/leases/a33bd623.machinelearning.seldon.io\": context deadline exceeded"}, "time": "2022-06-10T16:36:09.511Z"}
{"version": "1.0", "level": "INFO", "host": "ccs-seldon-75dfcb5bf9-bwkfv.ccs", "system": "ml-inf-seldon", "type": "log", "log": {"message": "I0610 16:36:08.554523 7 leaderelection.go:278] failed to renew lease ccs/a33bd623.machinelearning.seldon.io: timed out waiting for the condition"}, "time": "2022-06-10T16:36:09.512Z"}
{"version": "1.0", "level": "INFO", "host": "ccs-seldon-75dfcb5bf9-bwkfv.ccs", "system": "ml-inf-seldon", "type": "log", "log": {"message": "setup : problem running manager"}, "time": "2022-06-10T16:36:09.512Z"}

I would like more insight into this issue. Is it related to kubernetes/client-go#966?

To reproduce

Install the Seldon chart with 2 replicas and keep it running for 2-3 days. One or more restarts will occur.

Expected behaviour

The Seldon pod should not restart. It should retry the lease renewal instead.

Environment

Cloud Provider: Bare Metal
Kubernetes Cluster Version:
[root:ccs-01-control-01 /root]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.5", GitCommit:"e338cf2c6d297aa603b50ad3a301f761b4173aa6", GitTreeState:"clean", BuildDate:"2020-12-09T11:18:51Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.5", GitCommit:"e338cf2c6d297aa603b50ad3a301f761b4173aa6", GitTreeState:"clean", BuildDate:"2020-12-09T11:10:32Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Deployed Seldon System Images: v1.11.0

Model Details

Images of your model: NA
Logs of your model: NA

ukclivecox · 2022-06-24T06:33:59Z

Is this related to kubernetes-sigs/kubebuilder#2604?
I imagine it's a kubebuilder or controller-runtime issue.

ukclivecox · 2022-07-06T18:49:38Z

Is there anything particular about your cluster that would mean resource locks fail?

ukclivecox · 2022-07-06T18:54:10Z

Similar issue: kedacore/keda#2836

ukclivecox · 2022-07-06T18:55:53Z

One option might be to allow longer deadlines, so that users can cope with noisy networks or other network issues in their clusters?
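
For anyone tuning such deadlines, here is a rough sketch of the constraints that client-go's leader election enforces between its three timing values. It assumes the controller-runtime defaults of 15s lease duration, 10s renew deadline and 2s retry period, and client-go's jitter factor of 1.2. The class and function names are hypothetical and only for illustration; this is not the operator's code, and it does not show how Seldon exposes these settings.

```python
from dataclasses import dataclass

# Mirrors client-go's leaderelection.JitterFactor (assumed value: 1.2).
JITTER_FACTOR = 1.2


@dataclass(frozen=True)
class LeaderElectionTimings:
    """Hypothetical container for the three timings, in seconds.

    Defaults follow the controller-runtime defaults (15s / 10s / 2s).
    """
    lease_duration: float = 15.0  # how long non-leaders wait before taking over
    renew_deadline: float = 10.0  # how long the leader keeps retrying a renewal
    retry_period: float = 2.0     # pause between renewal attempts


def validate(t: LeaderElectionTimings) -> list:
    """Return a list of problems; an empty list means the timings are consistent."""
    errors = []
    if t.lease_duration <= 0 or t.renew_deadline <= 0 or t.retry_period <= 0:
        errors.append("all timings must be greater than zero")
    if t.lease_duration <= t.renew_deadline:
        errors.append("lease_duration must be greater than renew_deadline")
    if t.renew_deadline <= JITTER_FACTOR * t.retry_period:
        errors.append("renew_deadline must be greater than retry_period * JITTER_FACTOR")
    return errors


if __name__ == "__main__":
    # A more tolerant setting for a slow or heavily loaded API server.
    relaxed = LeaderElectionTimings(lease_duration=60, renew_deadline=40, retry_period=5)
    print(validate(relaxed))
    # An inconsistent setting: the lease would expire before the leader gives up renewing.
    print(validate(LeaderElectionTimings(lease_duration=10, renew_deadline=10, retry_period=2)))
```

A longer renew deadline gives the leader more time to get through to a slow API server before it gives up and the manager exits. The trade-off is that a longer lease duration delays failover when the leader really has died.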

sujaykulkarn · 2022-07-15T02:40:47Z

Hi @cliveseldon @axsaucedo, many thanks for the change.
These changes were much needed, as clusters are sometimes under heavy load, and with these parameters it will be easy to control the leader election process for Seldon. One small query: is there any documentation for the above fix? Thank you.

ukclivecox · 2022-07-15T06:11:18Z

There are no explicit docs at present. Setting these values requires understanding the k8s leader election process from the controller-runtime docs. I look forward to hearing how you get on. Adding docs from your experience as a PR would also be welcome. Feel free to open an issue.

sujaykulkarn · 2022-08-16T09:09:29Z

Sure, thank you. May I know when the Seldon 1.15 release is planned?

sujaykulkarn added the bug label Jun 13, 2022

ukclivecox self-assigned this Jun 27, 2022

ukclivecox mentioned this issue Jul 8, 2022

Allow leader election controls for manager #4211

Merged

axsaucedo closed this as completed in #4211 Jul 14, 2022
