Operator crashing after sometime #188

Closed
SaikiranDaripelli opened this issue Sep 16, 2020 · 16 comments · Fixed by #235
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@SaikiranDaripelli

SaikiranDaripelli commented Sep 16, 2020

Hi,
We have an operator written with this SDK, and the operator pod restarts every few hours with the exception below:

2020-09-16 07:51:39,953 i.f.k.c.d.i.WatchConnectionManager [DEBUG] Current reconnect backoff is 1000 milliseconds (T0)
2020-09-16 07:51:40,953 i.f.k.c.d.i.WatchConnectionManager [DEBUG] Connecting websocket ... io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@71e4b308
2020-09-16 07:51:41,003 i.f.k.c.d.i.WatchConnectionManager [DEBUG] WebSocket successfully opened
2020-09-16 07:51:41,018 c.g.c.o.p.EventScheduler       [ERROR] Error:
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 22472056 (22832853)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:257)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

The code I am using is:

        KubernetesClient client = new DefaultKubernetesClient();
        Operator operator = new Operator(client);
        operator.registerController(new KafkaTopicController(client));

Am I using it wrong?

@adam-sandor
Collaborator

Hi Saikiran, I don't think you're doing anything wrong. The error is happening at the fabric8 level. Let us get back to you ASAP with some ideas.

@csviri
Collaborator

csviri commented Sep 16, 2020

Hi @SaikiranDaripelli ,
Thank you for reporting this; you are using it correctly. Unfortunately, this is a known issue, not in our code but in the Kubernetes client we use: fabric8io/kubernetes-client#1800. It is not handled there yet; see also:
spring-cloud/spring-cloud-kubernetes#557
fabric8io/kubernetes-client#1318

For now, restarting the process is a simple workaround on our side; see:

https://github.com/ContainerSolutions/java-operator-sdk/blob/39107a309514a75f1c9fed745f7aa1de1bf4301c/operator-framework/src/main/java/com/github/containersolutions/operator/processing/EventScheduler.java#L142-L148

We will try to take a look at this soon.

@csviri
Collaborator

csviri commented Sep 16, 2020

@adam-sandor we could try to reconnect automatically on our side, but that should wait until the changes currently in progress are done.
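
Editor's sketch of the reconnect idea mentioned above, as an alternative to restarting the whole process. Everything here is hypothetical: `EventWatch`, `WatchExpiredException`, and `ReconnectingWatch` are illustrative stand-ins, not the fabric8 client API and not the SDK's eventual fix. It only shows the control flow of reopening a watch when the server reports a too-old resource version, with a cap on attempts.

```java
import java.util.function.Consumer;

// Hypothetical exception standing in for the "too old resource version" error.
class WatchExpiredException extends RuntimeException {
    WatchExpiredException(String message) {
        super(message);
    }
}

// Hypothetical watch abstraction; this is NOT the fabric8 API.
interface EventWatch {
    // Delivers events to the handler while the watch is open.
    // Returns when the watch closes normally and throws
    // WatchExpiredException when the stored resource version is rejected.
    void open(Consumer<String> handler);
}

class ReconnectingWatch {
    private final EventWatch delegate;
    private final int maxReconnects;
    private int reconnects;

    ReconnectingWatch(EventWatch delegate, int maxReconnects) {
        this.delegate = delegate;
        this.maxReconnects = maxReconnects;
    }

    // Reopens the watch after an expiry instead of letting the process die.
    // Gives up and rethrows once maxReconnects has been reached.
    void run(Consumer<String> handler) {
        while (true) {
            try {
                delegate.open(handler);
                return; // watch closed normally
            } catch (WatchExpiredException e) {
                if (reconnects >= maxReconnects) {
                    throw e;
                }
                reconnects++;
                // A real implementation would likely back off and re-list
                // the resources here before opening a fresh watch.
            }
        }
    }

    int reconnects() {
        return reconnects;
    }
}
```

With this shape, a caller wraps whatever opens the watch in an `EventWatch` and calls `run` once; only a persistent failure propagates, so a single expired watch no longer needs a pod restart.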

@adam-sandor
Collaborator

Yeah it would be great if we could do something about this. I guess many users of the KubernetesClient don't have this problem as they don't watch things for a long time, but an operator does that by definition.

@SaikiranDaripelli
Author

Thanks for answering my query. I went through the fabric8 issue, and they seem to suggest handling it on the client side.
Retries would definitely help. With restarts, I see that every controller's createOrUpdate gets called after each restart even though the resource has not changed. Is that expected, and will it also happen with retries?
I am using the status sub-resource and storing the last successfully processed version in the status to avoid reprocessing. Is that the suggested way to work around duplicate events?
Will this improvement address it? https://github.com/ContainerSolutions/java-operator-sdk/issues/38

@csviri
Collaborator

csviri commented Sep 16, 2020

@SaikiranDaripelli in short, no: by default we check whether the generation has increased, and in this case it won't have. (That check can be turned off, in which case the event is reprocessed, because we cannot know whether it arrived during a controller execution or not.) For this check we keep the state (the last processed generation) in memory.

The issue https://github.com/ContainerSolutions/java-operator-sdk/issues/38
is a tougher one: we cannot keep that state in memory, since the process gets restarted. It could be stored somewhere else, such as a ConfigMap or some data store. We don't plan to implement this in the short term, although we are open to suggestions and/or contributions.
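
Editor's sketch of the in-memory generation check described in this comment. This is not the SDK's code; `GenerationFilter` and its method names are hypothetical. It illustrates why the check suppresses duplicate events while the process is alive, and why the state is gone after a restart.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers the last processed generation per resource.
class GenerationFilter {
    // Key: a resource identifier (for example its UID); value: last processed generation.
    private final Map<String, Long> lastProcessed = new HashMap<>();

    // True if the resource was never processed or its generation has increased.
    boolean shouldProcess(String resourceId, long generation) {
        Long last = lastProcessed.get(resourceId);
        return last == null || generation > last;
    }

    // Records a successful controller execution for this generation.
    void markProcessed(String resourceId, long generation) {
        lastProcessed.merge(resourceId, generation, Long::max);
    }
}
```

Because the map lives on the heap, a fresh `GenerationFilter` after a restart treats every resource as unprocessed, which matches the reprocessing on restart reported earlier in the thread.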

@SaikiranDaripelli
Author

Thanks. Retries without a restart will then solve my current issue,
with occasional reprocessing only on restart, which is fine for my use case.

Regarding storing state: can't the SDK itself do what I am doing right now as a workaround, i.e. store the last successfully processed generation in the status sub-resource after a successful controller execution, and discard an event if the current generation matches the one in the status sub-resource?
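
Editor's sketch of the workaround described above, using a plain in-memory model rather than a real custom resource. The names `TopicResource`, `observedGeneration`, and `StatusGenerationGuard` are assumptions for illustration; the thread does not show the actual field name used in the status.

```java
// Hypothetical, simplified model of a custom resource.
class TopicResource {
    long generation;          // stands in for metadata.generation
    Long observedGeneration;  // stands in for a status field; null until the first success
}

class StatusGenerationGuard {
    // Runs the reconcile action only if the status does not already record
    // the current generation. Returns true if the action was executed.
    static boolean reconcileIfNeeded(TopicResource resource, Runnable reconcile) {
        Long observed = resource.observedGeneration;
        if (observed != null && observed.longValue() == resource.generation) {
            return false; // already processed; discard the event
        }
        reconcile.run();
        // Recorded only after success; in a real operator this would be a
        // status sub-resource update sent to the API server.
        resource.observedGeneration = resource.generation;
        return true;
    }
}
```

Since the marker travels with the resource instead of the process, the check still holds after a restart, at the cost of one extra status write per successful execution.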

@csviri
Collaborator

csviri commented Sep 16, 2020

@SaikiranDaripelli this could be done, though it would be nicer if we could do it transparently. With the approach you are suggesting, we should probably provide some interface for reading the latest generation from the resource (the name of the field can differ between users). So this is definitely one of the ways to go.

We will take a look once the changes we are currently working on are done.

@adam-sandor
Collaborator

How about putting that into an annotation?

@PookiPok

Hi, I am encountering the same issue with the version not matching. @SaikiranDaripelli, can you please share how you solved this on your end? Is there any other solution for this?

@SaikiranDaripelli
Author

@PookiPok Right now there is no way to stop the operator controller from restarting.

charlottemach added the kind/bug label on Sep 24, 2020

@PookiPok

@SaikiranDaripelli - So is there any workaround for this for now?

@csviri
Collaborator

csviri commented Sep 29, 2020

@PookiPok @SaikiranDaripelli restarting the controller is basically the workaround (it restarts, but at least the system does not stop working) :(

We can try to improve on this in the current version, but we are working on a big change now, and it will be easier to fix there.

@PookiPok

Thank you, waiting for this fix in the next release.
Gil

@ankeetj

ankeetj commented Nov 10, 2020

@csviri

I'm facing a similar issue with my operator. Because of the restarts, the pod ends up in crash loop status. Is there any update on the fix, or any workaround we can use?

@csviri
Collaborator

csviri commented Nov 10, 2020

@adam-sandor @charlottemach @kirek007 We should consider fixing this in the current version (before the event sources are released, since that might take a long time).
@ankeetj not at this moment. We will discuss it and might provide a patch sooner than planned.

csviri linked a pull request on Nov 25, 2020 that will close this issue
6 participants