Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-watch events when disconnected from k8s API #570

Closed
wants to merge 4 commits into from
Closed

Re-watch events when disconnected from k8s API #570

wants to merge 4 commits into from

Conversation

yuusaku-t
Copy link
Contributor

@yuusaku-t yuusaku-t commented Jun 23, 2020

If any errors occur and kubernetes client is disconnected from the API, EventBasedConfigurationChangeDetector should re-watch the events for external resources like ConfigMap. This patch implements the onClose methods of Watcher for the error handling:

/**
 * Run when the watcher finally closes.
 *
 * @param cause What caused the watcher to be closed. Null means normal close.
 */
void onClose(KubernetesClientException cause);

ref: https://github.com/fabric8io/kubernetes-client/blob/v4.4.0/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/Watcher.java#L27

An example of the errors

In my case, the following exception is passed to onClose method under the same situation as reported in #558:

io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 35168187 (41169815)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) ~[kubernetes-client-4.9.0.jar!/:na]
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) ~[okhttp-3.12.0.jar!/:na]
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) ~[okhttp-3.12.0.jar!/:na]
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) ~[okhttp-3.12.0.jar!/:na]
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) ~[okhttp-3.12.0.jar!/:na]
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) ~[okhttp-3.12.0.jar!/:na]
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) ~[okhttp-3.12.0.jar!/:na]
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) ~[okhttp-3.12.0.jar!/:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:834) ~[na:n

The kubernetes client will not automatically reconnect when such an error occurs. The following code is causing the reconnection to be interrupted:

https://github.com/fabric8io/kubernetes-client/blob/v4.4.0/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java#L285-L292

Releated Issues

if (log.isDebugEnabled()) {
log.warn("Try to re-watch: " + name, error);
}
watchConfigMap(name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not get into some kind of infinite loop where we indefinitely retry here? It would be nice if we could add some backoff functionality or even just try a set number of times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll try to implement backoff because the errors reported in #558 occur periodically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added backoff functionality by ba47a20. Please check it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryanjbaxter Should I make any additional changes?

@brachipa
Copy link

@ryanjbaxter @yuusaku-t seems like I have same issue, watcher failed and doesn't reconnect again, what is the status of that PR? will it be merged soon?

@yuusaku-t
Copy link
Contributor Author

I'm waiting for a re-review.

Copy link
Contributor

@ryanjbaxter ryanjbaxter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow response! Could you add some tests?

@yuusaku-t
Copy link
Contributor Author

@ryanjbaxter I added test cases by 613eb3d. Could you please check?

@ryanjbaxter
Copy link
Contributor

@yuusaku-t I was reviewing this with other team members and they were wondering if we could use Spring Retry? This could simplify the implementation a bit. Did you look into that at all?

@yuusaku-t
Copy link
Contributor Author

@ryanjbaxter I hadn't thought about using Spring Retry at all, because it's not included in the spring-cloud-kubernetes-config project's dependency libraries. Let me see if I can simplify the implementation using Spring Retry.

@ryanjbaxter
Copy link
Contributor

@yuusaku-t thanks, one thought might to make it optional and have retry implementation of spring retry is on the class path

@ryanjbaxter
Copy link
Contributor

Spring Retry is also being used in this PR #648

@whgibbo
Copy link

whgibbo commented Feb 22, 2021

Will this also address #730

@yuusaku-t
Copy link
Contributor Author

@ryanjbaxter Sorry for the delay in response. Could you please check 57bf977 ?
I didn't use Spring Retry, but I think it's a little bit cleaner. I've tried various approaches to see if I can implement this using Spring Retry, but it's difficult and more time consuming. I would like to remove it from the scope of this pull request.

@PetricHwang
Copy link

I have also encountered this problem.

@PetricHwang
Copy link

When will it be fixed and released?

@ryanjbaxter
Copy link
Contributor

I actually think this was addressed in this PR #842

It should be in our 2020.0.4 release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants