Two SharedInformer issues related to kube-apiserver unavailable and relisting #1961

Closed
kolorful opened this issue Jan 22, 2020 · 2 comments · Fixed by #2022

@kolorful

kolorful commented Jan 22, 2020

Previously discussed in #1943

I encountered two issues related to shared informer (v4.7.0) when conducting destructive testing on our Kubernetes cluster.

  1. The first issue is that the shared informer does not relist automatically after receiving a 410 Gone. How to reproduce:
  • start a shared informer for Job and launch a Job
  • stop the kube-apiserver process
  • restart the kube-apiserver process after a few seconds
  • the same error keeps showing up in the logs and the informer always returns true for hasSynced(), but the cache does not receive any updates until the next re-list.

Is it possible to let the shared informer detect this condition and start a re-list right away?

In addition, the error appears to fall into the wrong if-condition block.

Jan 15 05:40:08: [OkHttp https://127.0.0.1:6443/...] ERROR io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager - Could not deserialize watch event: {"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 544015 (544921)","reason":"Gone","code":410}}
Jan 15 05:40:08: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot construct instance of `io.fabric8.kubernetes.api.model.NodeStatus` (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('Failure')
Jan 15 05:40:08: at [Source: (String)"{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 544015 (544921)","reason":"Gone","code":410}"; line: 1, column: 59] (through reference chain: io.fabric8.kubernetes.api.model.Node["status"])
Jan 15 05:40:08: at com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:63)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.DeserializationContext.reportInputMismatch(DeserializationContext.java:1429)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.DeserializationContext.handleMissingInstantiator(DeserializationContext.java:1059)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.ValueInstantiator._createFromStringFallbacks(ValueInstantiator.java:371)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromString(StdValueInstantiator.java:323)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromString(BeanDeserializerBase.java:1373)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeOther(BeanDeserializer.java:171)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:161)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:288)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4202)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3205)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3173)
Jan 15 05:40:08: at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:279)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
Jan 15 05:40:08: at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
Jan 15 05:40:08: at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
Jan 15 05:40:08: at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
Jan 15 05:40:08: at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
Jan 15 05:40:08: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Jan 15 05:40:08: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Jan 15 05:40:08: at java.lang.Thread.run(Thread.java:745)

The error comes from this line:

watcher.eventReceived(Watcher.Action.valueOf(watchEventType), mapper.readValue(watchObjectAsString, baseOperation.getType()));

I think it somehow mistakes the Status object for a CRD and falls into the wrong logic block.

Isn't this block supposed to handle the 410 correctly?
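To illustrate the ordering being asked about, here is a minimal sketch that inspects the event's type and the object's kind and code before any typed deserialization is attempted. It is not the fabric8 implementation: `WatchEventClassifier`, `Outcome`, and `classify` are hypothetical names, and the event is modeled as an already-parsed map so the example stays dependency-free.

```java
import java.net.HttpURLConnection;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the fabric8 client. */
final class WatchEventClassifier {

    enum Outcome { DELIVER_TO_WATCHER, RELIST, REPORT_ERROR }

    /**
     * Decides what to do with a watch event before it is deserialized
     * into the watched resource type (for example Job).
     *
     * @param eventType the "type" field of the watch event, e.g. "ERROR" or "MODIFIED"
     * @param object    the "object" field of the watch event, parsed into a generic map
     */
    static Outcome classify(String eventType, Map<String, Object> object) {
        boolean isStatus = "Status".equals(object.get("kind"));
        if ("ERROR".equals(eventType) || isStatus) {
            Object code = object.get("code");
            // 410 Gone means the resourceVersion is too old, so a fresh list is needed.
            if (code instanceof Number
                    && ((Number) code).intValue() == HttpURLConnection.HTTP_GONE) {
                return Outcome.RELIST;
            }
            return Outcome.REPORT_ERROR;
        }
        // Only non-error events are safe to map onto the typed resource.
        return Outcome.DELIVER_TO_WATCHER;
    }
}
```

With the payload from the log above (type "ERROR", kind "Status", code 410), this sketch would return RELIST instead of attempting to read the Status as the watched resource.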

  2. The second issue I observed is that the informer's hasSynced() always returns true, even after the re-sync period has passed and the re-sync has failed.

How to reproduce:

  • start a shared informer for Job, launch a Job, and keep checking whether the informer has synced
  • stop the kube-apiserver process
  • wait for the re-sync period to pass

What I expect: the re-list should fail because kube-apiserver is unavailable, and hasSynced() should return false.
What actually happens: hasSynced() keeps returning true without the cache being updated, and the client has no way to catch this error and restart the informers.

Is this expected behaviour? It seems that in client-go you can pass a stop channel to the informer, and if anything goes wrong the client can catch that and try restarting the informers, but we don't have that ability in Java yet.
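The expected behaviour described above could look roughly like the following sketch: the synced flag is cleared when a list attempt fails, and the failure is handed to a caller-supplied callback so the application can decide whether to restart its informers. `SyncTracker`, `relist`, and the `onError` callback are hypothetical names for illustration, not existing fabric8 API, and the `Callable` stands in for the real list request.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical sketch of sync-state tracking; not part of the fabric8 client. */
final class SyncTracker {

    private final AtomicBoolean synced = new AtomicBoolean(false);
    private final Consumer<Exception> onError;

    /** @param onError invoked with the failure whenever a list attempt throws */
    SyncTracker(Consumer<Exception> onError) {
        this.onError = onError;
    }

    /** True only if the most recent list attempt succeeded. */
    boolean hasSynced() {
        return synced.get();
    }

    /** Runs one list attempt; listCall is a stand-in for the real API request. */
    void relist(Callable<?> listCall) {
        try {
            listCall.call();
            synced.set(true);
        } catch (Exception e) {
            // A failed list means the cache can no longer be assumed fresh.
            synced.set(false);
            onError.accept(e);
        }
    }
}
```

A caller polling hasSynced() would then see false after a failed re-list, and the callback gives it a hook comparable in spirit to reacting to a closed stop channel in client-go.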

@kolorful kolorful changed the title Two SharedInformer relist issues Two SharedInformer issues related to kube-apiserver outage and relisting Jan 22, 2020
@kolorful kolorful changed the title Two SharedInformer issues related to kube-apiserver outage and relisting Two SharedInformer issues related to kube-apiserver unavailable and relisting Jan 22, 2020
@rohanKanojia rohanKanojia self-assigned this Jan 29, 2020
@rohanKanojia

@kolorful: Hi, I've started working on this issue. Could you please share an example of passing a stop channel to the informer in the Go client, as you mentioned in your last paragraph? I think it would be a good addition, and I can try integrating it along with fixing these two bugs.

@kolorful

Thank you @rohanKanojia, here is an example from sample-controller.

rohanKanojia added a commit to rohanKanojia/kubernetes-client that referenced this issue Feb 15, 2020
…er unavailable and relisting

+ relist when 410 is received
+ set HasSynced() to false when Reflector faces error
rohanKanojia added a commit to rohanKanojia/kubernetes-client that referenced this issue Feb 27, 2020
…er unavailable and relisting

+ relist when 410 is received
+ set HasSynced() to false when Reflector faces error