Two SharedInformer issues related to kube-apiserver unavailable and relisting #1961

kolorful · 2020-01-22T13:22:07Z

Previously discussed in #1943

I encountered two issues related to the shared informer (v4.7.0) while conducting destructive testing on our Kubernetes cluster.

The first issue is that the shared informer won't relist automatically after seeing a 410 Gone response. How to reproduce:

1. Start a shared informer for Job and launch a Job.
2. Stop the kube-apiserver process.
3. Restart the kube-apiserver process after a few seconds.
4. The same error keeps showing up in the logs, and the informer always returns true for hasSynced(), but the cache won't get any updates until the next re-list.

Is it possible to let the shared informer detect this condition and start a re-list right away?
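To make the request concrete, here is a minimal sketch of the behaviour I have in mind: when the watch fails with 410, re-list right away instead of waiting for the next scheduled re-list. All names here (RelistOnGoneSketch, ListerWatcher, WatchFailedException) are hypothetical and are not fabric8 classes; this only illustrates the control flow.

```java
/** Hypothetical sketch of a list+watch loop; none of these types exist in fabric8. */
final class RelistOnGoneSketch {
    static final int HTTP_GONE = 410;

    /** Hypothetical exception raised when the server ends a watch with an error Status. */
    static final class WatchFailedException extends RuntimeException {
        final int code;

        WatchFailedException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    /** Hypothetical abstraction over the list and watch calls. */
    interface ListerWatcher {
        /** Performs a full list and returns the resourceVersion to watch from. */
        String list();

        /** Watches from the given resourceVersion; returns when the watch ends normally. */
        void watch(String resourceVersion);
    }

    /**
     * Runs list then watch. If the watch fails with 410, the resourceVersion is
     * too old, so the loop lists again immediately. Returns how many lists ran.
     */
    static int run(ListerWatcher lw, int maxLists) {
        int lists = 0;
        while (lists < maxLists) {
            String resourceVersion = lw.list();
            lists++;
            try {
                lw.watch(resourceVersion);
                return lists; // watch ended without an error
            } catch (WatchFailedException e) {
                if (e.code != HTTP_GONE) {
                    throw e; // other errors are not handled in this sketch
                }
                // 410: fall through and re-list on the next iteration
            }
        }
        return lists;
    }
}
```

The maxLists bound is only there to keep the sketch from looping forever; a real informer would presumably use a backoff instead.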

In addition, the error seems to fall into the wrong if block.

Jan 15 05:40:08: [OkHttp https://127.0.0.1:6443/...] ERROR io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager - Could not deserialize watch event: {"type":"ERROR","object":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 544015 (544921)","reason":"Gone","code":410}}
Jan 15 05:40:08: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot construct instance of `io.fabric8.kubernetes.api.model.NodeStatus` (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('Failure')
Jan 15 05:40:08: at [Source: (String)"{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"too old resource version: 544015 (544921)","reason":"Gone","code":410}"; line: 1, column: 59] (through reference chain: io.fabric8.kubernetes.api.model.Node["status"])
Jan 15 05:40:08: at com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:63)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.DeserializationContext.reportInputMismatch(DeserializationContext.java:1429)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.DeserializationContext.handleMissingInstantiator(DeserializationContext.java:1059)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.ValueInstantiator._createFromStringFallbacks(ValueInstantiator.java:371)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.createFromString(StdValueInstantiator.java:323)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromString(BeanDeserializerBase.java:1373)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeOther(BeanDeserializer.java:171)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:161)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:288)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4202)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3205)
Jan 15 05:40:08: at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3173)
Jan 15 05:40:08: at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:279)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
Jan 15 05:40:08: at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
Jan 15 05:40:08: at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
Jan 15 05:40:08: at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
Jan 15 05:40:08: at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
Jan 15 05:40:08: at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
Jan 15 05:40:08: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Jan 15 05:40:08: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Jan 15 05:40:08: at java.lang.Thread.run(Thread.java:745)

The error comes from this line:

kubernetes-client/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java

Line 279 in b595e42

    
           watcher.eventReceived(Watcher.Action.valueOf(watchEventType), mapper.readValue(watchObjectAsString, baseOperation.getType()));

I think it somehow mistakes the Status for a CRD and falls into the wrong logic block.

kubernetes-client/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java

Line 259 in b595e42

if (status.getCode() == HTTP_GONE) {

Shouldn't this block be the one that handles the 410 correctly?
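As a rough illustration of what I would expect, the event type could be checked before the payload is deserialized into the resource type, so that an ERROR event carrying a Status never reaches the typed mapper.readValue call. The sketch below is dependency-free and uses regular expressions only to stay self-contained; it is not how WatchConnectionManager works, real code would parse the JSON with Jackson, and the class name WatchEventTriage is made up. It also assumes "type" is the first key of the event, as in the log above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that inspects a raw watch event before typed deserialization. */
final class WatchEventTriage {
    enum Kind { RESOURCE_EVENT, GONE, OTHER_ERROR }

    // Assumption: "type" is the first key of the watch event object.
    private static final Pattern TYPE =
        Pattern.compile("^\\s*\\{\\s*\"type\"\\s*:\\s*\"([A-Z]+)\"");
    private static final Pattern CODE = Pattern.compile("\"code\"\\s*:\\s*(\\d+)");

    static Kind classify(String rawEvent) {
        Matcher type = TYPE.matcher(rawEvent);
        if (!type.find() || !"ERROR".equals(type.group(1))) {
            // Not an ERROR event: safe to deserialize as the watched resource type.
            return Kind.RESOURCE_EVENT;
        }
        // ERROR events carry a Status object; look at its code instead of
        // trying to deserialize it as the watched resource.
        Matcher code = CODE.matcher(rawEvent);
        if (code.find() && Integer.parseInt(code.group(1)) == 410) {
            return Kind.GONE;
        }
        return Kind.OTHER_ERROR;
    }
}
```

With the payload from the log above, classify would return GONE, and the caller could then take the HTTP_GONE path instead of the typed deserialization path.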

The second issue I observe is that the informer's hasSynced() always returns true, even after the re-sync period has passed and the re-sync has failed.

How to reproduce:

1. Start a shared informer for Job, launch a Job, and keep checking whether the informer reports hasSynced().
2. Stop the kube-apiserver process.
3. Wait for the re-sync period to pass.

What I expect: the re-list should fail because kube-apiserver is unavailable, and hasSynced() should return false.
What actually happens: hasSynced() keeps returning true without the cache being updated, and the client has no way to catch this error and restart the informers.

Is this the expected behaviour? It seems that in client-go you can pass a stop channel to the informer, and if anything goes wrong the client can catch that and try restarting the informers, but we don't have that ability in Java yet.
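For illustration, this is roughly the behaviour I would expect, written as a small self-contained sketch: a failed re-list flips hasSynced() to false and notifies the caller through a callback, so the caller can decide to restart the informers. The class ResyncHealthSketch, its lister, and its onFailure callback are all hypothetical names and not part of the fabric8 API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical sketch of an informer that reports re-list failures. */
final class ResyncHealthSketch {
    private final AtomicBoolean synced = new AtomicBoolean(false);
    private final Runnable lister;                      // hypothetical: performs the list call, throws on failure
    private final Consumer<RuntimeException> onFailure; // hypothetical: lets the caller react to failures

    ResyncHealthSketch(Runnable lister, Consumer<RuntimeException> onFailure) {
        this.lister = lister;
        this.onFailure = onFailure;
    }

    /** Called on startup and on every re-sync period. */
    void relist() {
        try {
            lister.run();
            synced.set(true);
        } catch (RuntimeException e) {
            // The cache can no longer be trusted to be up to date.
            synced.set(false);
            onFailure.accept(e);
        }
    }

    boolean hasSynced() {
        return synced.get();
    }
}
```

A caller could pass a callback that stops and recreates the informers, which would play a role similar to the stop channel in client-go.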


rohanKanojia · 2020-02-12T07:15:14Z

@kolorful: Hi, I've started working on this issue. Could you please share an example of the stop channel passed to the informer in the Go client that you mentioned in the last paragraph? I think it would be a good addition, and I can try integrating it along with fixing these two bugs.

kolorful · 2020-02-12T15:55:27Z

Thank you @rohanKanojia, here is an example from sample-controller.

…er unavailable and relisting + relist when 410 is received + set HasSynced() to false when Reflector faces error

kolorful mentioned this issue Jan 22, 2020

Watch Connection leak when HTTP_GONE happens #1943

Closed

kolorful changed the title ~~Two SharedInformer relist issues~~ Two SharedInformer issues related to kube-apiserver outage and relisting Jan 22, 2020

kolorful changed the title ~~Two SharedInformer issues related to kube-apiserver outage and relisting~~ Two SharedInformer issues related to kube-apiserver unavailable and relisting Jan 22, 2020

rohanKanojia self-assigned this Jan 29, 2020

manusa mentioned this issue Feb 18, 2020

Refactor SharedInformers #2010

Closed

rohanKanojia mentioned this issue Feb 25, 2020

Fix #1961: Two SharedInformer issues related to kube-apiserver unavailable and relisting #2022

Merged

fusesource-ci closed this as completed in #2022 Mar 4, 2020

shawkins mentioned this issue May 7, 2021

Informer.hasSynced and Store.isPopulated #3090

Closed
