PolicyAPIClient.IncomingLink() may incorrectly report Policy validation error if executed during service-controller restart #753

Closed
hash-d opened this issue May 4, 2022 · 4 comments · Fixed by #782
hash-d (Member) commented May 4, 2022

Description

During a full integration test run on a cluster with the Policy CRD installed and an all-allowing policy (all fields set to '*' or true), the HipsterShop test failed with the output below:

=== RUN   TestHipsterShop
--- FAIL: TestHipsterShop (53.83s)
    hipstershop.go:60: Creating VANs
    hipstershop.go:83: setting siteConfigSpec to run with Ingress=none
    hipstershop.go:105: assertion failed: error is not nil: Policy validation error: command terminated with exit code 137: unable to create token to public1 cluster
    hipstershop_test.go:46: entering teardown

Looking at the code, PolicyAPIClient.IncomingLink() executes get inside the service-controller container. Exit code 137 (128 + 9) indicates a process that was killed with SIGKILL.
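For reference, exit statuses above 128 follow the 128+signal convention, so 137 means SIGKILL. A minimal Go illustration (not Skupper code; the remote exit status is simulated locally with exit 137):

// Hypothetical illustration: recognize an exit status above 128 as
// "killed by signal (status - 128)"; 137 corresponds to SIGKILL.
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

func main() {
	// "exit 137" stands in for what the remote exec reports after a SIGKILL.
	err := exec.Command("sh", "-c", "exit 137").Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() > 128 {
		fmt.Printf("killed by signal %d (exit code %d)\n",
			exitErr.ExitCode()-128, exitErr.ExitCode())
	}
}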

How to reproduce

I'll try to create a reproducer that uses the actual code. Getting the 137 return code should not be difficult; however, reproducing the exact situation from the HipsterShop test that causes the pod to restart and generate the 137 may not be possible.

Meanwhile, I have reproduced only the 137 response with the following process:

  • Within the service-controller container, run an infinite loop calling get policies -h
    • (the loop must run inside the container; do not put the exec command itself in a loop, as the error will be different)
  • Remove the pod with something like oc delete pod skupper-service-controller-76d867f558-mm2vz
  • The loop will end with exit code 137

Sample output:

$ oc exec -ti skupper-service-controller-76d867f558-cbk6j -- bash
bash-4.4$ while true; do get policies -h > /dev/null; done 
command terminated with exit code 137
$ 

Suggestions

  • PolicyAPIClient.execGet() calls get twice: once with -h, and then with the actual query. The fix needs to be applied in both places, as either process may be the one killed during the restart
  • The first call to get already handles return code 1. I'd suggest that return codes above 128 also be handled, with a message stating that the process was killed by a signal (see the sketch after this list)
  • When that happens, the error should not be reported as a Policy error; it should probably be reported as a connectivity error
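A minimal sketch of the suggestion (function and error names below are hypothetical, not Skupper's actual API): treat exit codes above 128 from either get invocation as the remote process having been killed by a signal, and surface that as a connectivity problem rather than a policy validation error.

// Sketch only; runGet stands in for each of the two "get" calls made by execGet().
package policy

import (
	"errors"
	"fmt"
	"os/exec"
)

var ErrConnectivity = errors.New("unable to communicate with the Policy API")

func runGet(cmd *exec.Cmd) error {
	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		code := exitErr.ExitCode()
		if code > 128 {
			// 128+N conventionally means "killed by signal N"; 137 = SIGKILL,
			// which is what a service-controller restart produces here.
			return fmt.Errorf("%w: command killed by signal %d (exit code %d)",
				ErrConnectivity, code-128, code)
		}
		// Exit code 1 stays with the existing handling for the first
		// "get -h" call; other codes fall through unchanged.
	}
	return err
}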

That said, should HipsterShop have failed in the first place because of a restarting service-controller pod? Would it have failed if policies were disabled? Perhaps, besides fixing the message, a further check should be done on how enabling policies affects the way pre-existing parts of the code handle restarting pods.

Other notes

As a side note (this is probably assert.Assert's fault): the error message "assertion failed: error is not nil: Policy validation error: command terminated with exit code 137: unable to create token to public1 cluster" looks weird, in that it suggests that "unable to create token to public1 cluster" is wrapped within "command terminated with exit code 137", which is not the case.

I'm not sure whether that can be fixed, but it surely confuses things.
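To illustrate the side note (a sketch, assuming the message is composed with fmt.Errorf-style wrapping; this is not the actual test code): appending context after the error, instead of prefixing it, produces exactly this kind of reading, where the innermost-looking part is actually the outermost context.

package main

import (
	"errors"
	"fmt"
)

func main() {
	execErr := errors.New("command terminated with exit code 137")
	policyErr := fmt.Errorf("Policy validation error: %w", execErr)

	// Context appended after the error, which is what the log suggests happened:
	fmt.Println(fmt.Errorf("%w: unable to create token to public1 cluster", policyErr))
	// Policy validation error: command terminated with exit code 137: unable to create token to public1 cluster

	// Context prefixed before the error, the conventional order, reads more naturally:
	fmt.Println(fmt.Errorf("unable to create token to public1 cluster: %w", policyErr))
	// unable to create token to public1 cluster: Policy validation error: command terminated with exit code 137
}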

fgiorgetti self-assigned this May 4, 2022
fgiorgetti (Member) commented:

I will improve the error handling here to present a meaningful error message, stating that it is unable to communicate with the Policy API.

Is this just happening while validating incoming links?

hash-d (Member, Author) commented May 5, 2022

Is this just happening while validating incoming links?

I can't really say. As far as I can tell, it happened only once, yesterday; it has not happened again today.

In any case, the root cause seems to be in PolicyAPIClient.execGet(), not in PolicyAPIClient.IncomingLink(), so I'd guess it could affect any of the policy items.

hash-d (Member, Author) commented May 30, 2022

Another occurrence today, on a different test (tcp.TestIperf.skupper-iperf3-sites_1-size_1G-clients_1):

types.go:132: setting router debug options
types.go:167: Initializing Skupper at: public-iperf-1
types.go:294: Deploying iperf3-server
types.go:223: Waiting for router pod to be running
types.go:235: Buffering router logs for namespace 'public-iperf-1'
types.go:339: Exposing iperf3-server (service) through the Skupper network
types.go:346: assertion failed: error is not nil: Policy validation error: command terminated with exit code 137
types.go:252: Router logs buffer has been collected for namespace 'public-iperf-1': 144134 bytes
types.go:279: Tarball has been saved: /home/cloud-user/jenkins/workspace/skupper/policy-tests/skupper-integration-with-policy-downstream-latest-router-latest-ocp-latest-policy/skupper/test/integration/performance/tcp/tmp/skupper-iperf3-sites_1-size_1G-clients_1/public-iperf-1-skupper-router.log.tar.gz

fgiorgetti added a commit to fgiorgetti/skupper that referenced this issue Jun 1, 2022
hash-d added a commit to hash-d/skupper that referenced this issue Jun 1, 2022
Signed-off-by: Danilo Gonzalez Hashimoto <[email protected]>
fgiorgetti added a commit that referenced this issue Jun 2, 2022
* Handling controller restarts during policy validation

Fixes #753

* Updated retry interval for policy validation and improved error on timeout
nluaces pushed a commit that referenced this issue Jun 2, 2022
* Handling controller restarts during policy validation

Fixes #753

* Updated retry interval for policy validation and improved error on timeout
hash-d added a commit to hash-d/skupper that referenced this issue Jun 3, 2022
Signed-off-by: Danilo Gonzalez Hashimoto <[email protected]>
hash-d added a commit to hash-d/skupper that referenced this issue Jun 7, 2022
Signed-off-by: Danilo Gonzalez Hashimoto <[email protected]>
hash-d added a commit to hash-d/skupper that referenced this issue Jun 14, 2022
Signed-off-by: Danilo Gonzalez Hashimoto <[email protected]>
hash-d added a commit to hash-d/skupper that referenced this issue Jun 29, 2022
Signed-off-by: Danilo Gonzalez Hashimoto <[email protected]>
hash-d added a commit to hash-d/skupper that referenced this issue Jul 2, 2022
Signed-off-by: Danilo Gonzalez Hashimoto <[email protected]>