Extended.[networking] services when using a plugin that isolates namespaces by default should allow connections to services in the default namespace from a pod in another namespace on the same node #14385
Seen something very similar in #14853: https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1111/
The failing test there was: Extended.[networking] services when using a plugin that isolates namespaces by default should allow connections from pods in the default namespace to a service in another namespace on a different node (from networking_multitenant_true_01.xml)
I missed this in the original report: the pod only ran for 3 seconds. Logs from the latest flaked run show:
Not exactly sure what triggers "no route to host" as opposed to a normal connect timeout, but it's definitely something that happens occasionally with service IP addresses. From the logs it looks like the iptables and OVS rules are both set up correctly. We don't do a readiness check on the webserver pod, but it was started 8 seconds before the wget command ran in this case, so I don't think that's the problem.
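For illustration, here is a minimal Go sketch (not from the test code) of how the two failure modes can be told apart when dialing a service IP directly from a node. The function name, the 5-second timeout, and the service address in `main` are made up, and the errno check assumes Linux:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

// classifyDialError distinguishes the ICMP-driven "no route to host"
// (EHOSTUNREACH on Linux) from an ordinary connect timeout, which is what
// you see when packets are silently dropped instead of rejected.
func classifyDialError(addr string) string {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err == nil {
		conn.Close()
		return "connected"
	}
	if errors.Is(err, syscall.EHOSTUNREACH) {
		return "no route to host (an ICMP host-unreachable came back)"
	}
	var ne net.Error
	if errors.As(err, &ne) && ne.Timeout() {
		return "connect timeout (no reply at all)"
	}
	return "other error: " + err.Error()
}

func main() {
	// Hypothetical service clusterIP:port.
	fmt.Println(classifyDialError("172.30.0.1:80"))
}
```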
OK, it looks like it happens when there is no iptables rule for the service IP. In that case OVS forwards the packet to the node via tun0, but the node has no iptables rule to rewrite it; the node's routing table says that 172.30.0.0/16 should be forwarded to tun0, which is where the packet came from, so rather than re-route it back out the same interface the node responds with an ICMP error. But the logs seem to show that the iptables rule was there...
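To make the routing part of that explanation concrete, here is a small hedged sketch (again not from the repo) that asks the kernel which route it would pick for the service IP; in the failure mode described above, `ip route get` reports a route back out tun0, the same interface the packet arrived on. The service IP in `main` is hypothetical:

```go
package main

import (
	"fmt"
	"os/exec"
)

// routeFor shows which interface the node would use to reach the given
// service IP. If the answer is tun0 and there is no DNAT rule for that IP,
// the packet would have to be sent back where it came from, which the
// kernel refuses to do, hence the ICMP error seen by the client.
func routeFor(serviceIP string) (string, error) {
	out, err := exec.Command("ip", "route", "get", serviceIP).CombinedOutput()
	return string(out), err
}

func main() {
	out, err := routeFor("172.30.0.1") // hypothetical service IP
	fmt.Println(out, err)
}
```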
Automatic merge from submit-queue (batch tested with PRs 15942, 15940, 15957, 15858, 15946): "Make service e2e tests retry to avoid flakes". This is an attempt to fix #14385; given that our tests tend to flake but the upstream service tests don't, it seems like we should make our tests more like theirs. So this replaces our `checkConnectivityToHost` code with code mostly copied from the upstream `execSourceipTest` (which, among other things, retries on failure until the timeout is reached). There are actually a lot of changes we could make to our tests to use new upstream code, but I wanted to keep this simple for now to avoid introducing new flakes. Fixes #14385 (hopefully)
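The shape of that change, going by the PR description, is roughly the following. This is a simplified stand-in that uses a plain HTTP GET rather than the wget-in-a-pod the real test uses, and `connectWithRetry` and its 2-second retry interval are invented names, not the actual helper:

```go
package e2eutil

import (
	"fmt"
	"net/http"
	"time"
)

// connectWithRetry keeps retrying until the deadline instead of failing on
// the first error, which is the behavior the upstream service tests use.
func connectWithRetry(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	var lastErr error
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			return nil // a single success is enough
		}
		lastErr = err
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("still failing after %v: %v", timeout, lastErr)
}
```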
Automatic merge from submit-queue: "Try to collect some debug info if e2e service test fails". Attempting to debug #14385; the logs don't show anything suspicious, so let's try to get more of them. @knobunc One thing I noticed is that the test is using the hybrid proxier, so if you have any ideas on possible failure modes with that (or additional debug commands that might be useful with it), let me know.
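For reference, the kind of debug collection being talked about could look something like the sketch below; the exact commands the PR runs may differ, and the br0 bridge name and the OpenFlow13 flag are assumptions based on how openshift-sdn bridges are usually set up:

```go
package debugdump

import "os/exec"

// collectServiceDebug gathers the node-side state that matters for a
// service IP: the full iptables ruleset and the OVS flow table.
func collectServiceDebug() map[string]string {
	cmds := map[string][]string{
		"iptables": {"iptables-save"},
		"ovs":      {"ovs-ofctl", "-O", "OpenFlow13", "dump-flows", "br0"},
	}
	out := make(map[string]string)
	for name, args := range cmds {
		b, err := exec.Command(args[0], args[1:]...).CombinedOutput()
		if err != nil {
			out[name] = "error: " + err.Error() + "\n" + string(b)
			continue
		}
		out[name] = string(b)
	}
	return out
}
```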
OK, in https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/test_branch_origin_extended_networking/1051/, the test fails, and we see that the expected iptables rules for the service don't exist. In fact, weirdly, there is no rule at all for the service (neither a functioning redirect rule nor a "no endpoints -> -j REJECT" rule). From the node-1 logs:
The node-2 logs look similar but don't have the "Not saving endpoints for unknown healthcheck" error. It looks like this might be a bad interaction between OnServiceAdd and OnEndpointsAdd, which run roughly simultaneously in different goroutines.
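If that race is real, the general fix would be to have the two handlers update shared state under one lock and rebuild the rules from a consistent view of both. A minimal sketch of that idea (not the actual proxier code; the type, maps, and string keys are invented for illustration):

```go
package proxysketch

import "sync"

// proxyState serializes the service and endpoints event handlers, which the
// informers deliver on separate goroutines, so they cannot interleave while
// the iptables rules for a service are being decided.
type proxyState struct {
	mu        sync.Mutex
	services  map[string]bool // hypothetical: service name -> seen
	endpoints map[string]bool // hypothetical: service name -> has endpoints
}

func newProxyState() *proxyState {
	return &proxyState{
		services:  map[string]bool{},
		endpoints: map[string]bool{},
	}
}

func (p *proxyState) OnServiceAdd(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.services[name] = true
	p.syncLocked(name)
}

func (p *proxyState) OnEndpointsAdd(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.endpoints[name] = true
	p.syncLocked(name)
}

// syncLocked rebuilds the rules for one service from a consistent view of
// both maps; callers must hold p.mu.
func (p *proxyState) syncLocked(name string) {
	// Write either the redirect rule or the "no endpoints -> REJECT" rule here.
	_ = name
}
```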
@danwinship I've seen this happen a few times recently; I'm assuming this isn't "new"?
Hm... #18258 actually shows something different. It looks like it's probably the kernel-namespaces-leaking-into-random-goroutines bug discussed in #15991. The logs are full of "impossible" networking errors, like the KUBE-SERVICES chain temporarily not existing when trying to add the rules for this service:
and lots of
I'm pretty sure I didn't see any evidence of #15991 in the logs the last time I looked at one of these flakes, though. It does seem weird that there would be two completely unrelated causes of the same flake (and we had even suspected #15991 of being the cause of this flake before), so it's possible that I just missed it earlier. The good news is that the golang 1.10 runtime fixes this (golang/go#20676).
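For anyone following along, the failure pattern in #15991 looks roughly like the sketch below. The function name and nsPath argument are hypothetical, and the comments summarize my reading of golang/go#20676 rather than the exact runtime internals:

```go
package netnssketch

import (
	"os"
	"runtime"

	"golang.org/x/sys/unix"
)

// doInPodNetns switches the current OS thread into a pod's network
// namespace and runs fn there. Before Go 1.10, a thread whose state had
// been altered like this could be handed back to the scheduler and reused
// by an unrelated goroutine, which would then run in the wrong namespace,
// producing the "impossible" errors seen in the logs.
func doInPodNetns(nsPath string, fn func() error) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	f, err := os.Open(nsPath) // e.g. /proc/<pid>/ns/net
	if err != nil {
		return err
	}
	defer f.Close()

	if err := unix.Setns(int(f.Fd()), unix.CLONE_NEWNET); err != nil {
		return err
	}
	// Note: nothing here moves the thread back to the host namespace before
	// UnlockOSThread, which is exactly the kind of leak being described.
	return fn()
}
```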
Automatic merge from submit-queue (batch tested with PRs 18376, 18355): "Move pod-namespace calls out of process". As discussed in #15991, we need to move all operations in the pod's network namespace out of process, due to a golang issue that allows setns() calls in a locked thread to leak into other threads, causing random lossage as operations intended for the main network namespace end up running in other namespaces instead. (This is fixed in golang 1.10, but we need a fix before then.) Fixes #15991 Fixes #14385 Fixes #13108 Fixes #18317
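The out-of-process idea can be summarized with a sketch like this: run the namespaced work in a short-lived child process so no thread of the long-lived node process ever calls setns(). The nsenter invocation is illustrative only; the PR adds its own helper rather than literally shelling out like this:

```go
package netnssketch

import "os/exec"

// runInPodNetns runs a command inside the network namespace at nsPath
// (e.g. /proc/<pid>/ns/net) using a separate nsenter process, so the
// calling process's threads are never switched into the pod namespace.
func runInPodNetns(nsPath string, args ...string) ([]byte, error) {
	nsenterArgs := append([]string{"--net=" + nsPath, "--"}, args...)
	return exec.Command("nsenter", nsenterArgs...).CombinedOutput()
}

// Example usage (hypothetical PID):
//   out, err := runInPodNetns("/proc/12345/ns/net", "ip", "addr", "show")
```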
https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/816/