Issue with delay between HTTP GET and response of more than ~26 seconds #2364
Comments
These are all the same connection (same gateway, same cluster, same remote IP etc.) — I’m guessing you have multiple contexts defined in your kubeconfig.
Yes, you're right, the number of connections is equivalent to the number of contexts.
There is only one connection, you’re seeing it reported multiple times through the different contexts. Since your traffic goes through this connection (an IPsec tunnel), there can be an impact on the transmission, although since your average RTT is very low (0.5ms) I wouldn’t expect it to be noticeable. I also wouldn’t expect packets to be delivered out-of-order or duplicated, since the IPsec protocol goes to great lengths to avoid that. Perhaps there’s something going wrong between the nodes and the gateway; @sridhargaddam, @aswinsuryan, what do you think?
There is a video showing what our issue looks like. See the attachment: submariner.mp4
This is weird. Can you share the following details? Environment:
Also, please share the CNI that you are using in your setup. Note: while running the above commands, you can pass the admin context; otherwise, the same information is captured for ALL the contexts. For example:
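A minimal sketch of what that could look like; the kubeconfig paths and the context name are placeholders, and selecting the context with kubectl first is an assumption (it relies on subctl honouring the kubeconfig's current context):

```sh
# Merge both cluster kubeconfigs and select the admin context before running subctl.
export KUBECONFIG=/path/to/cluster-a/kubeconfig:/path/to/cluster-b/kubeconfig
kubectl config use-context admin     # context name is a placeholder

subctl show all
subctl diagnose all
subctl gather
```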
You can find the requested information in the text file attached. We use bare-metal hardware for our OpenShift deployment; it is shown on the diagram. amsoce02dr-is01.txt
Thank you for sharing the details. The output of "subctl diagnose all" does not show any errors or warnings. For example:
For each cluster, please tar the contents of the directory shown in the output of "subctl gather" and attach it to this issue. Also, may I know the OpenShift version you are using in your setup?
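For example, something like the following; the name of the directory that subctl gather creates varies, so treat it as a placeholder:

```sh
# subctl gather writes its output into a local directory; archive that directory per cluster.
subctl gather
tar -czf cluster-a-gather.tar.gz <gather-output-directory>
```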
+1 to what @sridhargaddam suggested. In addition, from the pcap files attached above, it seems that the TCP MSS values are asymmetric (1410 and 1160). I would also recommend making sure that you are not running into any MTU-related issue. For this you can run subctl verify with a small packet size (check [1]); this option is available in the latest subctl version. [1]
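A hedged sketch of such a verify run; the flag names below (--context, --tocontext, --packet-size) and the context names are assumptions based on recent subctl releases, so check `subctl verify --help` for your version:

```sh
# Run the connectivity verification with a reduced packet size to rule out MTU problems.
subctl verify --context cluster-a --tocontext cluster-b --only connectivity --packet-size 1200
```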
Following is the command you executed.
As you can see, the context names are the same, hence you are getting the error. Please rename the context of the second cluster from admin to some other name and then re-run the command. Following is the sample output of contexts from a KIND setup.
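For example, the second cluster's context can be renamed with kubectl; the kubeconfig path and the new context name are placeholders:

```sh
# Rename the duplicate "admin" context in the second cluster's kubeconfig.
kubectl --kubeconfig /path/to/cluster-b/kubeconfig config rename-context admin cluster-b-admin
```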
This is a proper connectivity test between two contexts in different clusters.
@sridhargaddam did you have a chance to look at the log files with the connection test?
@sridhargaddam Please let me know if we can arrange a call to troubleshoot our Submariner issue online.
From the attached logs, I can see 5 failures.
Gateway to Gateway connectivity is working fine and all the failures are when the pod is scheduled on the Note: The
If the above command succeeds and all the tests PASS, then it implies that there is indeed an MTU issue.
Looks like you are not using the right version of subctl. Please follow the instructions I shared above on how to download subctl version 0.15.0-rc0 and use that version to run the verify command.
Okay, so it's indeed an MTU issue. It looks like the underlay network connecting the clusters is adding some protocol overhead and the default MTU configured on the interfaces is not sufficient. Using standard tools, please check the proper MTU between a non-Gateway node and a remote cluster non-Gateway node, and once you derive the value, you can force the MTU as shown below. Add the following annotation on the Gateway nodes of both clusters. After adding the annotation, restart the submariner-routeagent pods.
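A sketch of those steps; the ping-based probe uses standard tooling, the annotation and the 1200 value are the ones quoted later in this thread, and node names, IPs and sizes are placeholders to be replaced with what your measurement shows:

```sh
# From a non-Gateway node, find the largest payload that passes without fragmentation.
ping -M do -s 1372 <remote-non-gw-node-ip>   # lower -s until the ping succeeds

# Clamp the TCP MSS via the Submariner annotation on the Gateway nodes of both clusters.
kubectl annotate node <gw-node-name> submariner.io/tcp-clamp-mss=1200
```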
This should fix the latency issue you are seeing with TCP packets.
@sridhargaddam I suppose the issue is on the Gateway nodes or in libreswan. Please have a look at the traceroute screenshots.
@surfinlemex Did you annotate the GW nodes on both clusters with the desired TCP MSS value as Sridhar mentioned? You should apply kubectl annotate node <gw_node_name> submariner.io/tcp-clamp-mss=1200 on the GWs in both clusters, and then restart all route-agent pods by running the command sketched below. If the above steps didn't help, please attach the latest Submariner logs (subctl gather).
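A sketch of that restart step, assuming the route-agent pods live in the submariner-operator namespace and carry the app=submariner-routeagent label (verify the namespace and label on your deployment):

```sh
# Delete the route-agent pods so the DaemonSet recreates them and picks up the new annotation.
kubectl delete pods -n submariner-operator -l app=submariner-routeagent
```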
Thanks for uploading the pcap files @surfinlemex, I'll try to check them in the next few days.
Could you please add the [1] iptables rule on the non-GW and GW nodes in both clusters and rerun the curl test? You can use the following steps to install the iptables rule on a node:
[1] iptables -t raw -I OUTPUT -s -d -j NOTRACK
[2] $ kubectl get pods -n submariner-operator -o wide
[3]
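A hypothetical reading of rule [1], with placeholder CIDRs standing in for the elided source and destination addresses (replace them with your local and remote pod CIDRs):

```sh
# NOTRACK rule from step [1]: exempt cross-cluster traffic from connection tracking.
iptables -t raw -I OUTPUT -s <local-pod-cidr> -d <remote-pod-cidr> -j NOTRACK
```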
Thanks @surfinlemex. Trying to summarize the case: What's the problem?
What did you try so far?
@yboaron nothing helped... neither TCP-MSS 1000 nor NOTRACK for iptables... I have only one option left: this issue is related to the CNI = OpenShift SDN = OVS. If you check the captured traffic, you can see inter-cluster traffic packets on the external NIC.
@surfinlemex I strongly suspect this is some platform configuration issue. Let's try to narrow down the problem.
Command:
After executing the above command, you can try the curl command once again and let us know the behavior.
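The exact command is not shown above; given that the root cause later turns out to be bad UDP checksums, one plausible shape is disabling checksum offload with ethtool. This is a guess, not the command from the thread, and the interface name is a placeholder:

```sh
# Disable TX checksum offload on the suspect interface so checksums are computed in software.
ethtool -K <interface> tx off
```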
@sridhargaddam Looks like the change you suggested resolved the issue.
@surfinlemex that's great to hear!! Can you let us know the following details from the BareMetal platform:
1. OS details:
2. Linux kernel:
3. OpenShift version:
4. iptables version:
@yboaron Can you please provide an RCA of the issue we had?
Well, I can see multiple UDP packets with a bad checksum captured in the pcap files you attached [1]. I assume that the root cause is some infrastructure issue (kernel or NIC firmware) causing bad checksum calculation for UDP packets, as multiple similar issues have also been reported in other projects [2]. [2]
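For reference, a quick way to look for such packets directly on a node; the interface name is a placeholder, and filtering on UDP port 4500 (IPsec NAT-T) is an assumption about this particular setup:

```sh
# tcpdump -vv verifies checksums and prints "bad udp cksum" for corrupted packets.
tcpdump -l -ni <interface> -vv udp port 4500 | grep -i 'bad udp cksum'
```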
This issue has been automatically marked as stale because it has not had activity for 60 days. It will be closed if no further activity occurs. Please make a comment if this issue/pr is still valid. Thank you for your contributions.
I think @yboaron said he hopes to take a look at this when he has cycles |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
In our two-cluster deployment we can observe several identical Submariner connections.
All of them are Active, and the diagnostics didn't detect any issues.
However, we have an issue with TCP session delay for connections between two pods in different clusters.
Can you please let us know whether it is a correct configuration when we see more than one connection, as shown on the screenshot in the attachment?
Can we somehow delete the extra connections?
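For context, the connections in question would be listed with something like the following, run once per kubeconfig context (which is why the same connection shows up several times); the loop is only illustrative, the subctl command itself is standard:

```sh
# List the gateway connections as seen from each context in the merged kubeconfig.
for ctx in $(kubectl config get-contexts -o name); do
  kubectl config use-context "$ctx"
  subctl show connections
done
```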