
AWS RedHat - cluster networking issues/lags using canal and flannel plugins #1072

Closed
przemyslavic opened this issue Mar 26, 2020 · 8 comments · Fixed by #1484 or #1501

@przemyslavic
Collaborator

przemyslavic commented Mar 26, 2020

Describe the bug
The curl command for k8s-deployed services on AWS RHEL takes about 1-2 seconds with the calico plugin, while with canal and flannel it takes 1-3 minutes.

 [ec2-user@ec2-35-180-45-41 ~]$ time curl -o /dev/null -s -w '%{http_code}' -k http://google.pl
 301
 real    0m0.044s
 user    0m0.002s
 sys     0m0.003s
 
 [ec2-user@ec2-35-180-45-41 ~]$ time curl 'http://127.0.0.1:32300/ignite?cmd=version'
 {"successStatus":0,"error":null,"sessionToken":null,"response":"2.5.0"}
 real    1m3.267s
 user    0m0.001s
 sys     0m0.009s
 
 [ec2-user@ec2-35-180-45-41 ~]$ time curl -o /dev/null -s -w '%{http_code}' -k https://ec2-35-180-45-41:30104/auth/
 200
 real    1m3.468s
 user    0m0.043s
 sys     0m0.077s

To Reproduce
Steps to reproduce the behavior:

  1. execute epicli init ... (with params)
  2. edit config file
  3. execute epicli apply ...

Expected behavior
Requests to services deployed on the cluster complete in comparable time regardless of the selected network plugin.

Config files
Configuration that should be included in the yaml file:

  • PostgreSQL - at least 1 VM
---
kind: configuration/kubernetes-master
name: default
provider: aws
specification:
  advanced:
    networking:
      plugin: canal
---
kind: configuration/applications
title: "Kubernetes Applications Config"
provider: aws
name: default
specification:
  applications:
  - name: rabbitmq
    enabled: yes
    image_path: rabbitmq:3.7.10
    #image_pull_secret_name: regcred # optional
    service:
      name: rabbitmq-cluster
      port: 30672
      management_port: 31672
      replicas: 2
      namespace: queue
    rabbitmq:
      #amqp_port: 5672 #optional - default 5672
      plugins: # optional list of RabbitMQ plugins
        - rabbitmq_management_agent
        - rabbitmq_management
      policies: # optional list of RabbitMQ policies
        - name: ha-policy2
          pattern: ".*"
          definitions:
            ha-mode: all
      custom_configurations: #optional list of RabbitMQ configurations (new format -> https://www.rabbitmq.com/configure.html)
        - name: vm_memory_high_watermark.relative
          value: 0.5
      #cluster:
        #is_clustered: true #redundant in an in-Kubernetes installation, it will always be clustered
        #cookie: "cookieSetFromDataYaml" #optional - default value will be a randomly generated string
  - name: ignite-stateless
    enabled: yes
    image_path: "apacheignite/ignite:2.5.0" # it will be part of the image path: {{local_repository}}/{{image_path}}
    namespace: ignite
    service:
      rest_nodeport: 32300
      sql_nodeport: 32301
      thinclients_nodeport: 32302
    replicas: 2
    enabled_plugins:
    - ignite-kubernetes # required to work on K8s
    - ignite-rest-http
  - name: auth-service # requires PostgreSQL to be installed in cluster
    enabled: yes
    image_path: jboss/keycloak:9.0.0
    use_local_image_registry: true
    #image_pull_secret_name: regcred
    service:
      name: as-testauthdb
      port: 30104
      replicas: 2
      namespace: namespace-for-auth
      admin_user: auth-service-username
      admin_password: PASSWORD_TO_CHANGE
    database:
      name: auth-database-name
      #port: "5432" # leave it when default
      user: auth-db-user
      password: PASSWORD_TO_CHANGE

OS (please complete the following information):

  • OS: RHEL7

Cloud Environment (please complete the following information):

  • Cloud Provider AWS


@mkyc
Contributor

mkyc commented Jul 2, 2020

@przemyslavic can you confirm this is still an existing issue? It's a bit old.

@jetalone85
Contributor

It may be similar to #1129, please check.

@toszo
Contributor

toszo commented Jul 9, 2020

There are troubleshooting docs in the Kubernetes documentation that you may find useful: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/

K8s Networking: https://kubernetes.io/docs/concepts/cluster-administration/networking/
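
A minimal first check from the linked debug-service guide, run from wherever kubectl is configured (the pod name and the busybox image tag are illustrative):

kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local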

@rpudlowski93 rpudlowski93 assigned rpudlowski93 and unassigned to-bar Jul 9, 2020
@rpudlowski93
Contributor

rpudlowski93 commented Jul 9, 2020

The problem still occurs and is reproducible in the following configuration:

OS: RHEL
Cloud Provider: AWS
Epiphany: v0.7.0
K8S: v1.17.7
Network plugin: Canal

[root@ec2-3-16-79-250 ~]# time curl -o /dev/null -s -w '%{http_code}' -k http://google.pl
301
real    0m0.082s
user    0m0.004s
sys     0m0.002s
[root@ec2-3-16-79-250 ~]# time curl 'http://127.0.0.1:32300/ignite?cmd=version'
{"successStatus":0,"error":null,"response":"2.5.0","sessionToken":null}
real    1m3.138s
user    0m0.002s
sys     0m0.004s
[root@ec2-3-16-79-250 ~]# time curl -o /dev/null -s -w '%{http_code}' -k https://ec2-3-16-79-250:30104/auth/
200
real    1m3.299s
user    0m0.031s
sys     0m0.053s

Moreover, I did a small investigation, and in the logs of the Ignite deployment I noticed something bad, namely:

[14:00:28,649][SEVERE][tcp-disco-ip-finder-cleaner-#4][TcpDiscoverySpi] Failed to clean IP finder up.
class org.apache.ignite.spi.IgniteSpiException: Failed to retrieve Ignite pods IP addresses.
        at org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder.getRegisteredAddresses(TcpDiscoveryKubernetesIpFinder.java:172)
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.registeredAddresses(TcpDiscoverySpi.java:1828)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$IpFinderCleaner.cleanIpFinder(ServerImpl.java:1938)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$IpFinderCleaner.body(ServerImpl.java:1913)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.net.UnknownHostException: kubernetes.default.svc.cluster.local

On the same configuration with Ubuntu the network issue doesn't occur and the logs look correct.
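
A quick way to confirm the resolution failure from inside the affected deployment (the pod name below is a placeholder - take it from kubectl get pods; this assumes the image ships nslookup):

kubectl get pods -n ignite
kubectl exec -n ignite <ignite-pod-name> -- nslookup kubernetes.default.svc.cluster.local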

@rpudlowski93 rpudlowski93 assigned to-bar and unassigned rpudlowski93 Jul 9, 2020
@mkyc mkyc modified the milestones: 0.7.1, S20200729 Jul 17, 2020
@rpudlowski93
Contributor

rpudlowski93 commented Jul 20, 2020

Update:
It looks like a Linux kernel bug in RedHat versions 7.6, 7.7 and 7.8.

Reason:
Some kernel versions have an issue with VXLAN checksum offloading - in our case the 3.10.0-xxx kernel.
VXLAN, which the flannel and canal CNI plugins use, transmits packets over UDP. The observed lag (63 sec.) is caused by checksum errors; after 63 seconds the "no cksum" flag is set.
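
One way to observe the symptom on a node is to capture the VXLAN traffic (UDP 8472, the same port as in the iptables workaround below) and look at the reported checksums. The capture interface name is environment-specific, and with offloading enabled tcpdump can report "bad udp cksum" on egress even on a healthy node, so compare an affected node with a working one:

sudo tcpdump -i eth0 -nn -vv udp port 8472 | grep -i cksum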

Workarounds:
ethtool --offload flannel.1 rx off tx off on every cluster node, to disable checksum offloading on the flannel interface,
or:
sudo iptables -A OUTPUT -p udp -m udp --dport 8472 -j MARK --set-xmark 0x0 on every cluster node.
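
To verify that the ethtool workaround took effect, the current offload settings of the flannel interface can be inspected; after disabling, tx-checksumming should report "off":

ethtool -k flannel.1 | grep -i checksum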

Additional context:
Reported bug: flannel-io/flannel#1282
RFC 7348 recommends not setting the UDP checksum: https://tools.ietf.org/html/rfc7348
The problem doesn't affect Ubuntu, because it ships kernel 5.3.0, which already has the checksum patch.

@rafzei rafzei self-assigned this Jul 21, 2020
@rafzei
Contributor

rafzei commented Jul 21, 2020

I would go with disabling only the TX checksum: ethtool -K flannel.1 tx-checksum-ip-generic off. It looks sufficient, can be applied 'live', and can easily be changed back.
The issue is already fixed in kubernetes/kubernetes#92035 - Changelog v1.18.5.
We should upgrade K8s to at least 1.18.5, or the latest (maybe 1.19 will be out by then), in Epiphany 0.8.
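
The ethtool setting does not survive a reboot, so a sketch of persisting it with a oneshot systemd unit might look like this (the unit name and ordering are assumptions, not part of Epiphany; flannel.1 must already exist when the unit runs):

cat <<'EOF' | sudo tee /etc/systemd/system/flannel-tx-checksum-off.service
[Unit]
Description=Disable TX checksum offload on flannel.1 (VXLAN kernel bug workaround)
# Assumption: flannel.1 exists once kubelet is up; adjust the ordering if needed.
After=kubelet.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now flannel-tx-checksum-off.service

Reverting is symmetric: ethtool -K flannel.1 tx-checksum-ip-generic on.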

@mkyc
Contributor

mkyc commented Jul 21, 2020

Added task for this.

@to-bar
Contributor

to-bar commented Jul 31, 2020

Upgrading K8s to v1.18.6 fixed the issue.
