
AWS RedHat - cluster networking issues/lags using canal and flannel plugins #1072

Closed
przemyslavic opened this issue Mar 26, 2020 · 8 comments · Fixed by #1484 or #1501

@przemyslavic
Collaborator

przemyslavic commented Mar 26, 2020

Describe the bug
The curl command for k8s-deployed services on AWS RHEL takes about 1-2 seconds with the calico plugin, while with canal and flannel it takes 1-3 minutes.

 [ec2-user@ec2-35-180-45-41 ~]$ time curl -o /dev/null -s -w '%{http_code}' -k http://google.pl
 301
 real    0m0.044s
 user    0m0.002s
 sys     0m0.003s
 
 [ec2-user@ec2-35-180-45-41 ~]$ time curl 'http://127.0.0.1:32300/ignite?cmd=version'
 {"successStatus":0,"error":null,"sessionToken":null,"response":"2.5.0"}
 real    1m3.267s
 user    0m0.001s
 sys     0m0.009s
 
 [ec2-user@ec2-35-180-45-41 ~]$ time curl -o /dev/null -s -w '%{http_code}' -k https://ec2-35-180-45-41:30104/auth/
 200
 real    1m3.468s
 user    0m0.043s
 sys     0m0.077s

To Reproduce
Steps to reproduce the behavior:

  1. execute epicli init ... (with params)
  2. edit config file
  3. execute epicli apply ...

Expected behavior
Requests to services deployed on the cluster complete in comparable time regardless of the selected network plugin.

Config files
Configuration that should be included in the yaml file:

  • PostgreSQL - at least 1 VM
---
kind: configuration/kubernetes-master
name: default
provider: aws
specification:
  advanced:
    networking:
      plugin: canal
---
kind: configuration/applications
title: "Kubernetes Applications Config"
provider: aws
name: default
specification:
  applications:
  - name: rabbitmq
    enabled: yes
    image_path: rabbitmq:3.7.10
    #image_pull_secret_name: regcred # optional
    service:
      name: rabbitmq-cluster
      port: 30672
      management_port: 31672
      replicas: 2
      namespace: queue
    rabbitmq:
      #amqp_port: 5672 #optional - default 5672
      plugins: # optional list of RabbitMQ plugins
        - rabbitmq_management_agent
        - rabbitmq_management
      policies: # optional list of RabbitMQ policies
        - name: ha-policy2
          pattern: ".*"
          definitions:
            ha-mode: all
      custom_configurations: #optional list of RabbitMQ configurations (new format -> https://www.rabbitmq.com/configure.html)
        - name: vm_memory_high_watermark.relative
          value: 0.5
      #cluster:
        #is_clustered: true #redundant in an in-Kubernetes installation, it will always be clustered
        #cookie: "cookieSetFromDataYaml" #optional - default value will be a randomly generated string
  - name: ignite-stateless
    enabled: yes
    image_path: "apacheignite/ignite:2.5.0" # it will be part of the image path: {{local_repository}}/{{image_path}}
    namespace: ignite
    service:
      rest_nodeport: 32300
      sql_nodeport: 32301
      thinclients_nodeport: 32302
    replicas: 2
    enabled_plugins:
    - ignite-kubernetes # required to work on K8s
    - ignite-rest-http
  - name: auth-service # requires PostgreSQL to be installed in cluster
    enabled: yes
    image_path: jboss/keycloak:9.0.0
    use_local_image_registry: true
    #image_pull_secret_name: regcred
    service:
      name: as-testauthdb
      port: 30104
      replicas: 2
      namespace: namespace-for-auth
      admin_user: auth-service-username
      admin_password: PASSWORD_TO_CHANGE
    database:
      name: auth-database-name
      #port: "5432" # leave it when default
      user: auth-db-user
      password: PASSWORD_TO_CHANGE

OS (please complete the following information):

  • OS: RHEL7

Cloud Environment (please complete the following information):

  • Cloud Provider AWS


@mkyc
Contributor

mkyc commented Jul 2, 2020

@przemyslavic can you confirm this is still an existing issue? It's a bit old.

@jetalone85
Contributor

It may be similar to #1129, please check.

@toszo
Contributor

toszo commented Jul 9, 2020

There are troubleshooting docs in the Kubernetes documentation that you may find useful: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/

K8s Networking: https://kubernetes.io/docs/concepts/cluster-administration/networking/
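
A minimal first check from the linked debug-service guide, run from wherever kubectl is configured (the pod name and the busybox image tag are illustrative):

kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local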

@rpudlowski93 rpudlowski93 assigned rpudlowski93 and unassigned to-bar Jul 9, 2020
@rpudlowski93
Contributor

rpudlowski93 commented Jul 9, 2020

The problem still occurs and is reproducible in the following configuration:

OS: RHEL
Cloud Provider: AWS
Epiphany: v0.7.0
K8S: v1.17.7
Network plugin: Canal

[root@ec2-3-16-79-250 ~]# time curl -o /dev/null -s -w '%{http_code}' -k http://google.pl
301
real    0m0.082s
user    0m0.004s
sys     0m0.002s
[root@ec2-3-16-79-250 ~]# time curl 'http://127.0.0.1:32300/ignite?cmd=version'
{"successStatus":0,"error":null,"response":"2.5.0","sessionToken":null}
real    1m3.138s
user    0m0.002s
sys     0m0.004s
[root@ec2-3-16-79-250 ~]# time curl -o /dev/null -s -w '%{http_code}' -k https://ec2-3-16-79-250:30104/auth/
200
real    1m3.299s
user    0m0.031s
sys     0m0.053s

Moreover, I did a small investigation, and in the logs of the Ignite deployment I noticed something bad, namely:

[14:00:28,649][SEVERE][tcp-disco-ip-finder-cleaner-#4][TcpDiscoverySpi] Failed to clean IP finder up.
class org.apache.ignite.spi.IgniteSpiException: Failed to retrieve Ignite pods IP addresses.
        at org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder.getRegisteredAddresses(TcpDiscoveryKubernetesIpFinder.java:172)
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.registeredAddresses(TcpDiscoverySpi.java:1828)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$IpFinderCleaner.cleanIpFinder(ServerImpl.java:1938)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$IpFinderCleaner.body(ServerImpl.java:1913)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.net.UnknownHostException: kubernetes.default.svc.cluster.local

On the same configuration with Ubuntu the network issue doesn't occur and the logs look correct.
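
A quick way to confirm the resolution failure from inside the affected deployment (the pod name below is a placeholder - take it from kubectl get pods; this assumes the image ships nslookup):

kubectl get pods -n ignite
kubectl exec -n ignite <ignite-pod-name> -- nslookup kubernetes.default.svc.cluster.local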

@rpudlowski93 rpudlowski93 assigned to-bar and unassigned rpudlowski93 Jul 9, 2020
@mkyc mkyc modified the milestones: 0.7.1, S20200729 Jul 17, 2020
@rpudlowski93
Contributor

rpudlowski93 commented Jul 20, 2020

Update:
It looks like a Linux kernel bug in RedHat versions 7.6, 7.7 and 7.8.

Reason:
Some kernel versions have an issue with VXLAN checksum offloading - in our case the 3.10.0-xxx kernel.
VXLAN, which the flannel and canal CNI plugins use, transmits packets over UDP. The observed lag (63 sec.) is caused by checksum errors; after 63 seconds the "no cksum" flag is set.
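
One way to observe the symptom on a node is to capture the VXLAN traffic (UDP 8472, the same port as in the iptables workaround below) and look at the reported checksums. The capture interface name is environment-specific, and with offloading enabled tcpdump can report "bad udp cksum" on egress even on a healthy node, so compare an affected node with a working one:

sudo tcpdump -i eth0 -nn -vv udp port 8472 | grep -i cksum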

Workarounds:
ethtool --offload flannel.1 rx off tx off on every cluster node, to disable checksum offloading on the flannel interface,
or:
sudo iptables -A OUTPUT -p udp -m udp --dport 8472 -j MARK --set-xmark 0x0 on every cluster node.
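
To verify that the ethtool workaround took effect, the current offload settings of the flannel interface can be inspected; after disabling, tx-checksumming should report "off":

ethtool -k flannel.1 | grep -i checksum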

Additional context:
Reported bug: flannel-io/flannel#1282
RFC 7348 recommends not setting the UDP checksum: https://tools.ietf.org/html/rfc7348
The problem doesn't affect Ubuntu, because it ships kernel 5.3.0, which already has the checksum patch.

@rafzei rafzei self-assigned this Jul 21, 2020
@rafzei
Contributor

rafzei commented Jul 21, 2020

I would go with disabling only the TX checksum: ethtool -K flannel.1 tx-checksum-ip-generic off. It looks sufficient, can be applied 'live', and can easily be changed back.
The issue is already fixed in kubernetes/kubernetes#92035 - Changelog v1.18.5.
We should upgrade K8s to at least 1.18.5, or the latest (maybe 1.19 will be out by then), in Epiphany 0.8.
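
The ethtool setting does not survive a reboot, so a sketch of persisting it with a oneshot systemd unit might look like this (the unit name and ordering are assumptions, not part of Epiphany; flannel.1 must already exist when the unit runs):

cat <<'EOF' | sudo tee /etc/systemd/system/flannel-tx-checksum-off.service
[Unit]
Description=Disable TX checksum offload on flannel.1 (VXLAN kernel bug workaround)
# Assumption: flannel.1 exists once kubelet is up; adjust the ordering if needed.
After=kubelet.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now flannel-tx-checksum-off.service

Reverting is symmetric: ethtool -K flannel.1 tx-checksum-ip-generic on.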

@mkyc
Contributor

mkyc commented Jul 21, 2020

Added task for this.

@to-bar
Contributor

to-bar commented Jul 31, 2020

Upgrading K8s to v1.18.6 fixed the issue.
