[BUG] [AWS] cluster networking issues using calico plugin - NodePort service not always responding #1129
Comments
@przemyslavic can you add steps to reproduce here?
So far I am not able to reproduce this on an AWS-Ubuntu-Calico or AWS-Ubuntu-Canal combo, using the following config:

```yaml
kind: epiphany-cluster
name: default
provider: aws
specification:
  admin_user:
    name: ubuntu
    key_path: /home/vscode/ssh/id_rsa_epi
  cloud:
    region: eu-west-3
    credentials: # todo change it to get credentials from vault
      key: blablabla
      secret: blablabla
    use_public_ips: true
  components:
    kubernetes_master:
      count: 1
      machine: kubernetes-master-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.1.0/24
      - availability_zone: eu-west-3b
        address_pool: 10.1.2.0/24
    kubernetes_node:
      count: 2
      machine: kubernetes-node-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.1.0/24
      - availability_zone: eu-west-3b
        address_pool: 10.1.2.0/24
    logging:
      count: 1
      machine: logging-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.3.0/24
    monitoring:
      count: 1
      machine: monitoring-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.4.0/24
    kafka:
      count: 0
      machine: kafka-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.5.0/24
    postgresql:
      count: 1
      machine: postgresql-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.6.0/24
    load_balancer:
      count: 0
      machine: load-balancer-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.7.0/24
    rabbitmq:
      count: 0
      machine: rabbitmq-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.8.0/24
    ignite:
      count: 0
      machine: ignite-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.9.0/24
    opendistro_for_elasticsearch:
      count: 0
      machine: logging-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.10.0/24
    single_machine:
      count: 0
      machine: single-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.1.0/24
      - availability_zone: eu-west-3b
        address_pool: 10.1.2.0/24
  name: awsu
  prefix: 'test'
title: Epiphany cluster Config
---
kind: configuration/applications
title: Kubernetes Applications Config
name: default
specification:
  applications:
  - name: ignite-stateless
    enabled: no
    image_path: apacheignite/ignite:2.5.0
    namespace: ignite
    service:
      rest_nodeport: 32300
      sql_nodeport: 32301
      thinclients_nodeport: 32302
    replicas: 1
    enabled_plugins:
    - ignite-kubernetes
    - ignite-rest-http
  - name: rabbitmq
    enabled: no
    image_path: rabbitmq:3.7.10
    use_local_image_registry: true
    service:
      name: rabbitmq-cluster
      port: 30672
      management_port: 31672
      replicas: 2
      namespace: queue
    rabbitmq:
      plugins:
      - rabbitmq_management
      - rabbitmq_management_agent
      policies:
      - name: ha-policy2
        pattern: .*
        definitions:
          ha-mode: all
      custom_configurations:
      - name: vm_memory_high_watermark.relative
        value: 0.5
      cluster:
  - name: auth-service
    enabled: yes
    image_path: jboss/keycloak:9.0.0
    use_local_image_registry: true
    service:
      name: as-testauthdb
      port: 30104
      replicas: 2
      namespace: namespace-for-auth
      admin_user: auth-service-username
      admin_password: PASSWORD_TO_CHANGE
    database:
      name: auth-database-name
      user: auth-db-user
      password: PASSWORD_TO_CHANGE
  - name: pgpool
    enabled: no
    image:
      path: bitnami/pgpool:4.1.1-debian-10-r29
      debug: no
    namespace: postgres-pool
    service:
      name: pgpool
      port: 5432
    replicas: 3
    pod_spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - pgpool
              topologyKey: kubernetes.io/hostname
      nodeSelector: {}
      tolerations: {}
    resources:
      limits:
        memory: 176Mi
      requests:
        cpu: 250m
        memory: 176Mi
    pgpool:
      env:
        PGPOOL_POSTGRES_USERNAME: epi_pgpool_postgres_admin
        PGPOOL_SR_CHECK_USER: epi_pgpool_sr_check
        PGPOOL_ADMIN_USERNAME: epi_pgpool_admin
        PGPOOL_ENABLE_LOAD_BALANCING: yes
        PGPOOL_MAX_POOL: 4
        PGPOOL_POSTGRES_PASSWORD_FILE: /opt/bitnami/pgpool/secrets/pgpool_postgres_password
        PGPOOL_SR_CHECK_PASSWORD_FILE: /opt/bitnami/pgpool/secrets/pgpool_sr_check_password
        PGPOOL_ADMIN_PASSWORD_FILE: /opt/bitnami/pgpool/secrets/pgpool_admin_password
      secrets:
        pgpool_postgres_password: PASSWORD_TO_CHANGE
        pgpool_sr_check_password: PASSWORD_TO_CHANGE
        pgpool_admin_password: PASSWORD_TO_CHANGE
      pgpool_conf_content_to_append: |
        #------------------------------------------------------------------------------
        # CUSTOM SETTINGS (appended by Epiphany to override defaults)
        #------------------------------------------------------------------------------
        # num_init_children = 32
        connection_life_time = 900
        reserved_connections = 1
  - name: pgbouncer
    enabled: no
    image_path: brainsam/pgbouncer:1.12
    init_image_path: bitnami/pgpool:4.1.1-debian-10-r29
    namespace: postgres-pool
    service:
      name: pgbouncer
      port: 5432
    replicas: 2
    resources:
      requests:
        cpu: 250m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 128Mi
    pgbouncer:
      env:
        DB_HOST: pgpool.postgres-pool.svc.cluster.local
        DB_LISTEN_PORT: 5432
        LISTEN_ADDR: '*'
        LISTEN_PORT: 5432
        AUTH_FILE: /etc/pgbouncer/auth/users.txt
        AUTH_TYPE: md5
        MAX_CLIENT_CONN: 150
        DEFAULT_POOL_SIZE: 25
        RESERVE_POOL_SIZE: 25
        POOL_MODE: transaction
version: 0.7.0
provider: aws
---
kind: configuration/kubernetes-master
title: "Kubernetes Master Config"
name: default
provider: aws
specification:
  version: 1.17.4
  cluster_name: "kubernetes-epiphany"
  allow_pods_on_master: False
  storage:
    name: epiphany-cluster-volume # name of the Kubernetes resource
    path: / # directory path in mounted storage
    enable: True
    capacity: 50 # GB
    data: {} #AUTOMATED - data specific to cloud provider
  advanced: # modify only if you are sure what value means
    api_server_args: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
      profiling: false
      enable-admission-plugins: "AlwaysPullImages,DenyEscalatingExec,NamespaceLifecycle,ServiceAccount,NodeRestriction"
      audit-log-path: "/var/log/apiserver/audit.log"
      audit-log-maxbackup: 10
      audit-log-maxsize: 200
    controller_manager_args: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
      profiling: false
      terminated-pod-gc-threshold: 200
    scheduler_args: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/
      profiling: false
    networking:
      dnsDomain: cluster.local
      serviceSubnet: 10.96.0.0/12
      plugin: calico # valid options: calico, flannel, canal (due to lack of support for calico on Azure - use canal)
    imageRepository: k8s.gcr.io
    certificatesDir: /etc/kubernetes/pki
    etcd_args:
      encrypted: yes
```

I am trying to reproduce it with the following script, both from master -> node and from node -> master:

```bash
while :
do
  START_TIME=$(date +%s%3N)
  OUTPUT=$(curl -o /dev/null -s -w '%{http_code}' -k https://node:30104/auth/)
  ELAPSED_TIME=$(expr $(date +%s%3N) - $START_TIME)
  echo "Request httpcode: $OUTPUT, Time: $ELAPSED_TIME milliseconds"
  sleep 2
done
```

After running it for a few hours I see request times between 30 and 100 ms, but not the 2 to 3 minutes reported by @przemyslavic. The issue seems similar to #1072.
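Worth noting for anyone else trying to reproduce this: with the default `externalTrafficPolicy: Cluster`, a request to any node's NodePort may be forwarded to a pod on a different node, and that inter-node hop is exactly the path Calico handles. A minimal sketch that probes every node in turn (the hostnames `node1`/`node2` are placeholders; port 30104 is the auth-service NodePort from the config above):

```bash
#!/usr/bin/env bash
# Probe the auth-service NodePort on each node; slow or hanging responses on
# only some nodes would point at the inter-node (Calico) forwarding path.
NODES="node1 node2"   # placeholders - substitute your worker node hostnames/IPs
PORT=30104            # auth-service NodePort from the config above

for NODE in $NODES; do
  START_TIME=$(date +%s%3N)
  # -m 10 caps each request at 10 seconds so a hang is visible but not blocking
  CODE=$(curl -o /dev/null -s -w '%{http_code}' -m 10 -k "https://${NODE}:${PORT}/auth/")
  ELAPSED=$(( $(date +%s%3N) - START_TIME ))
  echo "${NODE}: http ${CODE}, ${ELAPSED} ms"
done
```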
Shall we add the fix to the 0.6.x branch?
Yes, please add the fix to 0.6.x.
So this should not be part of 0.7.1 but of a 0.6.1 epic. Also, I think it's better to wait for the 0.7.1 release before we backmerge, since there are some fixes being made to K8s there.
Sure, for now I created a 0.6.1 milestone (without a due date) and assigned this issue to it.
I put this in the blocked column for now, since we need to await the 0.7.1 release.
I moved it to the correct 0.6.1 release and removed it from the 0.6.1 milestone.
Add information to the changelog's known issues section.
Handled in this PR |
Describe the bug
Cluster networking issues when using the calico plugin - the NodePort service does not always respond.
A Keycloak deployment is enabled on the cluster. Response timeouts occur at random.
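Not part of the original report, but two quick checks that may help narrow this down (the service name and namespace are taken from the config above; the label selector assumes the stock Calico manifests):

```bash
# Confirm the NodePort service actually has endpoints behind it
kubectl -n namespace-for-auth get svc,endpoints as-testauthdb

# Confirm every calico-node pod is Ready; a NotReady calico-node on one worker
# can stall NodePort traffic that lands on, or is forwarded to, that node
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
```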
Logs
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Config files
OS (please complete the following information):
Cloud Environment (please complete the following information):
Additional context