[BUG] [AWS] cluster networking issues using calico plugin - NodePort service not always responding #1129

Closed
przemyslavic opened this issue Apr 6, 2020 · 11 comments

@przemyslavic
Collaborator

przemyslavic commented Apr 6, 2020

Describe the bug
Cluster networking issues using the calico plugin - the NodePort service does not always respond.
A Keycloak deployment is running on the cluster. Response timeouts occur at random.
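
As a first check (a minimal sketch, assuming the auth-service settings from the default applications config: namespace namespace-for-auth, NodePort 30104), the service and its endpoints can be verified on the cluster before probing a node from outside:

# Verify the Keycloak NodePort service and its backing endpoints
# (namespace taken from the default applications config; adjust if customized)
kubectl get svc -n namespace-for-auth -o wide
kubectl get endpoints -n namespace-for-auth

# Probe the NodePort with an explicit timeout so a hang fails fast
# instead of blocking for minutes; <node-address> is a placeholder
curl -o /dev/null -s -w '%{http_code}\n' -k --max-time 10 https://<node-address>:30104/auth/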

Logs

[ubuntu@ec2-34-244-130-8 ~]$ time curl -o /dev/null -s -w '%{http_code}' -k https://ec2-34-244-189-100:30104/auth/
200
real    0m0.065s
user    0m0.016s
sys     0m0.000s
[ubuntu@ec2-34-244-130-8 ~]$ time curl -o /dev/null -s -w '%{http_code}' -k https://ec2-34-244-189-100:30104/auth/
000
real    2m9.797s
user    0m0.006s
sys     0m0.006s
[ubuntu@ec2-34-244-130-8 ~]$ time curl -o /dev/null -s -w '%{http_code}' -k https://ec2-34-244-189-100:30104/auth/
200
real    0m0.045s
user    0m0.016s
sys     0m0.000s
[ubuntu@ec2-34-244-130-8 ~]$ time curl -o /dev/null -s -w '%{http_code}' -k https://ec2-34-244-189-100:30104/auth/
200
real    0m0.043s
user    0m0.015s
sys     0m0.000s
[ubuntu@ec2-34-244-130-8 ~]$ time curl -o /dev/null -s -w '%{http_code}' -k https://ec2-34-244-189-100:30104/auth/
200
real    0m0.045s
user    0m0.016s
sys     0m0.000s
[ubuntu@ec2-34-244-130-8 ~]$ time curl -o /dev/null -s -w '%{http_code}' -k https://ec2-34-244-189-100:30104/auth/
000
real    2m9.461s
user    0m0.012s
sys     0m0.000s
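
The failing requests above return code 000 after roughly 130 seconds, which looks more like a connection-level timeout than a slow application response. As a sketch, curl's timing variables and explicit timeouts can help separate the two:

# --connect-timeout bounds only the connection phase, --max-time the whole request;
# comparing time_connect with time_total shows where the delay happens
time curl -o /dev/null -s -k \
     -w 'http_code=%{http_code} connect=%{time_connect}s total=%{time_total}s\n' \
     --connect-timeout 5 --max-time 30 \
     https://ec2-34-244-189-100:30104/auth/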

To Reproduce
Steps to reproduce the behavior:

Expected behavior

Config files

OS (please complete the following information):

  • Ubuntu

Cloud Environment (please complete the following information):

  • AWS

Additional context

@mkyc
Contributor

mkyc commented Jul 2, 2020

@przemyslavic can you add steps to reproduce here?

@jetalone85 jetalone85 changed the title [AWS] cluster networking issues using calico plugin - NodePort service not always responding [BUG] [AWS] cluster networking issues using calico plugin - NodePort service not always responding Jul 3, 2020
@seriva seriva self-assigned this Jul 3, 2020
@seriva
Collaborator

seriva commented Jul 7, 2020

So far I am not able to reproduce this on an AWS-Ubuntu-Calico or AWS-Ubuntu-Canal combo, using the following config:

kind: epiphany-cluster
name: default
provider: aws
specification:
  admin_user:
    name: ubuntu
    key_path: /home/vscode/ssh/id_rsa_epi
  cloud:
    region: eu-west-3
    credentials: # todo change it to get credentials from vault
      key: blablabla
      secret: blablabla
    use_public_ips: true  
  components:
    kubernetes_master:
      count: 1
      machine: kubernetes-master-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.1.0/24
      - availability_zone: eu-west-3b
        address_pool: 10.1.2.0/24
    kubernetes_node:
      count: 2
      machine: kubernetes-node-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.1.0/24
      - availability_zone: eu-west-3b
        address_pool: 10.1.2.0/24
    logging:
      count: 1
      machine: logging-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.3.0/24
    monitoring:
      count: 1
      machine: monitoring-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.4.0/24
    kafka:
      count: 0
      machine: kafka-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.5.0/24
    postgresql:
      count: 1
      machine: postgresql-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.6.0/24
    load_balancer:
      count: 0
      machine: load-balancer-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.7.0/24
    rabbitmq:
      count: 0
      machine: rabbitmq-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.8.0/24
    ignite:
      count: 0
      machine: ignite-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.9.0/24
    opendistro_for_elasticsearch:
      count: 0
      machine: logging-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.10.0/24
    single_machine:
      count: 0
      machine: single-machine
      configuration: default
      subnets:
      - availability_zone: eu-west-3a
        address_pool: 10.1.1.0/24
      - availability_zone: eu-west-3b
        address_pool: 10.1.2.0/24
  name: awsu
  prefix: 'test'
title: Epiphany cluster Config
---
kind: configuration/applications
title: Kubernetes Applications Config
name: default
specification:
  applications:
  - name: ignite-stateless
    enabled: no
    image_path: apacheignite/ignite:2.5.0
    namespace: ignite
    service:
      rest_nodeport: 32300
      sql_nodeport: 32301
      thinclients_nodeport: 32302
    replicas: 1
    enabled_plugins:
    - ignite-kubernetes
    - ignite-rest-http
  - name: rabbitmq
    enabled: no
    image_path: rabbitmq:3.7.10
    use_local_image_registry: true
    service:
      name: rabbitmq-cluster
      port: 30672
      management_port: 31672
      replicas: 2
      namespace: queue
    rabbitmq:
      plugins:
      - rabbitmq_management
      - rabbitmq_management_agent
      policies:
      - name: ha-policy2
        pattern: .*
        definitions:
          ha-mode: all
      custom_configurations:
      - name: vm_memory_high_watermark.relative
        value: 0.5
      cluster:
  - name: auth-service
    enabled: yes
    image_path: jboss/keycloak:9.0.0
    use_local_image_registry: true
    service:
      name: as-testauthdb
      port: 30104
      replicas: 2
      namespace: namespace-for-auth
      admin_user: auth-service-username
      admin_password: PASSWORD_TO_CHANGE
    database:
      name: auth-database-name
      user: auth-db-user
      password: PASSWORD_TO_CHANGE
  - name: pgpool
    enabled: no
    image:
      path: bitnami/pgpool:4.1.1-debian-10-r29
      debug: no
    namespace: postgres-pool
    service:
      name: pgpool
      port: 5432
    replicas: 3
    pod_spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - pgpool
              topologyKey: kubernetes.io/hostname
      nodeSelector: {}
      tolerations: {}
    resources:
      limits:
        memory: 176Mi
      requests:
        cpu: 250m
        memory: 176Mi
    pgpool:
      env:
        PGPOOL_POSTGRES_USERNAME: epi_pgpool_postgres_admin
        PGPOOL_SR_CHECK_USER: epi_pgpool_sr_check
        PGPOOL_ADMIN_USERNAME: epi_pgpool_admin
        PGPOOL_ENABLE_LOAD_BALANCING: yes
        PGPOOL_MAX_POOL: 4
        PGPOOL_POSTGRES_PASSWORD_FILE: /opt/bitnami/pgpool/secrets/pgpool_postgres_password
        PGPOOL_SR_CHECK_PASSWORD_FILE: /opt/bitnami/pgpool/secrets/pgpool_sr_check_password
        PGPOOL_ADMIN_PASSWORD_FILE: /opt/bitnami/pgpool/secrets/pgpool_admin_password
      secrets:
        pgpool_postgres_password: PASSWORD_TO_CHANGE
        pgpool_sr_check_password: PASSWORD_TO_CHANGE
        pgpool_admin_password: PASSWORD_TO_CHANGE
      pgpool_conf_content_to_append: |
        #------------------------------------------------------------------------------
        # CUSTOM SETTINGS (appended by Epiphany to override defaults)
        #------------------------------------------------------------------------------
        # num_init_children = 32
        connection_life_time = 900
        reserved_connections = 1
  - name: pgbouncer
    enabled: no
    image_path: brainsam/pgbouncer:1.12
    init_image_path: bitnami/pgpool:4.1.1-debian-10-r29
    namespace: postgres-pool
    service:
      name: pgbouncer
      port: 5432
    replicas: 2
    resources:
      requests:
        cpu: 250m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 128Mi
    pgbouncer:
      env:
        DB_HOST: pgpool.postgres-pool.svc.cluster.local
        DB_LISTEN_PORT: 5432
        LISTEN_ADDR: '*'
        LISTEN_PORT: 5432
        AUTH_FILE: /etc/pgbouncer/auth/users.txt
        AUTH_TYPE: md5
        MAX_CLIENT_CONN: 150
        DEFAULT_POOL_SIZE: 25
        RESERVE_POOL_SIZE: 25
        POOL_MODE: transaction
version: 0.7.0
provider: aws
---
kind: configuration/kubernetes-master
title: "Kubernetes Master Config"
name: default
provider: aws
specification:
  version: 1.17.4
  cluster_name: "kubernetes-epiphany"
  allow_pods_on_master: False
  storage:
    name: epiphany-cluster-volume # name of the Kubernetes resource
    path: / # directory path in mounted storage
    enable: True
    capacity: 50 # GB
    data: {} #AUTOMATED - data specific to cloud provider
  advanced: # modify only if you are sure what value means
    api_server_args: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
      profiling: false
      enable-admission-plugins: "AlwaysPullImages,DenyEscalatingExec,NamespaceLifecycle,ServiceAccount,NodeRestriction"
      audit-log-path: "/var/log/apiserver/audit.log"
      audit-log-maxbackup: 10
      audit-log-maxsize: 200
    controller_manager_args: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
      profiling: false
      terminated-pod-gc-threshold: 200
    scheduler_args:  # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/
      profiling: false
    networking:
      dnsDomain: cluster.local
      serviceSubnet: 10.96.0.0/12
      plugin: calico # valid options: calico, flannel, canal (due to lack of support for calico on Azure - use canal)
    imageRepository: k8s.gcr.io
    certificatesDir: /etc/kubernetes/pki
    etcd_args:
      encrypted: yes

I am using the following script to try to reproduce it, testing both master -> node and node -> master:

# Hit the Keycloak NodePort every 2 seconds and report the HTTP code and latency
while :
do
        START_TIME=$(date +%s%3N)   # wall clock in milliseconds
        OUTPUT=$(curl -o /dev/null -s -w '%{http_code}' -k https://node:30104/auth/)
        ELAPSED_TIME=$(( $(date +%s%3N) - START_TIME ))
        echo "Request httpcode: $OUTPUT, Time: $ELAPSED_TIME milliseconds"
        sleep 2
done

After running it for a few hours I see request times between 30 and 100 ms, but not the 2 to 3 minutes reported by @przemyslavic. The issue seems similar to #1072.
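
For longer unattended runs, a hedged variant of the loop above that logs only failed or slow requests (NODE, PORT and the 1-second threshold are placeholders) makes intermittent failures easier to spot in the output:

# Log only failed (non-200) or slow (>1s) requests; NODE and PORT are placeholders
NODE=node
PORT=30104
while :
do
        START_TIME=$(date +%s%3N)
        CODE=$(curl -o /dev/null -s -w '%{http_code}' -k --max-time 60 "https://${NODE}:${PORT}/auth/")
        ELAPSED=$(( $(date +%s%3N) - START_TIME ))
        if [ "$CODE" != "200" ] || [ "$ELAPSED" -gt 1000 ]; then
                echo "$(date -Is) http_code=$CODE time=${ELAPSED}ms"
        fi
        sleep 2
done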

@seriva
Collaborator

seriva commented Jul 7, 2020

I double-checked on a cluster deployed with 0.6.0 and there I can reproduce it easily:

[screenshot: curl output from the 0.6.0 cluster showing the intermittent timeouts]

So it seems to be resolved by the latest updates done to Calico.
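
When comparing the 0.6.0 cluster with the current one, it helps to confirm which Calico build each is actually running; a minimal check, assuming the standard manifests that deploy the calico-node DaemonSet in kube-system:

# Print the calico-node image (and thus the Calico version) used by the cluster
kubectl -n kube-system get daemonset calico-node \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'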

@mkyc
Contributor

mkyc commented Jul 7, 2020

Shall we add the fix to the 0.6.x branch?

@toszo
Contributor

toszo commented Jul 9, 2020

Yes, please add the fix to 0.6.x.

@seriva
Collaborator

seriva commented Jul 9, 2020

So this should not be part of 0.7.1 but of a 0.6.1 epic. Also I think it's better to wait for the 0.7.1 release before we backmerge, since there are some fixes being made to K8s there.

@mkyc mkyc modified the milestones: 0.7.1, 0.6.1 Jul 9, 2020
@mkyc
Contributor

mkyc commented Jul 9, 2020

Sure, for now I created the 0.6.1 milestone (without a due date) and assigned this issue to it.

@seriva
Collaborator

seriva commented Jul 17, 2020

I put this in the blocked column for now since we need to await the 0.7.1 release.

@mkyc mkyc removed this from the 0.6.1 milestone Jul 29, 2020
@mkyc
Contributor

mkyc commented Jul 29, 2020

I moved it to the correct 0.6.1 release and removed it from the 0.6.1 milestone.

@seriva seriva removed their assignment Oct 9, 2020
@mkyc
Contributor

mkyc commented Apr 8, 2021

Add information to the changelog's known issues section.

@mkyc mkyc added this to the S20210422 milestone Apr 8, 2021
@mkyc mkyc self-assigned this Apr 8, 2021
@mkyc
Contributor

mkyc commented Apr 8, 2021

Handled in this PR

@mkyc mkyc closed this as completed Apr 13, 2021