[Spike] Research and confirm which Epiphany components scale down correctly/incorrectly #1497

Closed
rafzei opened this issue Jul 28, 2020 · 4 comments

rafzei commented Jul 28, 2020

Is your feature request related to a problem? Please describe.
We don't really know which Epiphany components scale down correctly and which do not. It's not documented anywhere.

Describe the solution you'd like:
This spike should help us provide well-described CLUSTER documentation. We should also learn what has to be done to provide full autoscaling.
After the spike, we should create new issues to improve/implement autoscaling.

@rafzei rafzei added this to the S20200729 milestone Jul 28, 2020
@rafzei rafzei added the area/docs Area for documentation improvement and addition label Jul 29, 2020
@mkyc mkyc modified the milestones: S20200729, S20200813 Jul 29, 2020
mkyc commented Aug 11, 2020

related to #1051

@mkyc mkyc modified the milestones: S20200813, S20200827 Aug 13, 2020
@atsikham atsikham self-assigned this Aug 17, 2020
atsikham commented

Results are described in the next comment.

atsikham commented Aug 20, 2020

Scaling status

This document describes the results of upscale and downscale tests for the main Epiphany components.

Prerequisites

  • HEAD is 379fb2c6b2b826db047339e6107edb16e07bfa78
  • Azure platform

Kubernetes master

Upscale

Supported.

Downscale

Not supported; epicli fails with the following log entry:

ERROR epicli -ControlPlane downscale is not supported yet. Please revert your 'kubernetes_master' count to previous value or increase it to scale up kubernetes.

Summary

Need to add support for downscale.

Tested scheme: 1->3.

Kubernetes node

Upscale

Upscale works as expected; the node list and statuses were checked via kubectl.
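
That check is essentially the following (a minimal sketch; run from any host with the admin kubeconfig):

# every VM should appear in the list and report Ready after an upscale
kubectl get nodes -o wide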

Downscale

  • No drain step, just removal; draining is not implemented in the codebase, only described in the documentation (see the sketch below).
  • No disks are removed.
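
For comparison, a graceful removal would look roughly like this (a sketch only; <node-name> is a placeholder and the flags match kubectl versions of that time):

# cordon the node and evict workloads so they get rescheduled first
kubectl drain <node-name> --ignore-daemonsets --delete-local-data

# only then remove the node from the cluster
kubectl delete node <node-name>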

Summary

Need to add support for downscale.

Tested scheme: 1->3->0.

Logging

Upscale

As expected, but sometimes the number of retries is not enough for the kibana : Wait for kibana to be ready step.

Downscale

  • Disks are not removed.
  • The number of retries is often not enough for the kibana : Wait for kibana to be ready step.

Summary

The kibana : Wait for kibana to be ready step often fails after all retries.

See the Opendistro for elasticsearch section; the elasticsearch cluster itself runs as expected. Additionally it was verified that the kibana, filebeat and prometheus-node-exporter services are active.
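
That service check boils down to (a sketch; unit names as reported on the logging VMs):

# each unit should print "active"
systemctl is-active kibana filebeat prometheus-node-exporter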

As in other components, disks should be removed after downscale.

Tested scheme: 1->2->3->1->0.

Monitoring

Upscale

Supported; it was verified that the prometheus, grafana-server, prometheus-node-exporter and filebeat services are active.
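
A quick health probe can complement the service checks (a sketch, assuming Prometheus listens on its default port 9090):

# Prometheus built-in liveness endpoint
curl -s http://localhost:9090/-/healthy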

Downscale

Disks are not removed.

Summary

Disks should be removed after downscale.

Tested scheme: 1->3->2->0.

Kafka

Verification commands

systemctl status kafka
systemctl status zookeeper

# list brokers and check which are available
echo dump | nc localhost 2181
/opt/zookeeper-3.4.12/bin/zkCli.sh -server localhost:2181 ls /brokers/ids

# list topics
/opt/kafka_2.12-2.3.1/bin/kafka-topics.sh --zookeeper localhost:2181 --list

# create a topic
/opt/kafka_2.12-2.3.1/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic1

Creating a topic with a replication factor greater than the number of available brokers reveals how many brokers are actually available:

ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1. (kafka.admin.TopicCommand$)

Configuration files

  • /opt/kafka/config/server.properties
  • /opt/zookeeper/conf/zoo.cfg
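
To double-check the broker id observations below, each node's id can be inspected directly (a sketch based on the paths above):

# broker ids must be unique across nodes
grep '^broker.id' /opt/kafka/config/server.properties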

Results of 1->3->2 scale

  • Ok: broker ids in /opt/kafka/config/server.properties are unique

  • Ok: when topic is created on one node, it's available on all other nodes

  • Not ok: after scaling up from 1 to 3 nodes, all services are started, but only 2 Kafka broker ids are available, previously created topics are no longer available, and it is not possible to create a topic with a replication factor greater than 2

  • Not ok: after scaling down from 3 to 2 nodes, only 1 broker is available and the topics list differs between nodes:

    root@atsikham-scale-test-kafka-vm-1:~# /opt/kafka_2.12-2.3.1/bin/kafka-topics.sh --zookeeper localhost:2181 --list
    topic2
    
    root@atsikham-scale-test-kafka-vm-0:~# /opt/kafka_2.12-2.3.1/bin/kafka-topics.sh --zookeeper localhost:2181 --list
    topic1
    topic2
    
  • Not ok: disks are not removed after downscale

Summary

Scaling up/down processes do not work as expected. Disks should be removed after downscale.

PostgreSQL

Verification commands

ln -s /etc/postgresql/10/main/repmgr.conf /etc/repmgr.conf
sudo -i -u postgres
/usr/lib/postgresql/10/bin/repmgr cluster show
/usr/lib/postgresql/10/bin/repmgr cluster crosscheck
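
Streaming replication can also be verified straight from the primary, without repmgr (a sketch using the standard pg_stat_replication view):

# on the primary: one row per connected standby, state should be "streaming"
sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"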

Scheme: 1 (no replication) ->3->2->3->0

  • After scaling up from 1 node with no replication to 3 replicated nodes with the following configuration changes,

    ---
    kind: configuration/postgresql
    title: PostgreSQL
    name: default
    provider: azure
    specification:
      extensions:
        replication:
          enabled: true

1 node (vm-2) is not joined to the cluster:

[postgres@atsikham-scale-test-postgresql-vm-0 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster show
ID | Name                                | Role    | Status    | Upstream                            | Location | Connection string
----+-------------------------------------+---------+-----------+-------------------------------------+----------+-------------------------------------------------------------------
1  | atsikham-scale-test-postgresql-vm-0 | primary | * running |                                     | default  | host=10.1.6.4 user=epi_repmgr dbname=epi_repmgr connect_timeout=2
2  | atsikham-scale-test-postgresql-vm-1 | standby |   running | atsikham-scale-test-postgresql-vm-0 | default  | host=10.1.6.6 user=epi_repmgr dbname=epi_repmgr connect_timeout=2

[postgres@atsikham-scale-test-postgresql-vm-0 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster crosscheck
                               Name | Id |  1 |  2
------------------------------------+----+----+----
atsikham-scale-test-postgresql-vm-0 |  1 |  ? |  ?
atsikham-scale-test-postgresql-vm-1 |  2 |  ? |  ?


[postgres@atsikham-scale-test-postgresql-vm-2 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster show
ERROR: unable to retrieve node records
DETAIL: ERROR:  relation "repmgr.nodes" does not exist
LINE 1: ...le, un.node_name AS upstream_node_name       FROM repmgr.nod...
                                                           ^

[postgres@atsikham-scale-test-postgresql-vm-2 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster crosscheck
ERROR: unable to retrieve any node records
  • After 3->2 nodes downscale, everything is ok

  • Scaling 2->3 then fails (because disks were not removed in the previous step):

    ERROR cli.engine.terraform.TerraformCommand - Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="ConflictingUserInput" Message="Disk atsikham-scale-test-postgresql-vm-2-os-disk already exists in resource group ATSIKHAM-SCALE-TEST-RG. Only CreateOption.Attach is supported." Target="/subscriptions/2d60775f-932a-4cf6-b9f0-548a8b43b368/resourceGroups/atsikham-scale-test-rg/providers/Microsoft.Compute/disks/atsikham-scale-test-postgresql-vm-2-os-disk"
    
  • After manual disk removal and another 2->3 scale attempt there are still only 2 nodes, but the standby node is replaced:

    [postgres@atsikham-scale-test-postgresql-vm-0 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster crosscheck
                                 Name | Id |  1 |  2
    ------------------------------------+----+----+----
    atsikham-scale-test-postgresql-vm-0 |  1 |  ? |  ?
    atsikham-scale-test-postgresql-vm-2 |  2 |  ? |  ?
    [postgres@atsikham-scale-test-postgresql-vm-0 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster show
     ID | Name                                | Role    | Status    | Upstream                            | Location | Connection string
    ----+-------------------------------------+---------+-----------+-------------------------------------+----------+-------------------------------------------------------------------
     1  | atsikham-scale-test-postgresql-vm-0 | primary | * running |                                     | default  | host=10.1.6.4 user=epi_repmgr dbname=epi_repmgr connect_timeout=2
     2  | atsikham-scale-test-postgresql-vm-2 | standby |   running | atsikham-scale-test-postgresql-vm-0 | default  | host=10.1.6.5 user=epi_repmgr dbname=epi_repmgr connect_timeout=2
    

Summary

Scaling up/down processes do not work as expected. Disks should be removed after downscale;
the leftover-disk error reappears when scaling from 2 to 3 nodes.

Load balancer

Upscale

As expected, LBs are in place.
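
A basic liveness check (a sketch, assuming the HAProxy-based load balancer shipped with Epiphany):

# should print "active" on every load balancer VM
systemctl is-active haproxy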

Downscale

Disks are not removed.

Summary

No clustering involved, just node addition/removal. Disks need to be removed after downscale.
Tested scheme: 1->3->1->0.

RabbitMQ

Verification commands

# check cluster status
rabbitmq-diagnostics cluster_status

# create user
rabbitmqctl add_user testuser testpassword
rabbitmqctl set_user_tags testuser administrator
rabbitmqctl set_permissions -p / testuser ".*" ".*" ".*"

# create vhost
rabbitmqctl add_vhost Some_Virtual_Host
rabbitmqctl set_permissions -p Some_Virtual_Host guest ".*" ".*" ".*"

# create queue
rabbitmq-plugins enable rabbitmq_management
wget http://127.0.0.1:15672/cli/rabbitmqadmin
chmod +x rabbitmqadmin
./rabbitmqadmin declare queue --vhost=Some_Virtual_Host name=some_outgoing_queue durable=true

# create exchange
./rabbitmqadmin declare exchange --vhost=Some_Virtual_Host name=some_exchange type=direct

# create binding
./rabbitmqadmin --vhost=Some_Virtual_Host declare binding source=some_exchange destination_type=queue destination=some_outgoing_queue routing_key=some_routing_key

# check queues and vhosts count
rabbitmqctl status

# list
rabbitmqctl list_bindings --vhost "Some_Virtual_Host"
rabbitmqctl list_vhosts
rabbitmqctl list_queues --vhost "Some_Virtual_Host"

Scheme: 1 (disabled clustering) ->3->2->3->0

  • Ok: listing nodes/listeners.

  • Ok: data such as users/vhosts/exchanges/queues is not lost after the 3->2 downscale.

  • Ok: 2->3 nodes scale.

  • Not ok: no disks removed after scaling down.

  • Not ok: no previous data (vhosts/queues) is preserved when scaling from 1 node with clustering disabled to several nodes with clustering enabled. The following part was added to the config:

    ---
    kind: configuration/rabbitmq
    title: "RabbitMQ"
    name: default
    provider: azure
    specification:
      cluster:
        is_clustered: true
  • Not ok: after the 3->2 downscale, 3 disk nodes are still listed via rabbitmq-diagnostics cluster_status.

  • Not ok: scaling 2->3 after 3->2:

    ERROR cli.engine.terraform.TerraformCommand - Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="ConflictingUserInput" Message="Disk atsikham-scale-test-rabbitmq-vm-2-os-disk already exists in resource group ATSIKHAM-SCALE-TEST-RG. Only CreateOption.Attach is supported." Target="/subscriptions/2d60775f-932a-4cf6-b9f0-548a8b43b368/resourceGroups/atsikham-scale-test-rg/providers/Microsoft.Compute/disks/atsikham-scale-test-rabbitmq-vm-2-os-disk"
    

    After disk removal:

    ERROR cli.engine.ansible.AnsibleCommand - fatal: [atsikham-scale-test-rabbitmq-vm-2]: FAILED! => {"changed": true, "cmd": ["rabbitmqctl", "join_cluster", "rabbit@atsikham-scale-test-rabbitmq-vm-0"], "delta": "0:00:00.390728", "end": "2020-08-19 12:23:34.504568", "msg": "non-zero return code", "rc": 69, "start": "2020-08-19 12:23:34.113840", "stderr": "Error:\n{:inconsistent_cluster, 'Node \\'rabbit@atsikham-scale-test-rabbitmq-vm-0\\' thinks it\\'s clustered with node \\'rabbit@atsikham-scale-test-rabbitmq-vm-2\\', but \\'rabbit@atsikham-scale-test-rabbitmq-vm-2\\' disagrees'}", "stderr_lines": ["Error:", "{:inconsistent_cluster, 'Node \\'rabbit@atsikham-scale-test-rabbitmq-vm-0\\' thinks it\\'s clustered with node \\'rabbit@atsikham-scale-test-rabbitmq-vm-2\\', but \\'rabbit@atsikham-scale-test-rabbitmq-vm-2\\' disagrees'}"], "stdout": "Clustering node rabbit@atsikham-scale-test-rabbitmq-vm-2 with rabbit@atsikham-scale-test-rabbitmq-vm-0", "stdout_lines": ["Clustering node rabbit@atsikham-scale-test-rabbitmq-vm-2 with rabbit@atsikham-scale-test-rabbitmq-vm-0"]}
    

    After manual node removal (rabbitmqctl forget_cluster_node rabbit@atsikham-scale-test-rabbitmq-vm-2) and another attempt (full workaround sketched after this list):

    • nodes list is correct
    • topics/vhosts/users are not lost
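
    For the record, the full workaround in one place (a sketch; node name as in the error above, to be run on a node that is still in the cluster):

    # drop the stale cluster membership entry for the recreated node
    rabbitmqctl forget_cluster_node rabbit@atsikham-scale-test-rabbitmq-vm-2
    # afterwards, re-running epicli lets the fresh node join again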

Summary

Scaling a single node with clustering disabled up to several clustered nodes leads to data loss. Disks should be removed after downscale.

Ignite

Upscale

As expected:

  • ignite service is active on each node
  • /opt/ignite/config/default-config.xml on each node contains the correct IP address list (see the sketch below)
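
Those two checks can be scripted as follows (a sketch; the grep pattern assumes discovery addresses appear as <value> entries in the default TCP discovery section):

# the service must be active on every node
systemctl is-active ignite

# the discovery list should contain the addresses of all cluster nodes
grep '<value>' /opt/ignite/config/default-config.xml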

Downscale

Disks are not removed.

Summary

With scheme 1->3->2, everything works as expected except that disks are not removed after downscale.

Opendistro for elasticsearch

Verification commands

# check elasticsearch cluster information
curl -k -XGET https://10.1.10.4:9200 -u admin:admin
curl -k -XGET https://10.1.10.4:9200/_cat/indices?v -u admin:admin
curl -k -XGET https://10.1.10.4:9200/_cluster/health -u admin:admin
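
To tie the health output to the scaling scheme, the node count can be extracted directly (a sketch reusing the address and credentials above):

# number_of_nodes should match the expected VM count after scaling
curl -k -s "https://10.1.10.4:9200/_cluster/health?pretty" -u admin:admin | grep number_of_nodes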

Upscale

As expected:

  • elasticsearch and filebeat services are started
  • elasticsearch cluster health is ok, correct nodes number
  • indices are in place (auditlog and opendistro_security)

Downscale

Disks are not removed.

Single machine

Cannot be scaled up or deployed alongside other cluster types.

atsikham commented Aug 21, 2020

Tasks that were created:
#1574
#1575
#1576
#1577
#1578
#1579
#1580

The documentation part will be done in #1496.
