[Spike] Research and confirm which Epiphany components scale down correctly/incorrectly #1497

Closed
rafzei opened this issue Jul 28, 2020 · 4 comments

rafzei commented Jul 28, 2020

Is your feature request related to a problem? Please describe.
We don't really know which Epiphany components scale down correctly and which do not. It's not documented anywhere.

Describe the solution you'd like:
This spike should help us provide well-described CLUSTER documentation. We should also learn what has to be done to provide full autoscaling.
After the spike, we should create new issues to improve/implement autoscaling.

@rafzei rafzei added this to the S20200729 milestone Jul 28, 2020
@rafzei rafzei added the area/docs Area for documentation improvement and addition label Jul 29, 2020
@mkyc mkyc modified the milestones: S20200729, S20200813 Jul 29, 2020
mkyc commented Aug 11, 2020

related to #1051

@mkyc mkyc modified the milestones: S20200813, S20200827 Aug 13, 2020
@atsikham atsikham self-assigned this Aug 17, 2020
atsikham commented

Results are described in the next comment.

atsikham commented Aug 20, 2020

Scaling status

This document describes the results of upscale and downscale tests for the main Epiphany components.

Prerequisites

  • HEAD is 379fb2c6b2b826db047339e6107edb16e07bfa78
  • Azure platform

Kubernetes master

Upscale

Supported.

Downscale

Not supported; epicli fails with the following log entry:

ERROR epicli -ControlPlane downscale is not supported yet. Please revert your 'kubernetes_master' count to previous value or increase it to scale up kubernetes.

Summary

Need to add support for downscale.

Tested scheme: 1->3.

Kubernetes node

Upscale

Upscale works as expected; the node list and statuses were checked via kubectl.
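
That check is essentially the following (a minimal sketch; run from any host with the admin kubeconfig):

# every VM should appear in the list and report Ready after an upscale
kubectl get nodes -o wide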

Downscale

  • No drain step, just removal; draining is not implemented in the codebase, only described in the documentation (see the sketch below).
  • No disks are removed.
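
For comparison, a graceful removal would look roughly like this (a sketch only; <node-name> is a placeholder and the flags match kubectl versions of that time):

# cordon the node and evict workloads so they get rescheduled first
kubectl drain <node-name> --ignore-daemonsets --delete-local-data

# only then remove the node from the cluster
kubectl delete node <node-name>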

Summary

Need to add support for downscale.

Tested scheme: 1->3->0.

Logging

Upscale

As expected, but sometimes the number of retries is not enough for the kibana : Wait for kibana to be ready step.

Downscale

  • Disks are not removed.
  • The number of retries is often not enough for the kibana : Wait for kibana to be ready step.

Summary

The kibana : Wait for kibana to be ready step often fails after all retries.

See the Opendistro for elasticsearch section; the elasticsearch cluster itself runs as expected. Additionally it was verified that the kibana, filebeat and prometheus-node-exporter services are active.
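
That service check boils down to (a sketch; unit names as reported on the logging VMs):

# each unit should print "active"
systemctl is-active kibana filebeat prometheus-node-exporter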

As in other components, disks should be removed after downscale.

Tested scheme: 1->2->3->1->0.

Monitoring

Upscale

Supported; it was verified that the prometheus, grafana-server, prometheus-node-exporter and filebeat services are active.
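
A quick health probe can complement the service checks (a sketch, assuming Prometheus listens on its default port 9090):

# Prometheus built-in liveness endpoint
curl -s http://localhost:9090/-/healthy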

Downscale

Disks are not removed.

Summary

Disks should be removed after downscale.

Tested scheme: 1->3->2->0.

Kafka

Verification commands

systemctl status kafka
systemctl status zookeeper

# list brokers and check which are available
echo dump | nc localhost 2181
/opt/zookeeper-3.4.12/bin/zkCli.sh -server localhost:2181 ls /brokers/ids

# list topics
/opt/kafka_2.12-2.3.1/bin/kafka-topics.sh --zookeeper localhost:2181 --list

# create a topic
/opt/kafka_2.12-2.3.1/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic1

Creating a topic with a replication factor greater than the number of available brokers reveals how many brokers are actually available:

ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1. (kafka.admin.TopicCommand$)

Configuration files

  • /opt/kafka/config/server.properties
  • /opt/zookeeper/conf/zoo.cfg
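
To double-check the broker id observations below, each node's id can be inspected directly (a sketch based on the paths above):

# broker ids must be unique across nodes
grep '^broker.id' /opt/kafka/config/server.properties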

Results of 1->3->2 scale

  • Ok: broker ids in /opt/kafka/config/server.properties are unique

  • Ok: when topic is created on one node, it's available on all other nodes

  • Not ok: after scaling up from 1 to 3 nodes, all services are started, but only 2 Kafka broker ids are available, previously created topics are no longer available, and it is not possible to create a topic with a replication factor greater than 2

  • Not ok: after scaling down from 3 to 2 nodes, only 1 broker is available and the topics list differs between nodes:

    root@atsikham-scale-test-kafka-vm-1:~# /opt/kafka_2.12-2.3.1/bin/kafka-topics.sh --zookeeper localhost:2181 --list
    topic2
    
    root@atsikham-scale-test-kafka-vm-0:~# /opt/kafka_2.12-2.3.1/bin/kafka-topics.sh --zookeeper localhost:2181 --list
    topic1
    topic2
    
  • Not ok: disks are not removed after downscale

Summary

Scaling up/down processes do not work as expected. Disks should be removed after downscale.

PostgreSQL

Verification commands

ln -s /etc/postgresql/10/main/repmgr.conf /etc/repmgr.conf
sudo -i -u postgres
/usr/lib/postgresql/10/bin/repmgr cluster show
/usr/lib/postgresql/10/bin/repmgr cluster crosscheck
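
Streaming replication can also be verified straight from the primary, without repmgr (a sketch using the standard pg_stat_replication view):

# on the primary: one row per connected standby, state should be "streaming"
sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"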

Scheme: 1 (no replication) ->3->2->3->0

  • After scaling up from 1 node with no replication to 3 replicated nodes with the following configuration changes,

    ---
    kind: configuration/postgresql
    title: PostgreSQL
    name: default
    provider: azure
    specification:
      extensions:
        replication:
          enabled: true

1 node (vm-2) is not joined to the cluster:

[postgres@atsikham-scale-test-postgresql-vm-0 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster show
ID | Name                                | Role    | Status    | Upstream                            | Location | Connection string
----+-------------------------------------+---------+-----------+-------------------------------------+----------+-------------------------------------------------------------------
1  | atsikham-scale-test-postgresql-vm-0 | primary | * running |                                     | default  | host=10.1.6.4 user=epi_repmgr dbname=epi_repmgr connect_timeout=2
2  | atsikham-scale-test-postgresql-vm-1 | standby |   running | atsikham-scale-test-postgresql-vm-0 | default  | host=10.1.6.6 user=epi_repmgr dbname=epi_repmgr connect_timeout=2

[postgres@atsikham-scale-test-postgresql-vm-0 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster crosscheck
                               Name | Id |  1 |  2
------------------------------------+----+----+----
atsikham-scale-test-postgresql-vm-0 |  1 |  ? |  ?
atsikham-scale-test-postgresql-vm-1 |  2 |  ? |  ?


[postgres@atsikham-scale-test-postgresql-vm-2 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster show
ERROR: unable to retrieve node records
DETAIL: ERROR:  relation "repmgr.nodes" does not exist
LINE 1: ...le, un.node_name AS upstream_node_name       FROM repmgr.nod...
                                                           ^

[postgres@atsikham-scale-test-postgresql-vm-2 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster crosscheck
ERROR: unable to retrieve any node records
  • After 3->2 nodes downscale, everything is ok

  • Scaling 2->3 then fails (because disks were not removed in the previous step):

    ERROR cli.engine.terraform.TerraformCommand - Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="ConflictingUserInput" Message="Disk atsikham-scale-test-postgresql-vm-2-os-disk already exists in resource group ATSIKHAM-SCALE-TEST-RG. Only CreateOption.Attach is supported." Target="/subscriptions/2d60775f-932a-4cf6-b9f0-548a8b43b368/resourceGroups/atsikham-scale-test-rg/providers/Microsoft.Compute/disks/atsikham-scale-test-postgresql-vm-2-os-disk"
    
  • After manual disk removal and another 2->3 scale attempt there are still only 2 nodes, but the standby node is replaced:

    [postgres@atsikham-scale-test-postgresql-vm-0 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster crosscheck
                                 Name | Id |  1 |  2
    ------------------------------------+----+----+----
    atsikham-scale-test-postgresql-vm-0 |  1 |  ? |  ?
    atsikham-scale-test-postgresql-vm-2 |  2 |  ? |  ?
    [postgres@atsikham-scale-test-postgresql-vm-0 ~]$ /usr/lib/postgresql/10/bin/repmgr cluster show
     ID | Name                                | Role    | Status    | Upstream                            | Location | Connection string
    ----+-------------------------------------+---------+-----------+-------------------------------------+----------+-------------------------------------------------------------------
     1  | atsikham-scale-test-postgresql-vm-0 | primary | * running |                                     | default  | host=10.1.6.4 user=epi_repmgr dbname=epi_repmgr connect_timeout=2
     2  | atsikham-scale-test-postgresql-vm-2 | standby |   running | atsikham-scale-test-postgresql-vm-0 | default  | host=10.1.6.5 user=epi_repmgr dbname=epi_repmgr connect_timeout=2
    

Summary

Scaling up/down processes do not work as expected. Disks should be removed after downscale;
the leftover-disk error reappears when scaling from 2 to 3 nodes.

Load balancer

Upscale

As expected, LBs are in place.
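
A basic liveness check (a sketch, assuming the HAProxy-based load balancer shipped with Epiphany):

# should print "active" on every load balancer VM
systemctl is-active haproxy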

Downscale

Disks are not removed.

Summary

No clustering involved, just node addition/removal. Disks need to be removed after downscale.
Tested scheme: 1->3->1->0.

RabbitMQ

Verification commands

# check cluster status
rabbitmq-diagnostics cluster_status

# create user
rabbitmqctl add_user testuser testpassword
rabbitmqctl set_user_tags testuser administrator
rabbitmqctl set_permissions -p / testuser ".*" ".*" ".*"

# create vhost
rabbitmqctl add_vhost Some_Virtual_Host
rabbitmqctl set_permissions -p Some_Virtual_Host guest ".*" ".*" ".*"

# create queue
rabbitmq-plugins enable rabbitmq_management
wget http://127.0.0.1:15672/cli/rabbitmqadmin
chmod +x rabbitmqadmin
./rabbitmqadmin declare queue --vhost=Some_Virtual_Host name=some_outgoing_queue durable=true

# create exchange
./rabbitmqadmin declare exchange --vhost=Some_Virtual_Host name=some_exchange type=direct

# create binding
./rabbitmqadmin --vhost=Some_Virtual_Host declare binding source=some_exchange destination_type=queue destination=some_outgoing_queue routing_key=some_routing_key

# check queues and vhosts count
rabbitmqctl status

# list
rabbitmqctl list_bindings --vhost "Some_Virtual_Host"
rabbitmqctl list_vhosts
rabbitmqctl list_queues --vhost "Some_Virtual_Host"

Scheme: 1 (disabled clustering) ->3->2->3->0

  • Ok: listing nodes/listeners.

  • Ok: data such as users/vhosts/exchanges/queues is not lost after the 3->2 downscale.

  • Ok: 2->3 nodes scale.

  • Not ok: no disks removed after scaling down.

  • Not ok: no previous data (vhosts/queues) is preserved when scaling from 1 node with clustering disabled to several nodes with clustering enabled. The following part was added to the config:

    ---
    kind: configuration/rabbitmq
    title: "RabbitMQ"
    name: default
    provider: azure
    specification:
      cluster:
        is_clustered: true
  • Not ok: after the 3->2 downscale, 3 disk nodes are still listed via rabbitmq-diagnostics cluster_status.

  • Not ok: scaling 2->3 after 3->2:

    ERROR cli.engine.terraform.TerraformCommand - Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="ConflictingUserInput" Message="Disk atsikham-scale-test-rabbitmq-vm-2-os-disk already exists in resource group ATSIKHAM-SCALE-TEST-RG. Only CreateOption.Attach is supported." Target="/subscriptions/2d60775f-932a-4cf6-b9f0-548a8b43b368/resourceGroups/atsikham-scale-test-rg/providers/Microsoft.Compute/disks/atsikham-scale-test-rabbitmq-vm-2-os-disk"
    

    After disk removal:

    ERROR cli.engine.ansible.AnsibleCommand - fatal: [atsikham-scale-test-rabbitmq-vm-2]: FAILED! => {"changed": true, "cmd": ["rabbitmqctl", "join_cluster", "rabbit@atsikham-scale-test-rabbitmq-vm-0"], "delta": "0:00:00.390728", "end": "2020-08-19 12:23:34.504568", "msg": "non-zero return code", "rc": 69, "start": "2020-08-19 12:23:34.113840", "stderr": "Error:\n{:inconsistent_cluster, 'Node \\'rabbit@atsikham-scale-test-rabbitmq-vm-0\\' thinks it\\'s clustered with node \\'rabbit@atsikham-scale-test-rabbitmq-vm-2\\', but \\'rabbit@atsikham-scale-test-rabbitmq-vm-2\\' disagrees'}", "stderr_lines": ["Error:", "{:inconsistent_cluster, 'Node \\'rabbit@atsikham-scale-test-rabbitmq-vm-0\\' thinks it\\'s clustered with node \\'rabbit@atsikham-scale-test-rabbitmq-vm-2\\', but \\'rabbit@atsikham-scale-test-rabbitmq-vm-2\\' disagrees'}"], "stdout": "Clustering node rabbit@atsikham-scale-test-rabbitmq-vm-2 with rabbit@atsikham-scale-test-rabbitmq-vm-0", "stdout_lines": ["Clustering node rabbit@atsikham-scale-test-rabbitmq-vm-2 with rabbit@atsikham-scale-test-rabbitmq-vm-0"]}
    

    After manual node removal (rabbitmqctl forget_cluster_node rabbit@atsikham-scale-test-rabbitmq-vm-2) and another attempt (full workaround sketched after this list):

    • nodes list is correct
    • topics/vhosts/users are not lost
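
    For the record, the full workaround in one place (a sketch; node name as in the error above, to be run on a node that is still in the cluster):

    # drop the stale cluster membership entry for the recreated node
    rabbitmqctl forget_cluster_node rabbit@atsikham-scale-test-rabbitmq-vm-2
    # afterwards, re-running epicli lets the fresh node join again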

Summary

Scaling a single node with clustering disabled up to several clustered nodes leads to data loss. Disks should be removed after downscale.

Ignite

Upscale

As expected:

  • ignite service is active on each node
  • /opt/ignite/config/default-config.xml on each node contains the correct IP address list (see the sketch below)
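
Those two checks can be scripted as follows (a sketch; the grep pattern assumes discovery addresses appear as <value> entries in the default TCP discovery section):

# the service must be active on every node
systemctl is-active ignite

# the discovery list should contain the addresses of all cluster nodes
grep '<value>' /opt/ignite/config/default-config.xml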

Downscale

Disks are not removed.

Summary

With scheme 1->3->2, everything works as expected except that disks are not removed after downscale.

Opendistro for elasticsearch

Verification commands

# check elasticsearch cluster information
curl -k -XGET https://10.1.10.4:9200 -u admin:admin
curl -k -XGET https://10.1.10.4:9200/_cat/indices?v -u admin:admin
curl -k -XGET https://10.1.10.4:9200/_cluster/health -u admin:admin
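
To tie the health output to the scaling scheme, the node count can be extracted directly (a sketch reusing the address and credentials above):

# number_of_nodes should match the expected VM count after scaling
curl -k -s "https://10.1.10.4:9200/_cluster/health?pretty" -u admin:admin | grep number_of_nodes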

Upscale

As expected:

  • elasticsearch and filebeat services are started
  • elasticsearch cluster health is ok, correct nodes number
  • indices are in place (auditlog and opendistro_security)

Downscale

Disks are not removed.

Single machine

Cannot be scaled up or deployed alongside other cluster types.

atsikham commented Aug 21, 2020

Tasks that were created:
#1574
#1575
#1576
#1577
#1578
#1579
#1580

The documentation part will be done in #1496.
