Commit message:

* Draft design doc for backup feature
* Backup component design proposal
* Design doc - updated table of contents
* Update to design docs: Backups, Cache-storage, offline-upgrade
# Epiphany Platform backup design document

Affected version: 0.4.x

## Goals

Provide backup functionality for the Epiphany Platform - a cluster created using the epicli tool.

The backup will cover the following areas:
1. [Kubernetes cluster backup](#1.-Kubernetes-cluster-backup)

    1.1 etcd database

    1.2 kubeadm config

    1.3 certificates

    1.4 persistent volumes

    1.5 applications deployed on the cluster

2. [Kafka backup](#2.-Kafka-backup)

    2.1 Kafka topic data

    2.2 Kafka index

    2.3 Zookeeper settings and data

3. [Elastic stack backup](#3.-Elastic-stack-backup)

    3.1 Elasticsearch data

    3.2 Kibana settings

4. [Monitoring backup](#4.-Monitoring-backup)

    4.1 Prometheus data

    4.2 Prometheus settings (properties, targets)

    4.3 Alertmanager settings

    4.4 Grafana settings (datasources, dashboards)
5. [PostgreSQL backup](#5.-PostgreSQL-backup)

    5.1 All databases from the DB instance

6. [RabbitMQ settings and user data](#6.-RabbitMQ-settings-and-user-data)

7. [HAProxy settings backup](#7.-HAProxy-settings-backup)
## Use cases

A user, a background service, or a job is able to back up the whole cluster or only selected parts and store the files in a desired location.
There are a few possible options for storing the backup:
- S3
- Azure file storage
- local file
- NFS

The application/tool will create a metadata file that serves as the definition of the backup - information that can be useful for a restore tool. This metadata file will be stored within the backup file.
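A minimal sketch of what such a metadata file could look like, assuming a YAML format; every field name and version below is a hypothetical illustration, not a settled format:

```bash
# Hypothetical metadata file written next to the backed-up archives;
# all field names and values are assumptions for illustration only.
cat > backup_metadata.yml <<'EOF'
backup_created: 2019-10-01T12:00:00Z
epiphany_version: 0.4.x
components:
  - name: kubernetes
    version: 1.14.x
  - name: postgresql
    version: "10"
EOF
```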
The backup is packed into a zip/gz/tar.gz file that has a timestamp in its name. If a name collision occurs, `name+'_1'` will be used.
## Example use

```bash
epibackup -b /path/to/build/dir -t /target/location/for/backup
```

Where `-b` is the path to the build folder that contains the Ansible inventory and `-t` is the target path to store the backup.
## Backup Component View

![Epiphany backup component](backup_component.png)

A user, a background service, or a job executes the `epibackup` (code name) application. The application takes the following parameters:
- `-b`: build directory of the existing cluster. The most important content is the Ansible inventory residing in this directory - so it can be assumed that this should be the folder containing the Ansible inventory file.
- `-t`: target location of the zip/tar.gz file that will contain the backup files and the metadata file.

When executed, the tool looks for the inventory file in the `-b` location and executes the backup playbooks. All playbooks are optional; in the MVP version the tool can try to back up all components (if they exist in the inventory). After that, some components can be skipped (by providing an additional flag or CLI parameter).

The tool also produces a metadata file that describes the backup: its time, the backed-up components, and their versions.
## 1. Kubernetes cluster backup

There are a few ways of doing backups of an existing Kubernetes cluster. Two approaches are taken into further research.

**First**: Back up the etcd database and kubeadm config of a single master node. Instructions can be found [here](https://elastisys.com/2018/12/10/backup-kubernetes-how-and-why/). This simple solution backs up etcd, which contains all workload definitions and settings.
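A minimal sketch of the first approach, assuming a kubeadm-provisioned master node with the default certificate paths (all paths below are assumptions):

```bash
# Snapshot etcd (v3 API) using the cluster's own certificates.
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Preserve the kubeadm configuration and the cluster PKI material.
kubectl -n kube-system get configmap kubeadm-config -o yaml > /backup/kubeadm-config.yaml
cp -r /etc/kubernetes/pki /backup/pki
```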
**Second**: Use 3rd-party software to create a backup, like [Heptio Velero](https://velero.io/docs/v1.1.0/support-matrix/) - Apache 2.0 license, [Velero GitHub](https://github.com/vmware-tanzu/velero).
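For the second approach, creating a backup could look roughly like this (a hedged sketch; the backup name is a placeholder and the Velero server side is assumed to be already installed in the cluster):

```bash
# Create a cluster backup covering all namespaces, then list backups and their status.
velero backup create full-cluster-backup --include-namespaces '*'
velero backup get
```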
## 2. Kafka backup

Possible options for backing up Kafka broker data and indexes:

1. Mirroring using [Kafka MirrorMaker](https://kafka.apache.org/documentation/). This requires a second, independently running Kafka cluster that replicates all data (including current offsets and consumer groups). It is used mostly for multi-cloud replication (a usage sketch is shown after this list).
2. Kafka Connect - use Kafka Connect to get all topic and offset data from Kafka and save it to a filesystem (NFS, local, S3, ...) using a sink connector.

    2.1 [Confluent Kafka connector](https://github.com/confluentinc/kafka-connect-storage-common) - uses the Confluent Community License Agreement

    2.2 Use another open-source connector like [kafka-connect-s3](https://github.com/spredfast/kafka-connect-s3) (BSD) or [kafka-backup](https://github.com/itadventurer/kafka-backup) (Apache 2.0)

3. File system copy: take the Kafka broker and ZooKeeper data stored in files and copy it to the backup location. This requires the Kafka broker to be stopped. The solution is described in a DigitalOcean [post](https://www.digitalocean.com/community/tutorials/how-to-back-up-import-and-migrate-your-apache-kafka-data-on-ubuntu-18-04).
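As a rough illustration of option 1 (the properties file names are placeholders), MirrorMaker is started with a consumer config pointing at the source cluster and a producer config pointing at the backup cluster:

```bash
# Replicate every topic from the source cluster to the backup cluster.
kafka-mirror-maker.sh \
  --consumer.config source-cluster-consumer.properties \
  --producer.config backup-cluster-producer.properties \
  --whitelist '.*'
```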
## 3. Elastic stack backup

Use the built-in snapshot features of Elasticsearch to create a backup, e.g.:

```REST
PUT /_snapshot/my_unverified_backup?verify=false
{
  "type": "fs",
  "settings": {
    "location": "my_unverified_backup_location"
  }
}
```
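Once the repository is registered as above, an actual snapshot can be taken with a single request, for example (the snapshot name is a placeholder):

```bash
# Create a snapshot in the repository registered above and wait for it to finish.
curl -XPUT 'http://localhost:9200/_snapshot/my_unverified_backup/snapshot_1?wait_for_completion=true'
```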
More information can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/6.5/modules-snapshots.html).

OpenDistro uses a similar way of doing backups, so it should be compatible: [OpenDistro backups link](https://opendistro.github.io/for-elasticsearch-docs/docs/elasticsearch/snapshot-restore/).
## 4. Monitoring backup

Prometheus, from version 2.1, is able to create a data snapshot via an HTTP request (note that the TSDB admin API has to be enabled, e.g. with the `--web.enable-admin-api` flag):

```bash
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
```

The snapshot will be created in `<data-dir>/snapshots/SNAPSHOT-NAME-RETURNED-IN-RESPONSE`.
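The returned snapshot name can then be used to archive the data, for example (the data directory below is an assumption and depends on the installation):

```bash
# The snapshot endpoint returns JSON like:
# {"status":"success","data":{"name":"20191001T120000Z-6be2b76f94a9a2f4"}}
SNAPSHOT=$(curl -s -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot \
  | sed -e 's/.*"name":"\([^"]*\)".*/\1/')
tar -czf /backup/prometheus-snapshot.tar.gz -C /var/lib/prometheus/snapshots "$SNAPSHOT"
```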
[More info](https://prometheus.io/docs/prometheus/2.1/querying/api/#snapshot)

Files like targets and Prometheus/Alertmanager settings should also be copied to the backup location.
## 5. PostgreSQL backup

Relational DB backup mechanisms are the most mature ones. The simplest solution is to use the [standard PostgreSQL backup functions](https://www.postgresql.org/docs/10/backup.html). Using [pg_dump](https://www.postgresql.org/docs/current/app-pgdump.html) is also a valid option.
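A minimal sketch (paths and database names are placeholders):

```bash
# Dump all databases plus roles and tablespace definitions from the instance.
sudo -u postgres pg_dumpall | gzip > /backup/postgresql-all-$(date +%Y%m%d%H%M%S).sql.gz

# Or dump a single database in the custom format, which supports selective restore via pg_restore.
sudo -u postgres pg_dump --format=custom --file=/backup/mydb.dump mydb
```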
## 6. RabbitMQ settings and user data

RabbitMQ has a [standard way of creating backups](https://www.rabbitmq.com/backup.html).
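Definitions (users, vhosts, queues, exchanges, bindings) can be exported through the management plugin's HTTP API, assuming the plugin is enabled (the credentials below are placeholders):

```bash
# Export broker definitions as JSON; message payloads need a separate data-directory backup.
curl -u admin:password http://localhost:15672/api/definitions > /backup/rabbitmq-definitions.json
```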
## 7. HAProxy settings backup

Copy the HAProxy configuration files to the backup location.
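A trivial sketch, assuming the default configuration path:

```bash
# HAProxy keeps its configuration in plain files, so a copy is sufficient.
cp -r /etc/haproxy /backup/haproxy
```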
# Epiphany Platform cache storage design document

Affected version: 0.4.x

## Goals

Provide in-memory cache storage that will be capable of storing large amounts of data with high performance.

## Use cases

The platform should provide cache storage for key-value data and for the latest values taken from a queue (Kafka).
## Architectural decision

Considered options are:
- Apache Ignite
- Redis

| Description | Apache Ignite | Redis |
| --- | --- | --- |
| License | Apache 2.0 | Three-clause BSD license |
| Partition method | Sharding | Sharding |
| Replication | Yes | Master-slave: yes; master-master: only in the enterprise version |
| Transaction concept | ACID | Optimistic locking |
| Data grid | Yes | N/A |
| In-memory DB | Distributed key-value store, in-memory distributed SQL database | Key-value store |
| Integration with RDBMS | Can integrate with any relational DB that supports a JDBC driver (Oracle, PostgreSQL, Microsoft SQL Server, MySQL) | Possible using 3rd-party software |
| Integration with Kafka | Possible to insert into the cache using a `Streamer` (Kafka Streamer, MQTT Streamer, ...) | Requires a 3rd-party service |
| Machine learning | Apache Ignite Machine Learning - tools for building predictive ML models | N/A |
Based on the above, Apache Ignite is not just a scalable in-memory cache/database but a cache and processing platform that can run transactional, analytical, and streaming workloads. While Redis is simpler, Apache Ignite offers a lot more features under the Apache 2.0 license.

Choice: **Apache Ignite**
## Design proposal

[MVP] Add an Ansible role to `epicli` that installs Apache Ignite and sets up a cluster if there is more than one instance. The Ansible playbook is also responsible for adding more nodes to an existing cluster (scaling).

Possible problems while implementing Ignite clustering:
- Ignite uses multicast for node discovery, which is not supported on AWS. The Ignite distribution comes with `TcpDiscoveryS3IpFinder`, so S3-based discovery can be used.

To consider:
- Deploy the Apache Ignite cluster in Kubernetes
# Epiphany Platform offline upgrade design document

Affected version: 0.4.x

## Goals

Provide upgrade functionality for the Epiphany Platform so that Kubernetes and other components can be upgraded when working offline.

## Use cases

The platform should be upgradeable when there is no internet connection. This requires all packages and dependencies to be downloaded on a machine that has an internet connection and then moved to the air-gapped server.
## Example use

```bash
epiupgrade -b /path/to/build/dir
```

Where `-b` is the path to the build folder that contains the Ansible inventory.
## Design proposal

The MVP of the upgrade function will contain the Kubernetes upgrade procedure to the latest supported version of Kubernetes. Later it will be extended to all other Epiphany Platform components.

![Epiphany offline upgrade app](epiphany-offline-upgrade.png)

The `epiupgrade` application or module takes the build path location (the directory path that contains the Ansible inventory file).

The first part of the upgrade execution is to download/upload packages to the repository so that new packages exist and are ready for the upgrade process.
When the repository module finishes its work, the upgrade Ansible playbooks are executed.

The upgrade application/module shall implement the following functions (see the usage sketch after this list):
- [MVP] `apply`: executes the upgrade
- `--plan`: makes no changes to the cluster - it returns the list of changes that would be made during upgrade execution
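A hypothetical usage sketch of the planned interface (the exact flag spelling may change during implementation):

```bash
# Preview the changes without touching the cluster...
epiupgrade -b /path/to/build/dir --plan

# ...then apply the upgrade for real.
epiupgrade -b /path/to/build/dir
```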