Design docs (hitachienergy#554)
* Draft design doc for backup feature

* Backup component design proposal

* Design doc - updated table of contents

* Update to design docs: Backups, Cache-storage, offline-upgrade
toszo authored and erzetpe committed Oct 11, 2019
1 parent efa5c90 commit 6f9e011
Showing 5 changed files with 231 additions and 0 deletions.
3 changes: 3 additions & 0 deletions docs/design-docs/backup/backup_component.png
148 changes: 148 additions & 0 deletions docs/design-docs/backup/backups.md
# Epiphany Platform backup design document

Affected version: 0.4.x

## Goals

Provide backup functionality for the Epiphany Platform - a cluster created using the epicli tool.

Backup will cover the following areas:

1. [Kubernetes cluster backup](#1-kubernetes-cluster-backup)

    1.1 etcd database

    1.2 kubeadm config

    1.3 certificates

    1.4 persistent volumes

    1.5 applications deployed on the cluster

2. [Kafka backup](#2-kafka-backup)

    2.1 Kafka topic data

    2.2 Kafka index

    2.3 Zookeeper settings and data

3. [Elastic stack backup](#3-elastic-stack-backup)

    3.1 Elasticsearch data

    3.2 Kibana settings

4. [Monitoring backup](#4-monitoring-backup)

    4.1 Prometheus data

    4.2 Prometheus settings (properties, targets)

    4.3 Alertmanager settings

    4.4 Grafana settings (datasources, dashboards)

5. [PostgreSQL backup](#5-postgresql-backup)

    5.1 all databases from the instance

6. [RabbitMQ settings and user data](#6-rabbitmq-settings-and-user-data)

7. [HAProxy settings backup](#7-haproxy-settings-backup)

## Use cases

A user, background service, or job is able to back up the whole cluster or only selected parts and store the files in a desired location.
There are a few options for storing the backup:
- S3
- Azure file storage
- local file
- NFS

The application/tool will create a metadata file that defines the backup - information that can be useful for the restore tool. This metadata file will be stored within the backup file.

The backup is packed into a zip/gz/tar.gz file that has a timestamp in its name. If a name collision occurs, `name+'_1'` will be used.

## Example use

```bash
epibackup -b /path/to/build/dir -t /target/location/for/backup
```

Where `-b` is the path to the build folder that contains the Ansible inventory and `-t` is the target path where the backup will be stored.

## Backup Component View

![Epiphany backup component](backup_component.png)

A user, background service, or job executes the `epibackup` (code name) application. The application takes the following parameters:
- `-b`: build directory of the existing cluster. Its most important content is the Ansible inventory, so this can be assumed to be the folder containing the Ansible inventory file.
- `-t`: target location of the zip/tar.gz file that will contain the backup files and the metadata file.

When executed, the tool looks for the inventory file in the `-b` location and runs the backup playbooks. All playbooks are optional; in the MVP version the tool can try to back up all components (if they exist in the inventory). Later, selected components can be skipped (by providing an additional flag or CLI parameter).

The tool also produces a metadata file that describes the backup: its time, the backed-up components, and their versions.

## 1. Kubernetes cluster backup

There are a few ways of backing up an existing Kubernetes cluster. Two approaches are taken into further research.

**First**: Back up the etcd database and the kubeadm config of a single master node. Instructions can be found [here](https://elastisys.com/2018/12/10/backup-kubernetes-how-and-why/). This simple solution backs up etcd, which contains all workload definitions and settings.
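
A minimal sketch of the first approach, assuming a default kubeadm installation (the certificate paths, `kubeadm-config` ConfigMap name, and backup destination below follow kubeadm defaults and are illustrative):

```bash
# Snapshot etcd using the v3 API (certificate paths assume kubeadm defaults)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Back up certificates and the kubeadm configuration as well
cp -r /etc/kubernetes/pki /backup/pki
kubectl -n kube-system get configmap kubeadm-config -o yaml > /backup/kubeadm-config.yaml
```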

**Second**: Use 3rd-party software to create the backup, like [Heptio Velero](https://velero.io/docs/v1.1.0/support-matrix/) (Apache 2.0 license, [Velero GitHub](https://github.com/vmware-tanzu/velero)).
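
With the second approach the flow could look as follows (a sketch; it assumes the Velero server is already installed in the cluster with a configured storage provider, and the backup name is illustrative):

```bash
# Create a cluster backup (all namespaces by default) and verify it
velero backup create epiphany-full-backup
velero backup get
```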

## 2. Kafka backup

Possible options for backing up Kafka broker data and indexes:
1. Mirror using [Kafka MirrorMaker](https://kafka.apache.org/documentation/). It requires a second, independently running Kafka cluster that replicates all data (including current offsets and consumer groups). It is used mostly for multi-cloud replication.
2. Kafka Connect - use Kafka Connect with a Sink connector to read all topic and offset data from Kafka and save it to a filesystem (NFS, local, S3, ...).

   2.1 [Confluent Kafka connector](https://github.com/confluentinc/kafka-connect-storage-common) - licensed under the Confluent Community License Agreement

   2.2 Another open-source connector like [kafka-connect-s3](https://github.com/spredfast/kafka-connect-s3) (BSD) or [kafka-backup](https://github.com/itadventurer/kafka-backup) (Apache 2.0)

3. File system copy: take the Kafka broker and ZooKeeper data stored in files and copy it to the backup location, as sketched below. It requires the Kafka broker to be stopped. The solution is described in a Digital Ocean [post](https://www.digitalocean.com/community/tutorials/how-to-back-up-import-and-migrate-your-apache-kafka-data-on-ubuntu-18-04).
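
A sketch of option 3 (service names and data directories are assumptions - the actual Kafka data location is defined by `log.dirs` in `server.properties`):

```bash
# Stop the broker so the data files are consistent
systemctl stop kafka

# Archive Kafka and ZooKeeper data with a timestamp in the name
tar -czf "/backup/kafka-data-$(date +%Y%m%d%H%M%S).tar.gz" /var/lib/kafka
tar -czf "/backup/zookeeper-data-$(date +%Y%m%d%H%M%S).tar.gz" /var/lib/zookeeper

systemctl start kafka
```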

## 3. Elastic stack backup

Use the built-in snapshot feature of Elasticsearch, e.g. to register a snapshot repository:

```REST
PUT /_snapshot/my_unverified_backup?verify=false
{
  "type": "fs",
  "settings": {
    "location": "my_unverified_backup_location"
  }
}
```
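
Note that an `fs` repository location must be registered under `path.repo` in `elasticsearch.yml` on all nodes. Once the repository is registered, a snapshot can be created (the snapshot name is illustrative):

```bash
curl -XPUT 'http://localhost:9200/_snapshot/my_unverified_backup/snapshot_1?wait_for_completion=true'
```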

More information can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/6.5/modules-snapshots.html).

OpenDistro uses a similar way of doing backups, so it should be compatible. [OpenDistro backups link](https://opendistro.github.io/for-elasticsearch-docs/docs/elasticsearch/snapshot-restore/).

## 4. Monitoring backup

From version 2.1, Prometheus is able to create a data snapshot via an HTTP request:

```bash
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
```
The snapshot will be created in `<data-dir>/snapshots/SNAPSHOT-NAME-RETURNED-IN-RESPONSE`.
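
Note that the TSDB admin API is disabled by default and requires Prometheus to be started with the `--web.enable-admin-api` flag. A sketch of copying the snapshot to the backup location (the data directory path is an assumption):

```bash
# Trigger the snapshot and extract its name from the JSON response
SNAPSHOT=$(curl -s -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')

# Copy the snapshot out of the Prometheus data directory
cp -r "/var/lib/prometheus/snapshots/${SNAPSHOT}" /backup/prometheus/
```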

[More info](https://prometheus.io/docs/prometheus/2.1/querying/api/#snapshot)

Files like targets and Prometheus/Alertmanager settings should also be copied to the backup location.

## 5. PostgreSQL backup

Relational DB backup mechanisms are the most mature ones. The simplest solution is to use the [standard PostgreSQL backup functions](https://www.postgresql.org/docs/10/backup.html). Using [pg_dump](https://www.postgresql.org/docs/current/app-pgdump.html) is also a valid option.
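
A minimal sketch covering 5.1 (the database name and the `postgres` superuser are illustrative):

```bash
# Dump all databases, roles, and tablespaces from the instance
pg_dumpall -U postgres -f /backup/postgresql/all_databases.sql

# Alternatively, dump a single database in the custom format for pg_restore
pg_dump -U postgres -Fc -f /backup/postgresql/epiphany_db.dump epiphany_db
```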

## 6. RabbitMQ settings and user data

RabbitMQ has a [standard way of creating backups](https://www.rabbitmq.com/backup.html).
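
A sketch of that procedure (`rabbitmqctl export_definitions` is available from RabbitMQ 3.8.2; on older versions the management plugin can export the same JSON; the data directory path is an assumption):

```bash
# Export topology: users, vhosts, permissions, queues, exchanges, bindings
rabbitmqctl export_definitions /backup/rabbitmq/definitions.json

# Message data lives in the node data directory
tar -czf /backup/rabbitmq/mnesia.tar.gz /var/lib/rabbitmq/mnesia
```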

## 7. HAProxy settings backup

Copy the HAProxy configuration files to the backup location.
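
For example (assuming the default configuration location):

```bash
cp -a /etc/haproxy "/backup/haproxy-$(date +%Y%m%d%H%M%S)"
```
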
43 changes: 43 additions & 0 deletions docs/design-docs/cache-storage/cache-storage.md
# Epiphany Platform cache storage design document

Affected version: 0.4.x

## Goals

Provide in-memory cache storage that is capable of storing large amounts of data with high performance.

## Use cases

The platform should provide cache storage for key-value stores and for the latest values taken from a queue (Kafka).

## Architectural decision

Considered options are:
- Apache Ignite
- Redis

| Description | Apache Ignite | Redis |
| --- | --- | --- |
| License | Apache 2.0 | Three-clause BSD license |
| Partition method | Sharding | Sharding |
| Replication | Yes | Master-slave: yes; master-master: only in the enterprise version |
| Transaction concept | ACID | Optimistic lock |
| Data Grid | Yes | N/A |
| In-memory DB | Distributed key-value store, in-memory distributed SQL database | Key-value store |
| Integration with RDBMS | Can integrate with any relational DB that supports a JDBC driver (Oracle, PostgreSQL, Microsoft SQL Server, MySQL) | Possible using 3rd-party software |
| Integration with Kafka | Inserting into the cache is possible using a `Streamer` (Kafka Streamer, MQTT Streamer, ...) | Requires a 3rd-party service |
| Machine learning | Apache Ignite Machine Learning: tools for building predictive ML models | N/A |

Based on the above, Apache Ignite is not just a scalable in-memory cache/database but a caching and processing platform that can run transactional, analytical, and streaming workloads. While Redis is simpler, Apache Ignite offers many more features under the Apache 2.0 license.

Choice: **Apache Ignite**

## Design proposal

[MVP] Add an Ansible role to `epicli` that installs Apache Ignite and sets up a cluster if there is more than one instance. The Ansible playbook is also responsible for adding more nodes to an existing cluster (scaling).

Possible problems while implementing Ignite clustering:
- Ignite uses multicast for node discovery, which is not supported on AWS. The Ignite distribution comes with `TcpDiscoveryS3IpFinder`, so S3-based discovery can be used.

To consider:
- Deploy Apache Ignite cluster in Kubernetes
3 changes: 3 additions & 0 deletions docs/design-docs/offline-upgrade/epiphany-offline-upgrade.png
34 changes: 34 additions & 0 deletions docs/design-docs/offline-upgrade/offline-upgrade.md
# Epiphany Platform offline upgrade design document

Affected version: 0.4.x

## Goals

Provide upgrade functionality for the Epiphany Platform so that Kubernetes and other components can be upgraded when working offline.

## Use cases

The platform should be upgradeable when there is no internet connection. This requires all packages and dependencies to be downloaded on a machine that has internet access and then moved to the air-gapped server.

## Example use

```bash
epiupgrade -b /path/to/build/dir
```

Where `-b` is the path to the build folder that contains the Ansible inventory.

## Design proposal

The MVP of the upgrade function will contain the Kubernetes upgrade procedure to the latest supported version of Kubernetes. Later it will be extended to all other Epiphany Platform components.

![Epiphany offline upgrade app](epiphany-offline-upgrade.png)

The `epiupgrade` application or module takes the build path location (the directory path that contains the Ansible inventory file).

The first part of upgrade execution is to download/upload packages to the repository so that the new packages exist and are ready for the upgrade process.
When the repository module finishes its work, the upgrade Ansible playbooks are executed.

The upgrade application/module shall implement the following functions:
- [MVP] `apply` - executes the upgrade
- `--plan` - makes no changes to the cluster; returns the list of changes that would be made during upgrade execution
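
A sketch of the proposed interface (the exact syntax is to be defined during implementation):

```bash
epiupgrade -b /path/to/build/dir apply          # [MVP] executes the upgrade
epiupgrade -b /path/to/build/dir apply --plan   # no changes made; prints the planned changes
```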
